|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6 |
|
| #
75cb1cca |
| 08-May-2025 |
Barry Song <[email protected]> |
mm: userfaultfd: correct dirty flags set for both present and swap pte
As David pointed out, what truly matters for mremap and userfaultfd move operations is the soft dirty bit. The current comment and implementation—which always sets the dirty bit for present PTEs and fails to set the soft dirty bit for swap PTEs—are incorrect. This could break features like Checkpoint-Restore in Userspace (CRIU).
This patch updates the behavior to correctly set the soft dirty bit for both present and swap PTEs in accordance with mremap.
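A rough sketch of the corrected behaviour described above (variable names are illustrative; this is not the literal patch): the soft-dirty bit is carried across the move for both PTE flavours, using the swap-PTE variants of the helpers where needed.

    if (pte_present(orig_pte)) {
            if (pte_soft_dirty(orig_pte))
                    pte = pte_mksoft_dirty(pte);
    } else {
            /* swap PTE: the soft-dirty bit lives in different bits, so use
               the pte_swp_* helpers */
            if (pte_swp_soft_dirty(orig_pte))
                    pte = pte_swp_mksoft_dirty(pte);
    }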
Link: https://lkml.kernel.org/r/[email protected] Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Barry Song <[email protected]> Reported-by: David Hildenbrand <[email protected]> Closes: https://lore.kernel.org/linux-mm/[email protected]/ Acked-by: Peter Xu <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14 |
|
| #
41e6ddca |
| 21-Mar-2025 |
Lorenzo Stoakes <[email protected]> |
mm/vma: add give_up_on_oom option on modify/merge, use in uffd release
Currently, if a VMA merge fails due to an OOM condition arising on commit merge or a failure to duplicate anon_vma's, we report this so the caller can handle it.
However there are cases where the caller is only ostensibly trying a merge, and doesn't mind if it fails due to this condition.
Since we do not want to introduce an implicit assumption that we only actually modify VMAs after OOM conditions might arise, add a 'give up on oom' option and make an explicit contract that, should this flag be set, we absolutely will not modify any VMAs should OOM arise and just bail out.
Since it'd be very unusual for a user to try to vma_modify() with this flag set but be specifying a range within a VMA which ends up being split (which can fail due to rlimit issues, not only OOM), we add a debug warning for this condition.
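A hypothetical caller sketch of the contract (the flag name comes from this description; the call shown is illustrative only, not the kernel's exact interface):

    vmg.give_up_on_oom = true;          /* never report merge-time OOM back to us */
    vma = try_merge_or_modify(&vmg);    /* hypothetical call */
    if (!vma) {
            /*
             * With the flag set, an OOM during the attempted merge leaves
             * every VMA unmodified; the caller simply proceeds without the
             * merge instead of handling an error it cannot recover from.
             */
    }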
The motivating reason for this is uffd release - syzkaller (and Pedro Falcato's VERY astute analysis) found a way in which an injected fault on allocation, triggering an OOM condition on commit merge, would result in uffd code becoming confused and treating an error value as if it were a VMA pointer.
To avoid this, we make use of this new VMG flag to ensure that this never occurs, utilising the fact that, should we be clearing entire VMAs, we do not wish an OOM event to be reported to us.
Many thanks to Pedro Falcato for his excellent analysis and Jann Horn for his insightful and intelligent analysis of the situation, both of whom were instrumental in this fix.
Link: https://lkml.kernel.org/r/[email protected] Reported-by: [email protected] Closes: https://lore.kernel.org/all/[email protected]/ Fixes: 47b16d0462a4 ("mm: abort vma_modify() on merge out of memory failure") Signed-off-by: Lorenzo Stoakes <[email protected]> Suggested-by: Pedro Falcato <[email protected]> Suggested-by: Jann Horn <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3 |
|
| #
7440adb4 |
| 13-Feb-2025 |
Suren Baghdasaryan <[email protected]> |
mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
With the upcoming replacement of vm_lock with vm_refcnt, we need to handle the possibility of vma_start_read_locked/vma_start_read_locked_nested failing due to refcount overflow. Prepare for such a possibility by changing these APIs and adjusting their users.
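A sketch of the adjusted calling convention, assuming the helper now returns a success/failure indication (illustrative; the exact signature is whatever the patch defines):

    if (!vma_start_read_locked(vma)) {
            /* e.g. refcount overflow: fall back or ask the caller to retry */
            return -EAGAIN;
    }
    /* ... operate on the read-locked VMA ... */
    vma_end_read(vma);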
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Cc: Lokesh Gidra <[email protected]> Tested-by: Shivank Garg <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Reviewed-by: Vlastimil Babka <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Klara Modin <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: "Paul E . McKenney" <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
b2ae5fcc |
| 13-Feb-2025 |
Suren Baghdasaryan <[email protected]> |
mm: introduce vma_start_read_locked{_nested} helpers
Patch series "reimplement per-vma lock as a refcount", v10.
Back when per-vma locks were introduces, vm_lock was moved out of vm_area_struct in
mm: introduce vma_start_read_locked{_nested} helpers
Patch series "reimplement per-vma lock as a refcount", v10.
Back when per-vma locks were introduced, vm_lock was moved out of vm_area_struct in [1] because of the performance regression caused by false cacheline sharing. Recent investigation [2] revealed that the regression is limited to a rather old Broadwell microarchitecture and even there it can be mitigated by disabling adjacent cacheline prefetching, see [3].
Splitting single logical structure into multiple ones leads to more complicated management, extra pointer dereferences and overall less maintainable code. When that split-away part is a lock, it complicates things even further. With no performance benefits, there are no reasons for this split. Merging the vm_lock back into vm_area_struct also allows vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
This patchset:
1. moves vm_lock back into vm_area_struct, aligning it at the cacheline boundary and changing the cache to be cacheline-aligned to minimize cacheline sharing;
2. changes vm_area_struct initialization to mark new vma as detached until it is inserted into vma tree;
3. replaces vm_lock and vma->detached flag with a reference counter;
4. regroups vm_area_struct members to fit them into 3 cachelines;
5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their reuse and to minimize call_rcu() calls.
Pagefault microbenchmarks show performance improvement:
Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
This patch (of 18):
Introduce helper functions which can be used to read-lock a VMA when holding mmap_lock for read. Replace direct accesses to vma->vm_lock with these new helpers.
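Roughly, the helpers look like the following sketch (simplified and illustrative; since mmap_lock is already held for read, the acquisition cannot fail the way vma_start_read() can):

    static inline void vma_start_read_locked(struct vm_area_struct *vma)
    {
            mmap_assert_locked(vma->vm_mm);
            down_read(&vma->vm_lock->lock);
    }

    static inline void vma_start_read_locked_nested(struct vm_area_struct *vma,
                                                    int subclass)
    {
            mmap_assert_locked(vma->vm_mm);
            down_read_nested(&vma->vm_lock->lock, subclass);
    }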
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Tested-by: Shivank Garg <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Klara Modin <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: "Paul E . McKenney" <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
927e926d |
| 26-Feb-2025 |
Suren Baghdasaryan <[email protected]> |
userfaultfd: fix PTE unmapping stack-allocated PTE copies
Current implementation of move_pages_pte() copies source and destination PTEs in order to detect concurrent changes to PTEs involved in the move. However these copies are also used to unmap the PTEs, which will fail if CONFIG_HIGHPTE is enabled because the copies are allocated on the stack. Fix this by using the actual PTEs which were kmap()ed.
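Illustrative shape of the fix (simplified; the variable names follow the usual move_pages_pte() convention and are assumptions here): unmap the kmap()ed PTE pointers, not the on-stack copies.

    orig_src_pte = ptep_get(src_pte);   /* stack copy, only for pte_same() checks */
    /* ... */
    pte_unmap(src_pte);                 /* correct: the mapped PTE */
    /* not pte_unmap(&orig_src_pte): that is a stack address, which breaks
       with CONFIG_HIGHPTE where PTE pages can live in highmem */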
Link: https://lkml.kernel.org/r/[email protected] Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Suren Baghdasaryan <[email protected]> Reported-by: Peter Xu <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Barry Song <[email protected]> Cc: Barry Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
37b338ee |
| 26-Feb-2025 |
Suren Baghdasaryan <[email protected]> |
userfaultfd: do not block on locking a large folio with raised refcount
Lokesh recently raised an issue about UFFDIO_MOVE getting into a deadlock state when it goes into split_folio() with raised folio refcount. split_folio() expects the reference count to be exactly mapcount + num_pages_in_folio + 1 (see can_split_folio()) and fails with EAGAIN otherwise.
If multiple processes are trying to move the same large folio, they raise the refcount (all tasks succeed in that) then one of them succeeds in locking the folio, while others will block in folio_lock() while keeping the refcount raised. The winner of this race will proceed with calling split_folio() and will fail returning EAGAIN to the caller and unlocking the folio. The next competing process will get the folio locked and will go through the same flow. In the meantime the original winner will be retried and will block in folio_lock(), getting into the queue of waiting processes only to repeat the same path. All this results in a livelock.
An easy fix would be to avoid waiting for the folio lock while holding folio refcount, similar to madvise_free_huge_pmd() where folio lock is acquired before raising the folio refcount. Since we lock and take a refcount of the folio while holding the PTE lock, changing the order of these operations should not break anything.
Modify move_pages_pte() to try locking the folio first and if that fails and the folio is large then return EAGAIN without touching the folio refcount. If the folio is single-page then split_folio() is not called, so we don't have this issue. Lokesh has a reproducer [1] and I verified that this change fixes the issue.
[1] https://github.com/lokeshgidra/uffd_move_ioctl_deadlock
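A minimal sketch of the new ordering (illustrative and simplified; in the real code any sleeping lock is only taken after the PTE lock has been dropped):

    if (!folio_trylock(src_folio)) {
            /* Don't wait for the lock while holding an extra reference on a
               large folio - that is what livelocks split_folio() callers. */
            if (folio_test_large(src_folio))
                    return -EAGAIN;     /* caller retries; refcount untouched */
            folio_lock(src_folio);      /* small folio: no split_folio() needed,
                                           so waiting here is harmless */
    }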
[[email protected]: reflow comment to 80 cols, s/end/end up/] Link: https://lkml.kernel.org/r/[email protected] Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Suren Baghdasaryan <[email protected]> Reported-by: Lokesh Gidra <[email protected]> Reviewed-by: Peter Xu <[email protected]> Acked-by: Liam R. Howlett <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Barry Song <[email protected]> Cc: Barry Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
c50f8e60 |
| 26-Feb-2025 |
Barry Song <[email protected]> |
mm: fix kernel BUG when userfaultfd_move encounters swapcache
userfaultfd_move() checks whether the PTE entry is present or a swap entry.
- If the PTE entry is present, move_present_pte() handles folio migration by setting:
src_folio->index = linear_page_index(dst_vma, dst_addr);
- If the PTE entry is a swap entry, move_swap_pte() simply copies the PTE to the new dst_addr.
This approach is incorrect because, even if the PTE is a swap entry, it can still reference a folio that remains in the swap cache.
This creates a race window between steps 2 and 4.
1. add_to_swap: The folio is added to the swapcache.
2. try_to_unmap: PTEs are converted to swap entries.
3. pageout: The folio is written back.
4. Swapcache is cleared.
If userfaultfd_move() occurs in the window between steps 2 and 4, after the swap PTE has been moved to the destination, accessing the destination triggers do_swap_page(), which may locate the folio in the swapcache. However, since the folio's index has not been updated to match the destination VMA, do_swap_page() will detect a mismatch.
This can result in two critical issues depending on the system configuration.
If KSM is disabled, both small and large folios can trigger a BUG during the add_rmap operation due to:
page_pgoff(folio, page) != linear_page_index(vma, address)
[ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
[ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
[ 13.337716] memcg:ffff00000405f000
[ 13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
[ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
[ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
[ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
[ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
[ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
[ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
[ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
[ 13.340190] ------------[ cut here ]------------
[ 13.340316] kernel BUG at mm/rmap.c:1380!
[ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[ 13.340969] Modules linked in:
[ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
[ 13.341470] Hardware name: linux,dummy-virt (DT)
[ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
[ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
[ 13.342018] sp : ffff80008752bb20
[ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
[ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
[ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
[ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
[ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
[ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
[ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
[ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
[ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
[ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
[ 13.343876] Call trace:
[ 13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
[ 13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
[ 13.344333]  do_swap_page+0x1060/0x1400
[ 13.344417]  __handle_mm_fault+0x61c/0xbc8
[ 13.344504]  handle_mm_fault+0xd8/0x2e8
[ 13.344586]  do_page_fault+0x20c/0x770
[ 13.344673]  do_translation_fault+0xb4/0xf0
[ 13.344759]  do_mem_abort+0x48/0xa0
[ 13.344842]  el0_da+0x58/0x130
[ 13.344914]  el0t_64_sync_handler+0xc4/0x138
[ 13.345002]  el0t_64_sync+0x1ac/0x1b0
[ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
[ 13.345504] ---[ end trace 0000000000000000 ]---
[ 13.345715] note: a.out[107] exited with irqs disabled
[ 13.345954] note: a.out[107] exited with preempt_count 2
If KSM is enabled, Peter Xu also discovered that do_swap_page() may trigger an unexpected CoW operation for small folios because ksm_might_need_to_copy() allocates a new folio when the folio index does not match linear_page_index(vma, addr).
This patch also checks the swapcache when handling swap entries. If a match is found in the swapcache, it processes it similarly to a present PTE. However, there are some differences. For example, the folio is no longer exclusive because folio_try_share_anon_rmap_pte() is performed during unmapping. Furthermore, in the case of swapcache, the folio has already been unmapped, eliminating the risk of concurrent rmap walks and removing the need to acquire src_folio's anon_vma or lock.
Note that for large folios, in the swapcache handling path, we directly return -EBUSY since split_folio() will return -EBUSY regardless if the folio is under writeback or unmapped. This is not an urgent issue, so a follow-up patch may address it separately.
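A sketch of the added swapcache check, assuming the usual swap-cache lookup helpers (simplified; not the literal patch):

    entry = pte_to_swp_entry(orig_src_pte);
    folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
    if (!IS_ERR(folio)) {
            /* Still in the swapcache: handle it like a present PTE and fix
               up folio->index for dst_vma, instead of just copying the swap
               PTE and leaving a stale index behind. */
    }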
[[email protected]: minor cleanup according to Peter Xu] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Barry Song <[email protected]> Acked-by: Peter Xu <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Al Viro <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Nicolas Geoffray <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shuah Khan <[email protected]> Cc: ZhangPeng <[email protected]> Cc: Tangquan Zheng <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2 |
|
| #
dd95d278 |
| 04-Dec-2024 |
Qi Zheng <[email protected]> |
mm: userfaultfd: recheck dst_pmd entry in move_pages_pte()
In move_pages_pte(), since dst_pte needs to be none, the subsequent pte_same() check cannot prevent the dst_pte page from being freed concurrently, so we also need to obtain dst_pmdval and recheck pmd_same(). Otherwise, once we support empty PTE page reclamation for anonymous pages, it may result in moving the src_pte page into the dst_pte page that is about to be freed by RCU.
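A sketch of the added recheck (simplified; names are illustrative):

    dst_pmdval = pmdp_get_lockless(dst_pmd);
    /* ... map dst_pte and take the PTE lock ... */
    if (unlikely(!pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd)) ||
                 !pte_same(ptep_get(dst_pte), orig_dst_pte))) {
            /* the dst PTE page may have been freed or replaced: retry */
            err = -EAGAIN;
            goto out;
    }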
[[email protected]: remove WARN_ON_ONCE()s] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/8108c262757fc492626f3a2ffc44b775f2710e16.1733305182.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Muchun Song <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Cc: Zach O'Keefe <[email protected]> Cc: Dan Carpenter <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3 |
|
| #
6359c39c |
| 10-Oct-2024 |
Kefeng Wang <[email protected]> |
mm: remove unused hugepage for vma_alloc_folio()
The hugepage parameter has been deprecated since commit ddc1a5cbc05d ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"); for PMD-sized THP, vma_alloc_folio() still tries only the preferred node when possible by checking the order of the folio allocation.
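Illustrative before/after of a call site (argument shape assumed):

    /* before */ folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, dst_vma, dst_addr, false);
    /* after  */ folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, dst_vma, dst_addr);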
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: Barry Song <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ryan Roberts <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.12-rc2, v6.12-rc1 |
|
| #
e9c74b54 |
| 26-Sep-2024 |
Qi Zheng <[email protected]> |
mm: userfaultfd: move_pages_pte() use pte_offset_map_rw_nolock()
In move_pages_pte(), we may modify the dst_pte and src_pte after acquiring the ptl, so convert it to use pte_offset_map_rw_nolock(). But since we will use pte_same() to detect changes to the PTE entry, there is no need to get the pmdval, so just pass a dummy variable to it.
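A sketch of the converted lookup (signature assumed from the helper's description); a throwaway pmdval is passed because pte_same() already catches concurrent changes:

    pmd_t dummy_pmdval;
    spinlock_t *dst_ptl;

    dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr,
                                       &dummy_pmdval, &dst_ptl);
    if (!dst_pte)
            goto retry;     /* the page table vanished under us */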
Link: https://lkml.kernel.org/r/1530e8fdbfc72eacf3b095babe139ce3d715600a.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Rapoport (Microsoft) <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.11, v6.11-rc7 |
|
| #
e1e4cfd0 |
| 03-Sep-2024 |
Rik van Riel <[email protected]> |
mm,tmpfs: consider end of file write in shmem_is_huge
Take the end of a file write into consideration when deciding whether or not to use huge pages for tmpfs files when the tmpfs filesystem is mounted with huge=within_size.
This allows large writes that append to the end of a file to automatically use large pages.
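A rough sketch of the adjusted huge=within_size check (simplified and illustrative; the real heuristic rounds the index up to PMD granularity):

    /* take the end of the pending write into account, not just i_size */
    i_size = max(write_end, i_size_read(inode));
    i_size = round_up(i_size, PAGE_SIZE);
    if (i_size >> PAGE_SHIFT >= round_up(index + 1, HPAGE_PMD_NR))
            return true;    /* the write reaches into this PMD-sized block */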
Doing 4MB sequential writes without fallocate to a 16GB tmpfs file with fio. The numbers without THP or with huge=always stay the same, but the performance with huge=within_size now matches that of huge=always.
huge            before        after
4kB pages       1560 MB/s     1560 MB/s
within_size     1560 MB/s     4720 MB/s
always:         4720 MB/s     4720 MB/s
[[email protected]: coding-style cleanups] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rik van Riel <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Tested-by: Baolin Wang <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2 |
|
| #
a17c7d8f |
| 29-Jul-2024 |
Lorenzo Stoakes <[email protected]> |
userfaultfd: move core VMA manipulation logic to mm/userfaultfd.c
Patch series "Make core VMA operations internal and testable", v4.
There are a number of "core" VMA manipulation functions implemen
userfaultfd: move core VMA manipulation logic to mm/userfaultfd.c
Patch series "Make core VMA operations internal and testable", v4.
There are a number of "core" VMA manipulation functions implemented in mm/mmap.c, notably those concerning VMA merging, splitting, modifying, expanding and shrinking, which logically don't belong there.
More importantly this functionality represents an internal implementation detail of memory management and should not be exposed outside of mm/ itself.
This patch series isolates core VMA manipulation functionality into its own file, mm/vma.c, and provides an API to the rest of the mm code in mm/vma.h.
Importantly, it also carefully implements mm/vma_internal.h, which specifies which headers need to be imported by vma.c, leading to the very useful property that vma.c depends only on mm/vma.h and mm/vma_internal.h.
This means we can then re-implement vma_internal.h in userland, adding shims for kernel mechanisms as required, allowing us to unit test internal VMA functionality.
This testing is useful as opposed to an e.g. kunit implementation as this way we can avoid all external kernel side-effects while testing, run tests VERY quickly, and iterate on and debug problems quickly.
Excitingly this opens the door to, in the future, recreating precise problems observed in production in userland and very quickly debugging problems that might otherwise be very difficult to reproduce.
This patch series takes advantage of existing shim logic and full userland maple tree support contained in tools/testing/radix-tree/ and tools/include/linux/, separating out shared components of the radix tree implementation to provide this testing.
Kernel functionality is stubbed and shimmed as needed in tools/testing/vma/ which contains a fully functional userland vma_internal.h file and which imports mm/vma.c and mm/vma.h to be directly tested from userland.
A simple, skeleton testing implementation is provided in tools/testing/vma/vma.c as a proof-of-concept, asserting that simple VMA merge, modify (testing split), expand and shrink functionality work correctly.
This patch (of 4):
This patch forms part of a patch series intending to separate out VMA logic and render it testable from userspace, which requires that core manipulation functions be exposed in an mm/-internal header file.
In order to do this, we must abstract APIs we wish to test, in this instance functions which ultimately invoke vma_modify().
This patch therefore moves all logic which ultimately invokes vma_modify() to mm/userfaultfd.c, trying to transfer code at a functional granularity where possible.
[[email protected]: fix user-after-free in userfaultfd_clear_vma()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/50c3ed995fd81c45876c86304c8a00bf3e396cfd.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Gow <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Rae Moar <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Pengfei Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
4828d207 |
| 13-Aug-2024 |
Jann Horn <[email protected]> |
userfaultfd: don't BUG_ON() if khugepaged yanks our page table
Since khugepaged was changed to allow retracting page tables in file mappings without holding the mmap lock, these BUG_ON()s are wrong - get rid of them.
We could also remove the preceding "if (unlikely(...))" block, but then we could reach pte_offset_map_lock() with transhuge pages not just for file mappings but also for anonymous mappings - which would probably be fine but I think is not necessarily expected.
Link: https://lkml.kernel.org/r/[email protected] Fixes: 1d65b771bc08 ("mm/khugepaged: retract_page_tables() without mmap or vma lock") Signed-off-by: Jann Horn <[email protected]> Reviewed-by: Qi Zheng <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
71c186ef |
| 13-Aug-2024 |
Jann Horn <[email protected]> |
userfaultfd: fix checks for huge PMDs
Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2.
The pmd_trans_huge() code in mfill_atomic() is wrong in three different ways depending
userfaultfd: fix checks for huge PMDs
Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2.
The pmd_trans_huge() code in mfill_atomic() is wrong in three different ways depending on kernel version:
1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit the right two race windows) - I've tested this in a kernel build with some extra mdelay() calls. See the commit message for a description of the race scenario. On older kernels (before 6.5), I think the same bug can even theoretically lead to accessing transhuge page contents as a page table if you hit the right 5 narrow race windows (I haven't tested this case).
2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for detecting PMDs that don't point to page tables. On older kernels (before 6.5), you'd just have to win a single fairly wide race to hit this. I've tested this on 6.1 stable by racing migration (with a mdelay() patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86 VM, that causes a kernel oops in ptlock_ptr().
3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed to yank page tables out from under us (though I haven't tested that), so I think the BUG_ON() checks in mfill_atomic() are just wrong.
I decided to write two separate fixes for these (one fix for bugs 1+2, one fix for bug 3), so that the first fix can be backported to kernels affected by bugs 1+2.
This patch (of 2):
This fixes two issues.
I discovered that the following race can occur:
mfill_atomic                            other thread
============                            ============
                                        <zap PMD>
pmdp_get_lockless() [reads none pmd]
<bail if trans_huge>
<if none:>
                                        <pagefault creates transhuge zeropage>
  __pte_alloc [no-op]
                                        <zap PMD>
<bail if pmd_trans_huge(*dst_pmd)>
BUG_ON(pmd_none(*dst_pmd))
I have experimentally verified this in a kernel with extra mdelay() calls; the BUG_ON(pmd_none(*dst_pmd)) triggers.
On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow pte_offset_map[_lock]() to fail"), this can't lead to anything worse than a BUG_ON(), since the page table access helpers are actually designed to deal with page tables concurrently disappearing; but on older kernels (<=6.4), I think we could probably theoretically race past the two BUG_ON() checks and end up treating a hugepage as a page table.
The second issue is that, as Qi Zheng pointed out, there are other types of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs (in particular, migration PMDs).
On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a PMD that contains a migration entry (which just requires winning a single, fairly wide race), it will pass the PMD to pte_offset_map_lock(), which assumes that the PMD points to a page table.
Breakage follows: First, the kernel tries to take the PTE lock (which will crash or maybe worse if there is no "struct page" for the address bits in the migration entry PMD - I think at least on X86 there usually is no corresponding "struct page" thanks to the PTE inversion mitigation, amd64 looks different).
If that didn't crash, the kernel would next try to write a PTE into what it wrongly thinks is a page table.
As part of fixing these issues, get rid of the check for pmd_trans_huge() before __pte_alloc() - that's redundant, we're going to have to check for that after the __pte_alloc() anyway.
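A sketch of the resulting check pattern (simplified and illustrative rather than the literal diff):

    dst_pmdval = pmdp_get_lockless(dst_pmd);
    if (unlikely(pmd_none(dst_pmdval)) && unlikely(__pte_alloc(dst_mm, dst_pmd)))
            return -ENOMEM;
    dst_pmdval = pmdp_get_lockless(dst_pmd);
    /* pmd_trans_huge() alone is not enough: also reject devmap and
       non-present (e.g. migration) PMDs instead of treating them as
       page tables */
    if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
                 pmd_devmap(dst_pmdval)))
            return -EEXIST;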
Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation") Signed-off-by: Jann Horn <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Qi Zheng <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.11-rc1, v6.10, v6.10-rc7 |
|
| #
e6c0c032 |
| 02-Jul-2024 |
Christophe Leroy <[email protected]> |
mm: provide mm_struct and address to huge_ptep_get()
On powerpc 8xx huge_ptep_get() will need to know whether the given ptep is a PTE entry or a PMD entry. This cannot be known with the PMD entry itself because there is no easy way to know it from the content of the entry.
So huge_ptep_get() will need to know either the size of the page or get the pmd.
In order to be consistent with huge_ptep_get_and_clear(), give mm and address to huge_ptep_get().
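Illustrative before/after of a call site (shape per the description above):

    /* before */ pte = huge_ptep_get(src_pte);
    /* after  */ pte = huge_ptep_get(src_vma->vm_mm, src_addr, src_pte);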
Link: https://lkml.kernel.org/r/cc00c70dd384298796a4e1b25d6c4eb306d3af85.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.10-rc6, v6.10-rc5 |
|
| #
15bde4ab |
| 17-Jun-2024 |
Barry Song <[email protected]> |
mm: extend rmap flags arguments for folio_add_new_anon_rmap
Patch series "mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()", v2.
This patchset is preparatory work for mTHP swapin.
mm: extend rmap flags arguments for folio_add_new_anon_rmap
Patch series "mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()", v2.
This patchset is preparatory work for mTHP swapin.
folio_add_new_anon_rmap() assumes that new anon rmaps are always exclusive. However, this assumption doesn’t hold true for cases like do_swap_page(), where a new anon might be added to the swapcache and is not necessarily exclusive.
The patchset extends the rmap flags to allow folio_add_new_anon_rmap() to handle both exclusive and non-exclusive new anon folios. The do_swap_page() function is updated to use this extended API with rmap flags. Consequently, all new anon folios now consistently use folio_add_new_anon_rmap(). The special case for !folio_test_anon() in __folio_add_anon_rmap() can be safely removed.
In conclusion, new anon folios always use folio_add_new_anon_rmap(), regardless of exclusivity. Old anon folios continue to use __folio_add_anon_rmap() via folio_add_anon_rmap_pmd() and folio_add_anon_rmap_ptes().
This patch (of 3):
In the case of a swap-in, a new anonymous folio is not necessarily exclusive. This patch updates the rmap flags to allow a new anonymous folio to be treated as either exclusive or non-exclusive. To maintain the existing behavior, we always use EXCLUSIVE as the default setting.
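A sketch of the extended API in use (the flag names follow the existing rmap flag convention; the condition is simplified):

    rmap_t rmap_flags = exclusive ? RMAP_EXCLUSIVE : RMAP_NONE;

    folio_add_new_anon_rmap(folio, vma, address, rmap_flags);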
[[email protected]: cleanup and constifications per David and akpm] [[email protected]: fix missing doc for flags of folio_add_new_anon_rmap()] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: enhance doc for extend rmap flags arguments for folio_add_new_anon_rmap] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Tested-by: Shuai Yuan <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Chris Li <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.10-rc4, v6.10-rc3, v6.10-rc2 |
|
| #
37a4b34a |
| 31-May-2024 |
Jiapeng Chong <[email protected]> |
mm: userfaultfd: use swap() in double_pt_lock()
Use existing swap() function rather than duplicating its implementation.
./mm/userfaultfd.c:1006:13-14: WARNING opportunity for swap()
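Roughly, the cleanup amounts to (illustrative):

    /* before: an open-coded three-assignment exchange of ptl1/ptl2 */
    if (ptl1 > ptl2)
            swap(ptl1, ptl2);       /* use the generic swap() helper instead */
    spin_lock(ptl1);
    spin_lock_nested(ptl2, SINGLE_DEPTH_NESTING);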
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jiapeng Chong <[email protected]> Reported-by: Abaci Robot <[email protected]> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9266 Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6 |
|
| #
73b4a0cd |
| 26-Apr-2024 |
Matthew Wilcox (Oracle) <[email protected]> |
mm: fix some minor per-VMA lock issues in userfaultfd
Rename lock_vma() to uffd_lock_vma() because it really is uffd specific. Remove comment referencing unlock_vma() which doesn't exist. Fix the comment about lock_vma_under_rcu() which I just made incorrect.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jann Horn <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
a568b412 |
| 23-Apr-2024 |
Matthew Wilcox (Oracle) <[email protected]> |
userfault; expand folio use in mfill_atomic_install_pte()
Call page_folio() a little earlier so we can use folio_mapping() instead of page_mapping(), saving a call to compound_head().
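Roughly (illustrative):

    struct folio *folio = page_folio(page);        /* taken earlier now */
    bool page_in_cache = folio_mapping(folio);     /* folio_mapping() instead of
                                                      page_mapping(), avoiding a
                                                      compound_head() call */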
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Eric Biggers <[email protected]> Cc: Sidhartha Kumar <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.9-rc5 |
|
| #
b5ba3a64 |
| 15-Apr-2024 |
Suren Baghdasaryan <[email protected]> |
userfaultfd: remove WRITE_ONCE when setting folio->index during UFFDIO_MOVE
When folio is moved with UFFDIO_MOVE it gets locked before the rmap and index are modified. Due to the folio lock being already held, WRITE_ONCE() is not needed when setting the folio index. Remove it.
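Illustrative (the folio lock is held at this point, so a plain store suffices):

    src_folio->index = linear_page_index(dst_vma, dst_addr);
    /* previously: WRITE_ONCE(src_folio->index, linear_page_index(dst_vma, dst_addr)) */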
Link: https://lkml.kernel.org/r/[email protected] Reported-by: Matthew Wilcox <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Lokesh Gidra <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.9-rc4, v6.9-rc3, v6.9-rc2 |
|
| #
e06d03d5 |
| 26-Mar-2024 |
Matthew Wilcox (Oracle) <[email protected]> |
mm: add pmd_folio()
Convert directly from a pmd to a folio without going through another representation first. For now this is just a slightly shorter way to write it, but it might end up being more efficient later.
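The helper is essentially (sketch):

    #define pmd_folio(pmd)  page_folio(pmd_page(pmd))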
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
5beaee54 |
| 26-Mar-2024 |
Matthew Wilcox (Oracle) <[email protected]> |
mm: add is_huge_zero_folio()
This is the folio equivalent of is_huge_zero_page(). It doesn't add any efficiency, but it does prevent the caller from passing a tail page and getting confused when the predicate returns false.
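A sketch of an equivalent definition (illustrative; the point is that it takes a folio, so a tail page cannot be passed by mistake):

    static inline bool is_huge_zero_folio(const struct folio *folio)
    {
            return is_huge_zero_page(&folio->page);
    }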
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
90a7592d |
| 11-Apr-2024 |
David Hildenbrand <[email protected]> |
mm/userfaultfd: Do not place zeropages when zeropages are disallowed
s390x must disable shared zeropages for processes running VMs, because the VMs could end up making use of "storage keys" or protected virtualization, which are incompatible with shared zeropages.
Yet, with userfaultfd it is possible to insert shared zeropages into such processes. Let's fall back to simply allocating a fresh zeroed anonymous folio and inserting that instead.
mm_forbids_zeropage() was introduced in commit 593befa6ab74 ("mm: introduce mm_forbids_zeropage function"), briefly before userfaultfd went upstream.
Note that we don't want to fail the UFFDIO_ZEROPAGE request like we do for hugetlb; it would be rather unexpected. Further, we also cannot really indicate "not supported" to user space ahead of time: it could be that the MM disallows zeropages after userfaultfd was already registered.
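A sketch of the fallback decision (illustrative; the helper name used here is hypothetical):

    if (mm_forbids_zeropage(dst_vma->vm_mm))
            /* e.g. s390x with storage keys or protected virtualization */
            return mfill_atomic_install_zeroed_folio(dst_pmd, dst_vma, dst_addr);  /* hypothetical helper */
    /* otherwise install the shared zeropage as before */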
[ agordeev: Fixed checkpatch complaints ]
Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation") Reviewed-by: Peter Xu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Alexander Gordeev <[email protected]>
|
|
Revision tags: v6.9-rc1 |
|
| #
30af24fa |
| 21-Mar-2024 |
Lokesh Gidra <[email protected]> |
userfaultfd: fix deadlock warning when locking src and dst VMAs
Use down_read_nested() to avoid the warning.
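Roughly, the locking order becomes (sketch; the per-VMA lock field layout is assumed):

    down_read(&dst_vma->vm_lock->lock);
    if (src_vma != dst_vma)
            down_read_nested(&src_vma->vm_lock->lock, SINGLE_DEPTH_NESTING);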
Link: https://lkml.kernel.org/r/[email protected] Fixes: 867a43a34ff8 ("userfaultfd: use per-vma locks in userfaultfd operations") Reported-by: [email protected] Signed-off-by: Lokesh Gidra <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Jann Horn <[email protected]> [Bug #2] Cc: Kalesh Singh <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Nicolas Geoffray <[email protected]> Cc: Peter Xu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.8 |
|
| #
b14d1671 |
| 07-Mar-2024 |
James Houghton <[email protected]> |
mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
Users of UFFDIO_CONTINUE may reasonably assume that a write memory barrier is included as part of UFFDIO_CONTINUE. That is, a user may believe that all writes it has done to a page that it is now UFFDIO_CONTINUE'ing are guaranteed to be visible to anyone subsequently reading the page through the newly mapped virtual memory region.
Today, such a user happens to be correct. mmget_not_zero(), for example, is called as part of UFFDIO_CONTINUE (and comes before any PTE updates), and it implicitly gives us a write barrier.
To be resilient against future changes, include an explicit smp_wmb(). While we're at it, optimize the smp_wmb() that is already incidentally present for the HugeTLB case.
Merely making a syscall does not generally imply the memory ordering constraints that we need (including on x86).
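Illustrative placement of the barrier (simplified):

    /*
     * Make the userspace writes to the page visible before the PTE that
     * maps it can be observed by another thread through the new mapping.
     */
    smp_wmb();
    set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);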
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: James Houghton <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|