|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4 |
|
| #
e9f180d7 |
| 22-Apr-2025 |
David Hildenbrand <[email protected]> |
kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
Not intuitive, but vm_area_dup() located in kernel/fork.c is not only used for duplicating VMAs during fork(), but also for duplicating VMAs when splitting VMAs or when mremap()'ing them.
VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in size) and apparently also shrunk during mremap(), which implies duplicating the VMA in __split_vma() first.
In case of ordinary mremap() (no change in size), we first duplicate the VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on the old VMA: we effectively move the VM_PAT reservation. So the untrack_pfn_clear() call on the newly duplicated VMA is wrong in that context.
Splitting of VMAs seems problematic, because we don't duplicate/adjust the reservation when splitting the VMA. Instead, in memtype_erase() -- called during zapping/munmap -- we shrink a reservation in case only the end address matches: Assume we split a VMA into A and B, both would share a reservation until B is unmapped.
So when unmapping B, the reservation would be updated to cover only A. When unmapping A, we would properly remove the now-shrunk reservation. That scenario describes the mremap() shrinking (old_size > new_size), where we split + unmap B, and there the untrack_pfn_clear() on the new VMA is wrong as well.
What if we manage to split a VM_PFNMAP VMA into A and B and unmap A first? It would be broken because we would never free the reservation. Likely, there are ways to trigger such a VMA split outside of mremap().
Affecting other VMA duplication was not intended; vm_area_dup() being used outside of kernel/fork.c was an oversight. So let's fix that for now; how to handle VMA splits better should be investigated separately.
With a simple reproducer that uses mprotect() to split such a VMA I can trigger
x86/PAT: pat_mremap:26448 freeing invalid memtype [mem 0x00000000-0x00000fff]
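As a rough, non-authoritative sketch of the shape of the fix (the helper name below is hypothetical, this is not the upstream diff): untrack_pfn_clear() is confined to the fork path instead of living in the generic vm_area_dup().

    /*
     * Sketch only: vm_area_dup() stays a plain duplication helper used by
     * fork, VMA splitting and mremap(); only the fork path clears VM_PAT
     * tracking on the copy, because fork() never inherits the parent's
     * PAT reservation. dup_vma_for_fork() is a hypothetical name.
     */
    static struct vm_area_struct *dup_vma_for_fork(struct vm_area_struct *orig)
    {
        struct vm_area_struct *new = vm_area_dup(orig);

        if (new)
            untrack_pfn_clear(new);
        return new;
    }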
Link: https://lkml.kernel.org/r/[email protected] Fixes: dc84bc2aba85 ("x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Linus Torvalds <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14 |
|
| #
dc84bc2a |
| 21-Mar-2025 |
David Hildenbrand <[email protected]> |
x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
If track_pfn_copy() fails, we already added the dst VMA to the maple tree. As fork() fails, we'll cleanup the maple tree, and stumble over the dst VMA for which we neither performed any reservation nor copied any page tables.
Consequently untrack_pfn() will see VM_PAT and try obtaining the PAT information from the page table -- which fails because the page table was not copied.
The easiest fix would be to simply clear the VM_PAT flag of the dst VMA if track_pfn_copy() fails. However, the whole idea of "simply" clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy() and performed a reservation, but copying the page tables fails, we'll simply clear the VM_PAT flag, not properly undoing the reservation ... which is also wrong.
So let's fix it properly: set the VM_PAT flag only if the reservation succeeded (leaving it clear initially), and undo the reservation if anything goes wrong while copying the page tables: clearing the VM_PAT flag after undoing the reservation.
Note that any copied page table entries will get zapped when the VMA will get removed later, after copy_page_range() succeeded; as VM_PAT is not set then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be happy. Note that leaving these page tables in place without a reservation is not a problem, as we are aborting fork(); this process will never run.
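A hedged sketch of the intended control flow (function names other than track_pfn_copy()/untrack_pfn_copy() are illustrative placeholders, and the exact signatures may differ from the upstream code):

    /* Illustrative only: set VM_PAT on the dst VMA only after a successful
     * reservation, and undo the reservation -- clearing VM_PAT again -- if
     * copying the page tables fails afterwards. */
    static int copy_pfnmap_range_sketch(struct vm_area_struct *dst,
                                        struct vm_area_struct *src)
    {
        int err;

        err = track_pfn_copy(dst, src);        /* take the reservation first */
        if (err)
            return err;                        /* dst never had VM_PAT set */

        err = copy_the_page_tables(dst, src);  /* placeholder for the real copy */
        if (err)
            untrack_pfn_copy(dst);             /* undo reservation, clear VM_PAT */
        return err;
    }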
A reproducer can trigger this usually at the first try:
https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c
WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110
Modules linked in: ...
CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
RIP: 0010:get_pat_info+0xf6/0x110
...
Call Trace:
 <TASK>
 ...
 untrack_pfn+0x52/0x110
 unmap_single_vma+0xa6/0xe0
 unmap_vmas+0x105/0x1f0
 exit_mmap+0xf6/0x460
 __mmput+0x4b/0x120
 copy_process+0x1bf6/0x2aa0
 kernel_clone+0xab/0x440
 __do_sys_clone+0x66/0x90
 do_syscall_64+0x95/0x180
Likely this case was missed in:
d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")
... and instead of undoing the reservation we simply cleared the VM_PAT flag.
Keep the documentation of these functions in include/linux/pgtable.h; one place is more than sufficient -- we should clean that up for the other functions like track_pfn_remap/untrack_pfn separately.
Fixes: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed") Fixes: 2ab640379a0a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3") Reported-by: xingwei lee <[email protected]> Reported-by: yuxin wang <[email protected]> Reported-by: Marius Fleischer <[email protected]> Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Closes: https://lore.kernel.org/lkml/CABOYnLx_dnqzpCW99G81DmOr+2UzdmZMk=T3uxwNxwz+R1RAwg@mail.gmail.com/ Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/
|
| #
b25bcabb |
| 17-Mar-2025 |
Uladzislau Rezki (Sony) <[email protected]> |
fork: use __vmalloc_node() for stack allocation
Replace __vmalloc_node_range() by __vmalloc_node(). The latter requires fewer parameters and uses exactly the same arguments, some of which are now hidden inside __vmalloc_node().
There is no functional change; it just makes the code a bit simpler.
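A hedged before/after sketch of what such a conversion looks like for a kernel-stack allocation (the surrounding context and the exact flag/range arguments are illustrative, not the upstream hunk):

    /* before: many parameters, several of them effectively constants */
    stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
                                 VMALLOC_START, VMALLOC_END,
                                 THREADINFO_GFP, PAGE_KERNEL, 0, node,
                                 __builtin_return_address(0));

    /* after: the range/prot arguments are the defaults now hidden inside
     * __vmalloc_node() */
    stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, THREADINFO_GFP,
                           node, __builtin_return_address(0));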
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.14-rc7 |
|
| #
68db2727 |
| 16-Mar-2025 |
Christian Brauner <[email protected]> |
pidfs: ensure that PIDFS_INFO_EXIT is available
When we currently create a pidfd we check that the task hasn't been reaped right before we create the pidfd. But it is of course possible that by the time we return the pidfd to userspace the task has already been reaped since we don't check again after having created a dentry for it.
This was fine until now because that race was meaningless. But now that we provide PIDFD_INFO_EXIT it is a problem because it is possible that the kernel returns a reaped pidfd and it depends on the race whether PIDFD_INFO_EXIT information is available. This depends on if the task gets reaped before or after a dentry has been attached to struct pid.
Make this consistent and only return pidfds for reaped tasks if PIDFD_INFO_EXIT information is available. This is done by performing another check of whether the task has been reaped right after we have attached a dentry to struct pid.
Since pidfs_exit() is called before struct pid's task linkage is removed the case where the task got reaped but a dentry was already attached to struct pid and exit information was recorded and published can be handled correctly. In that case we do return a pidfd for a reaped task like we would've before.
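A loose sketch of the double-check described above; every helper name here is hypothetical and only meant to show the ordering, not the pidfs internals:

    /* Hypothetical helpers -- only the ordering matters:
     * 1) attach the dentry to struct pid,
     * 2) re-check whether the task was reaped before the dentry existed.
     * If it was reaped that early, no exit information could have been
     * recorded, so refuse to hand out a pidfd for it. */
    dentry = pidfs_attach_dentry(pid);                      /* hypothetical */
    if (IS_ERR(dentry))
        return PTR_ERR(dentry);

    if (task_was_reaped(pid) && !exit_info_recorded(pid))   /* hypothetical */
        return -ESRCH;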
Link: https://lore.kernel.org/r/20250316-kabel-fehden-66bdb6a83436@brauner Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.14-rc6 |
|
| #
6af8cb80 |
| 03-Mar-2025 |
David Hildenbrand <[email protected]> |
mm/rmap: basic MM owner tracking for large folios (!hugetlb)
For small folios, we traditionally use the mapcount to decide whether it was "certainly mapped exclusively" by a single MM (mapcount == 1) or whether it "maybe mapped shared" by multiple MMs (mapcount > 1). For PMD-sized folios that were PMD-mapped, we were able to use a similar mechanism (single PMD mapping), but for PTE-mapped folios and in the future folios that span multiple PMDs, this does not work.
So we need a different mechanism to handle large folios. Let's add a new mechanism to detect whether a large folio is "certainly mapped exclusively", or whether it is "maybe mapped shared".
We'll use this information next to optimize CoW reuse for PTE-mapped anonymous THP, and to convert folio_likely_mapped_shared() to folio_maybe_mapped_shared(), independent of per-page mapcounts.
For each large folio, we'll have two slots, whereby a slot stores:
 (1) an MM id: unique id assigned to each MM
 (2) a per-MM mapcount
If a slot is unoccupied, it can be taken by the next MM that maps a folio page.
In addition, we'll remember the current state -- "mapped exclusively" vs. "maybe mapped shared" -- and use a bit spinlock to sync on updates and to reduce the total number of atomic accesses on updates. In the future, it might be possible to squeeze a proper spinlock into "struct folio". For now, keep it simple, as we only require the whole thing for THP, which is incompatible with RT.
As we have to squeeze this information into the "struct folio" of even folios of order-1 (2 pages), and we generally want to reduce the required metadata, we'll assign each MM a unique ID that can fit into an int. In total, we can squeeze everything into 4x int (2x long) on 64bit.
32bit support is a bit challenging, because we only have 2x long == 2x int in order-1 folios. But we can make it work for now, because we neither expect many MMs nor very large folios on 32bit.
We will reliably detect folios as "mapped exclusively" vs. "mapped shared" as long as only two MMs map pages of a folio at one point in time -- for example with fork() and short-lived child processes, or with apps that hand over state from one instance to another.
As soon as three MMs are involved at the same time, we might detect "maybe mapped shared" although the folio is "mapped exclusively".
Example 1:
(1) App1 faults in a (shmem/file-backed) folio page
 -> Tracked as MM0
(2) App2 faults in a folio page
 -> Tracked as MM1
(3) App1 unmaps all folio pages
-> We will detect "mapped exclusively".
Example 2:
(1) App1 faults in a (shmem/file-backed) folio page
 -> Tracked as MM0
(2) App2 faults in a folio page
 -> Tracked as MM1
(3) App3 faults in a folio page
 -> No slot available, tracked as "unknown"
(4) App1 and App2 unmap all folio pages
-> We will detect "maybe mapped shared".
Make use of __always_inline to keep possible performance degradation when (un)mapping large folios to a minimum.
Note: by squeezing the two flags into the "unsigned long" that stores the MM ids, we can use non-atomic __bit_spin_unlock() and non-atomic setting/clearing of the "maybe mapped shared" bit, effectively not adding any new atomics on the hot path when updating the large mapcount + new metadata, which further helps reduce the runtime overhead in micro-benchmarks.
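As a purely illustrative sketch of the two-slot scheme described above (field and helper names are not the upstream ones; slot_holds()/slot_is_free()/claim_or_inc()/set_maybe_mapped_shared() are hypothetical), mapping a folio page into an MM would conceptually do:

    /* Illustrative layout and update logic only. */
    struct folio_mm_slots_sketch {
        unsigned long mm_ids;         /* two MM ids + "maybe shared" flag + bit spinlock */
        unsigned int  mm_mapcount[2]; /* per-MM mapcounts for the two slots */
    };

    static void folio_track_mm_sketch(struct folio_mm_slots_sketch *s, int mm_id)
    {
        /* bit_spin_lock() on the flags word would go here */
        if (slot_holds(s, 0, mm_id) || slot_is_free(s, 0))
            claim_or_inc(s, 0, mm_id);
        else if (slot_holds(s, 1, mm_id) || slot_is_free(s, 1))
            claim_or_inc(s, 1, mm_id);
        else
            set_maybe_mapped_shared(s);  /* third MM: give up on exact tracking */
        /* bit_spin_unlock() */
    }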
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1 |
|
| #
0c555a3c |
| 27-Jan-2025 |
Andrii Nakryiko <[email protected]> |
mm,procfs: allow read-only remote mm access under CAP_PERFMON
It's very common for various tracing and profiling tools to need access to /proc/PID/maps contents for stack symbolization: to learn which shared libraries are mapped in memory, at which file offset, etc. Currently, access to /proc/PID/maps requires CAP_SYS_PTRACE (unless we are looking at data for our own process, which is a trivial case not too relevant for profiler use cases).
Unfortunately, CAP_SYS_PTRACE implies way more than just the ability to discover the memory layout of another process: it allows full control over arbitrary other processes. This is problematic from a security POV for applications that only need read-only /proc/PID/maps (and other similar read-only data) access, and in large production settings CAP_SYS_PTRACE is frowned upon even for system-wide profilers.
On the other hand, it's already possible to access similar kind of information (and more) with just CAP_PERFMON capability. E.g., setting up PERF_RECORD_MMAP collection through perf_event_open() would give one similar information to what /proc/PID/maps provides.
CAP_PERFMON, together with CAP_BPF, is already a very common combination for system-wide profiling and observability applications. As such, it's reasonable and convenient to be able to access /proc/PID/maps with CAP_PERFMON capabilities instead of CAP_SYS_PTRACE.
For procfs, these permissions are checked through common mm_access() helper, and so we augment that with cap_perfmon() check *only* if requested mode is PTRACE_MODE_READ. I.e., PTRACE_MODE_ATTACH wouldn't be permitted by CAP_PERFMON. So /proc/PID/mem, which uses PTRACE_MODE_ATTACH, won't be permitted by CAP_PERFMON, but /proc/PID/maps, /proc/PID/environ, and a bunch of other read-only contents will be allowable under CAP_PERFMON.
Besides procfs itself, mm_access() is used by process_madvise() and process_vm_{readv,writev}() syscalls. The former one uses PTRACE_MODE_READ to avoid leaking ASLR metadata, and as such CAP_PERFMON seems like a meaningful allowable capability as well.
process_vm_{readv,writev} currently assume PTRACE_MODE_ATTACH level of permissions (though for readv PTRACE_MODE_READ seems more reasonable, but that's outside the scope of this change), and as such won't be affected by this patch.
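A hedged sketch of the policy (not the exact upstream hunk in mm_access()): CAP_PERFMON is only consulted for the read-only mode.

    /* Sketch: perfmon_capable() may only satisfy PTRACE_MODE_READ style
     * access; PTRACE_MODE_ATTACH still requires the usual ptrace checks. */
    static bool perfmon_may_read_mm(unsigned int mode)
    {
        if (!(mode & PTRACE_MODE_READ))
            return false;
        return perfmon_capable();
    }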
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andrii Nakryiko <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Ingo Molnar <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kees Cook <[email protected]> Cc: Liam Howlett <[email protected]> Cc: "Mike Rapoport (IBM)" <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
31041385 |
| 13-Feb-2025 |
Suren Baghdasaryan <[email protected]> |
mm: make vma cache SLAB_TYPESAFE_BY_RCU
To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that object reuse before RCU grace period is over will be detected by lock_vma_under_rcu().
Current checks are sufficient as long as vma is detached before it is freed. The only place this is not currently happening is in exit_mmap(). Add the missing vma_mark_detached() in exit_mmap().
Another issue which might trick lock_vma_under_rcu() during vma reuse is vm_area_dup(), which copies the entire content of the vma into a new one, overriding the new vma's vm_refcnt and temporarily making it appear as attached. This might trick a racing lock_vma_under_rcu() into operating on a reused vma if it found the vma before it got reused. To prevent this situation, we should ensure that vm_refcnt stays at the detached state (0) when the vma is copied and advances to the attached state only after it is added into the vma tree. Introduce vm_area_init_from(), which preserves the new vma's vm_refcnt, and use it in vm_area_dup(). Since all vmas are in detached state with no current readers when they are freed, lock_vma_under_rcu() will not be able to take vm_refcnt after the vma got detached even if the vma is reused. vma_mark_attached() is modified to include a release fence to ensure all stores to the vma happen before vm_refcnt gets initialized.
Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate vm_area_struct reuse and will minimize the number of call_rcu() calls.
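For illustration (the flags other than SLAB_TYPESAFE_BY_RCU are assumptions about the existing cache setup, not taken from the patch), the cache creation conceptually becomes:

    /* SLAB_TYPESAFE_BY_RCU lets freed objects be reused without waiting
     * for an RCU grace period; lock_vma_under_rcu() must therefore
     * revalidate the VMA after taking a reference to it. */
    vm_area_cachep = kmem_cache_create("vm_area_struct",
                                       sizeof(struct vm_area_struct), 0,
                                       SLAB_TYPESAFE_BY_RCU | SLAB_PANIC |
                                       SLAB_ACCOUNT, NULL);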
[[email protected]: remove atomic_set_release() usage in tools/] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Tested-by: Shivank Garg <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Klara Modin <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: "Paul E . McKenney" <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
f35ab95c |
| 13-Feb-2025 |
Suren Baghdasaryan <[email protected]> |
mm: replace vm_lock and detached flag with a reference count
rw_semaphore is a sizable structure of 40 bytes and consumes considerable space for each vm_area_struct. However vma_lock has two important specifics which can be used to replace rw_semaphore with a simpler structure:
1. Readers never wait. They try to take the vma_lock and fall back to mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock because writers first take mmap_lock in write mode.
Because of these requirements, full rw_semaphore functionality is not needed and we can replace rw_semaphore and the vma->detached flag with a refcount (vm_refcnt).
When vma is in detached state, vm_refcnt is 0 and only a call to vma_mark_attached() can take it out of this state. Note that unlike before, now we enforce both vma_mark_attached() and vma_mark_detached() to be done only after vma has been write-locked. vma_mark_attached() changes vm_refcnt to 1 to indicate that it has been attached to the vma tree. When a reader takes read lock, it increments vm_refcnt, unless the top usable bit of vm_refcnt (0x40000000) is set, indicating presence of a writer. When writer takes write lock, it sets the top usable bit to indicate its presence. If there are readers, writer will wait using newly introduced mm->vma_writer_wait. Since all writers take mmap_lock in write mode first, there can be only one writer at a time. The last reader to release the lock will signal the writer to wake up. refcount might overflow if there are many competing readers, in which case read-locking will fail. Readers are expected to handle such failures.
In summary:
1. all readers increment the vm_refcnt;
2. writer sets top usable (writer) bit of vm_refcnt;
3. readers cannot increment the vm_refcnt if the writer bit is set;
4. in the presence of readers, writer must wait for the vm_refcnt to drop to 1 (plus the VMA_LOCK_OFFSET writer bit), indicating an attached vma with no readers;
5. vm_refcnt overflow is handled by the readers.
While this vm_lock replacement does not yet result in a smaller vm_area_struct (it stays at 256 bytes due to cacheline alignment), it allows for further size optimization by structure member regrouping to bring the size of vm_area_struct below 192 bytes.
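A simplified, non-authoritative sketch of the reader side described in the summary above, assuming vm_refcnt behaves like a refcount_t (the real code closes the check/increment race and handles saturation differently):

    /* Sketch only: readers bump vm_refcnt unless the writer bit is set;
     * 0 means the VMA is detached and must not be used. */
    static bool vma_start_read_sketch(struct vm_area_struct *vma)
    {
        unsigned int cnt = refcount_read(&vma->vm_refcnt);

        if (cnt == 0 || (cnt & VMA_LOCK_OFFSET))
            return false;               /* detached, or a writer is present */
        return refcount_inc_not_zero(&vma->vm_refcnt);
    }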
[[email protected]: fix a crash due to vma_end_read() that should have been removed] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Tested-by: Shivank Garg <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Reviewed-by: Vlastimil Babka <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Klara Modin <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: "Paul E . McKenney" <[email protected]> Cc: Peter Xu <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
ce085396 |
| 13-Feb-2025 |
Suren Baghdasaryan <[email protected]> |
mm: move mmap_init_lock() out of the header file
mmap_init_lock() is used only from mm_init() in fork.c, therefore it does not have to reside in the header file. This move lets us avoid including additional headers in mmap_lock.h later, when mmap_init_lock() needs to initialize rcuwait object.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Tested-by: Shivank Garg <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Klara Modin <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: "Paul E . McKenney" <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
8ef95d8f |
| 13-Feb-2025 |
Suren Baghdasaryan <[email protected]> |
mm: mark vma as detached until it's added into vma tree
Current implementation does not set detached flag when a VMA is first allocated. This does not represent the real state of the VMA, which is detached until it is added into mm's VMA tree. Fix this by marking new VMAs as detached and resetting detached flag only after VMA is added into a tree.
Introduce vma_mark_attached() to make the API more readable and to simplify possible future cleanup when vma->vm_mm might be used to indicate detached vma and vma_mark_attached() will need an additional mm parameter.
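A minimal sketch of the intended lifecycle, assuming the usual VMA-iterator insertion path (simplified; error handling and locking omitted):

    vma = vm_area_alloc(mm);     /* a new VMA now starts out detached      */
    /* ... initialize vma ... */
    vma_iter_store(&vmi, vma);   /* insert into the mm's VMA tree          */
    vma_mark_attached(vma);      /* only now is it considered attached     */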
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Tested-by: Shivank Garg <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Klara Modin <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: "Paul E . McKenney" <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
7b6218ae |
| 13-Feb-2025 |
Suren Baghdasaryan <[email protected]> |
mm: move per-vma lock into vm_area_struct
Back when per-vma locks were introduced, vm_lock was moved out of vm_area_struct in [1] because of the performance regression caused by false cacheline sharing. Recent investigation [2] revealed that the regression is limited to a rather old Broadwell microarchitecture and even there it can be mitigated by disabling adjacent cacheline prefetching, see [3].
Splitting single logical structure into multiple ones leads to more complicated management, extra pointer dereferences and overall less maintainable code. When that split-away part is a lock, it complicates things even further. With no performance benefits, there are no reasons for this split. Merging the vm_lock back into vm_area_struct also allows vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset. Move vm_lock back into vm_area_struct, aligning it at the cacheline boundary and changing the cache to be cacheline-aligned as well. With kernel compiled using defconfig, this causes VMA memory consumption to grow from 160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:
slabinfo before:
 <name>           ... <objsize> <objperslab> <pagesperslab> : ...
 vma_lock         ...     40        102            1        : ...
 vm_area_struct   ...    160         51            2        : ...

slabinfo after moving vm_lock:
 <name>           ... <objsize> <objperslab> <pagesperslab> : ...
 vm_area_struct   ...    256         32            2        : ...
Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages, which is 5.5MB per 100000 VMAs. Note that the size of this structure is dependent on the kernel configuration and typically the original size is higher than 160 bytes. Therefore these calculations are close to the worst case scenario. A more realistic vm_area_struct usage before this change is:
 <name>           ... <objsize> <objperslab> <pagesperslab> : ...
 vma_lock         ...     40        102            1        : ...
 vm_area_struct   ...    176         46            2        : ...
Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages, which is 3.9MB per 100000 VMAs. This memory consumption growth can be addressed later by optimizing the vm_lock.
[1] https://lore.kernel.org/all/[email protected]/ [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/ [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
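Illustratively (the member list is elided and this is not the real struct layout), the move boils down to embedding the lock back into the VMA at a cacheline boundary:

    /* Sketch only -- real members elided. */
    struct vm_area_struct {
        /* ... existing members ... */

        /* vm_lock lives inside the VMA again; aligning it (and the slab
         * cache) to a cacheline avoids the false sharing that motivated
         * the original split. */
        struct vma_lock vm_lock ____cacheline_aligned_in_smp;
    } ____cacheline_aligned_in_smp;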
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Tested-by: Shivank Garg <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Klara Modin <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: "Paul E . McKenney" <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
3155a194 |
| 05-Mar-2025 |
Christian Brauner <[email protected]> |
pidfs: move setting flags into pidfs_alloc_file()
Instead of adding it in __pidfd_prepare(), place it where the actual file allocation happens and update the outdated comment.
Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-3-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
b573bf6f |
| 05-Mar-2025 |
Christian Brauner <[email protected]> |
pidfd: rely on automatic cleanup in __pidfd_prepare()
Rely on scope-based cleanup for the allocated file descriptor.
Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-2-c8c3d8361705@kernel.org Acked-by: Jeff Layton <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
91b7be70 |
| 05-Feb-2025 |
Nam Cao <[email protected]> |
fork: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism.
Patch was created by using Coccinelle.
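The conversion pattern, sketched (the timer and callback names below are placeholders, not the actual fields touched in kernel/fork.c):

    /* before: two-step initialization */
    hrtimer_init(&t->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    t->timer.function = my_timer_callback;

    /* after: one call initializes the timer and its callback */
    hrtimer_setup(&t->timer, my_timer_callback,
                  CLOCK_MONOTONIC, HRTIMER_MODE_REL);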
Signed-off-by: Nam Cao <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/174111145b945391e48936d6debcd43caec3e406.1738746821.git.namcao@linutronix.de
|
| #
64c37e13 |
| 27-Jan-2025 |
Liam R. Howlett <[email protected]> |
kernel: be more careful about dup_mmap() failures and uprobe registering
If a memory allocation fails during dup_mmap(), the maple tree can be left in an unsafe state for other iterators besides the exit path. All the locks are dropped before the exit_mmap() call (in mm/mmap.c), but the incomplete mm_struct can be reached through (at least) the rmap finding the vmas which have a pointer back to the mm_struct.
Up to this point, there have been no issues with being able to find an mm_struct that was only partially initialised. Syzbot was able to make the incomplete mm_struct fail with recent forking changes, so it has been proven unsafe to use the mm_struct that hasn't been initialised, as referenced in the link below.
Although 8ac662f5da19f ("fork: avoid inappropriate uprobe access to invalid mm") fixed the uprobe access, it does not completely remove the race.
This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the oom side (even though this is extremely unlikely to be selected as an oom victim in the race window), and sets MMF_UNSTABLE to avoid other potential users from using a partially initialised mm_struct.
When registering vmas for uprobe, skip the vmas in an mm that is marked unstable. Modifying a vma in an unstable mm may cause issues if the mm isn't fully initialised.
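A simplified sketch of the failure-path marking described above (the surrounding dup_mmap() context and the error-path label are placeholders):

    /* On duplication failure, mark the half-built mm so that the OOM
     * killer (MMF_OOM_SKIP) and other potential users (MMF_UNSTABLE)
     * stay away from it until exit_mmap() tears it down. */
    if (unlikely(retval)) {
        set_bit(MMF_OOM_SKIP, &mm->flags);
        set_bit(MMF_UNSTABLE, &mm->flags);
        goto loop_out;    /* placeholder for the existing error path */
    }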
Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Jann Horn <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Peng Zhang <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1 |
|
| #
e5e7fb27 |
| 22-Nov-2024 |
Suren Baghdasaryan <[email protected]> |
mm: convert mm_lock_seq to a proper seqcount
Convert mm_lock_seq to be seqcount_t and change all mmap_write_lock variants to increment it, in line with the usual seqcount usage pattern. This lets us check whether the mmap_lock is write-locked by checking the mm_lock_seq.sequence counter (odd=locked, even=unlocked). This will be used when implementing mmap_lock speculation functions. As a result, vm_lock_seq is also changed to be unsigned to match the type of mm_lock_seq.sequence.
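A minimal sketch of the odd/even check this enables (the helper name is illustrative; the real checks live around the mmap_lock wrappers):

    /* An odd sequence count means a writer currently holds mmap_lock. */
    static inline bool mmap_is_write_locked_sketch(struct mm_struct *mm)
    {
        return raw_read_seqcount(&mm->mm_lock_seq) & 1;
    }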
Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Xu <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
51f8bd6d |
| 19-Nov-2024 |
Mateusz Guzik <[email protected]> |
get_task_exe_file: check PF_KTHREAD locklessly
Same thing as 8ac5dc66599c ("get_task_mm: check PF_KTHREAD lockless")
Nowadays PF_KTHREAD is sticky and it was never protected by ->alloc_lock. Move the PF_KTHREAD check outside of task_lock() section to make this code more understandable.
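Sketch of the resulting shape of get_task_exe_file() (simplified, not necessarily the exact upstream body):

    struct file *get_task_exe_file(struct task_struct *task)   /* sketch */
    {
        struct file *exe_file = NULL;

        /* PF_KTHREAD is sticky, so it can be tested without task_lock(). */
        if (task->flags & PF_KTHREAD)
            return NULL;

        task_lock(task);
        if (task->mm)
            exe_file = get_mm_exe_file(task->mm);
        task_unlock(task);
        return exe_file;
    }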
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mateusz Guzik <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
8ac662f5 |
| 10-Dec-2024 |
Lorenzo Stoakes <[email protected]> |
fork: avoid inappropriate uprobe access to invalid mm
If dup_mmap() encounters an issue, currently uprobe is able to access the relevant mm via the reverse mapping (in build_map_info()), and if we are very unlucky with a race window, observe invalid XA_ZERO_ENTRY state which we establish as part of the fork error path.
This occurs because uprobe_write_opcode() invokes anon_vma_prepare() which in turn invokes find_mergeable_anon_vma() that uses a VMA iterator, invoking vma_iter_load() which uses the advanced maple tree API and thus is able to observe XA_ZERO_ENTRY entries added to dup_mmap() in commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()").
This change was made on the assumption that only process tear-down code would actually observe (and make use of) these values. However this very unlikely but still possible edge case with uprobes exists and unfortunately does make these observable.
The uprobe operation prevents races against the dup_mmap() operation via the dup_mmap_sem semaphore, which is acquired via uprobe_start_dup_mmap() and dropped via uprobe_end_dup_mmap(), and held across register_for_each_vma() prior to invoking build_map_info() which does the reverse mapping lookup.
Currently these are acquired and dropped within dup_mmap(), which exposes the race window prior to error handling in the invoking dup_mm() which tears down the mm.
We can avoid all this by just moving the invocation of uprobe_start_dup_mmap() and uprobe_end_dup_mmap() up a level to dup_mm() and only release this lock once the dup_mmap() operation succeeds or clean up is done.
This means that the uprobe code can never observe an incompletely constructed mm and resolves the issue in this case.
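A hedged sketch of the new ordering in dup_mm() (the cleanup helper is a hypothetical placeholder; labels and error handling are simplified):

    /* Sketch: the semaphore now covers dup_mmap() *and* the cleanup of a
     * failed duplication, so uprobe's build_map_info() can never observe
     * the XA_ZERO_ENTRY markers left behind by an aborted __mt_dup(). */
    uprobe_start_dup_mmap();
    err = dup_mmap(mm, oldmm);
    if (err)
        do_failed_dup_cleanup(mm);      /* hypothetical placeholder */
    uprobe_end_dup_mmap();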
Link: https://lkml.kernel.org/r/[email protected] Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Lorenzo Stoakes <[email protected]> Reported-by: [email protected] Closes: https://lore.kernel.org/all/[email protected]/ Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kan Liang <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peng Zhang <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
0357ef03 |
| 28-Nov-2024 |
Amir Goldstein <[email protected]> |
fs: don't block write during exec on pre-content watched files
Commit 2a010c412853 ("fs: don't block i_writecount during exec") removed the legacy behavior of getting ETXTBSY on an attempt to open an executable file for write while it is being executed.
This commit was reverted because an application that depends on this legacy behavior was broken by the change.
We need to allow HSM writing into executable files while executed to fill their content on-the-fly.
To that end, disable the ETXTBSY legacy behavior for files that are watched by pre-content events.
This change is not expected to cause regressions with existing systems which do not have any pre-content event listeners.
Signed-off-by: Amir Goldstein <[email protected]> Acked-by: Christian Brauner <[email protected]> Signed-off-by: Jan Kara <[email protected]> Link: https://patch.msgid.link/[email protected]
|
| #
eb449bd9 |
| 22-Nov-2024 |
Suren Baghdasaryan <[email protected]> |
mm: convert mm_lock_seq to a proper seqcount
Convert mm_lock_seq to be seqcount_t and change all mmap_write_lock variants to increment it, in line with the usual seqcount usage pattern. This lets us check whether the mmap_lock is write-locked by checking the mm_lock_seq.sequence counter (odd=locked, even=unlocked). This will be used when implementing mmap_lock speculation functions. As a result, vm_lock_seq is also changed to be unsigned to match the type of mm_lock_seq.sequence.
Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
|
| #
3b832035 |
| 27-Nov-2024 |
Christian Brauner <[email protected]> |
Revert "fs: don't block i_writecount during exec"
This reverts commit 2a010c41285345da60cece35575b4e0af7e7bf44.
Rui Ueyama <[email protected]> writes:
> I'm the creator and the maintainer of the mold linker
> (https://github.com/rui314/mold). Recently, we discovered that mold
> started causing process crashes in certain situations due to a change
> in the Linux kernel. Here are the details:
>
> - In general, overwriting an existing file is much faster than
> creating an empty file and writing to it on Linux, so mold attempts to
> reuse an existing executable file if it exists.
>
> - If a program is running, opening the executable file for writing
> previously failed with ETXTBSY. If that happens, mold falls back to
> creating a new file.
>
> - However, the Linux kernel recently changed the behavior so that
> writing to an executable file is now always permitted
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a010c412853).
>
> That caused mold to write to an executable file even if there's a
> process running that file. Since changes to mmap'ed files are
> immediately visible to other processes, any processes running that
> file would almost certainly crash in a very mysterious way.
> Identifying the cause of these random crashes took us a few days.
>
> Rejecting writes to an executable file that is currently running is a
> well-known behavior, and Linux had operated that way for a very long
> time. So, I don’t believe relying on this behavior was our mistake;
> rather, I see this as a regression in the Linux kernel.
Quoting myself from commit 2a010c412853 ("fs: don't block i_writecount during exec")
> Yes, someone in userspace could potentially be relying on this. It's not
> completely out of the realm of possibility but let's find out if that's
> actually the case and not guess.
It seems we found out that someone is relying on this obscure behavior. So revert the change.
Link: https://github.com/rui314/mold/issues/1361 Link: https://lore.kernel.org/r/[email protected] Cc: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.12, v6.12-rc7 |
|
| #
69f032c9 |
| 05-Nov-2024 |
Thomas Gleixner <[email protected]> |
signal: Provide ignored_posix_timers list
To prepare for handling posix timer signals on sigaction(SIG_IGN) properly, add a list to task::signal.
This list will be used to queue posix timers so their signal can be requeued when SIG_IGN is lifted later.
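Conceptually (the exact list type and its placement are assumptions, not taken from the patch), the addition looks like:

    struct signal_struct {
        /* ... */
        /* posix timers whose signal was dropped because the signal was
         * SIG_IGN; they are requeued from here once SIG_IGN is lifted
         * (list type is an assumption for illustration) */
        struct hlist_head ignored_posix_timers;
        /* ... */
    };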
Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/all/[email protected]
|
|
Revision tags: v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1 |
|
| #
f2f48408 |
| 26-Sep-2024 |
Nanyong Sun <[email protected]> |
mm: move mm flags to mm_types.h
The mm flags now go far beyond the core-dump-related features. This patch moves the mm flags from linux/sched/coredump.h to linux/mm_types.h. Since linux/sched/coredump.h includes mm_types.h, the C files related to coredump do not need to change their header inclusion. In addition, the inclusion of sched/coredump.h can now be deleted from the C files that are irrelevant to core dump.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nanyong Sun <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
cd3f8467 |
| 24-Sep-2024 |
Lorenzo Stoakes <[email protected]> |
mm: refactor mm_access() to not return NULL
mm_access() can return NULL if the mm is not found, but this is handled the same as an error in all callers, with some translating this into an -ESRCH error.
Only proc_mem_open() returns NULL if no mm is found; however, in this case it is clearer and makes more sense to explicitly handle the error. Additionally, we take the opportunity to refactor the function to eliminate unnecessary nesting.
Simplify things by simply returning -ESRCH if no mm is found - this both eliminates confusing use of the IS_ERR_OR_NULL() macro, and simplifies callers which would return -ESRCH by returning this error directly.
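Caller-side, the simplification looks roughly like this (a sketch, not the exact diff; the ptrace mode is illustrative):

    /* before: both NULL and ERR_PTR() had to be handled */
    mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
    if (IS_ERR_OR_NULL(mm))
        return mm ? PTR_ERR(mm) : -ESRCH;

    /* after: mm_access() never returns NULL, only a valid mm or ERR_PTR() */
    mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
    if (IS_ERR(mm))
        return PTR_ERR(mm);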
[[email protected]: prefer neater pointer error comparison] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Lorenzo Stoakes <[email protected]> Suggested-by: Arnd Bergmann <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
985da552 |
| 15-Oct-2024 |
Lorenzo Stoakes <[email protected]> |
fork: only invoke khugepaged, ksm hooks if no error
There is no reason to invoke these hooks early against an mm that is in an incomplete state.
The change in commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") makes this more pertinent as we may be in a state where entries in the maple tree are not yet consistent.
Their placement early in dup_mmap() only appears to have been meaningful for early error checking, and since functionally it'd require a very small allocation to fail (in practice 'too small to fail') that'd only occur in the most dire circumstances, meaning the fork would fail or be OOM'd in any case.
Since both khugepaged and KSM tracking are there to provide optimisations to memory performance rather than critical functionality, it doesn't really matter all that much if, under such dire memory pressure, we fail to register an mm with these.
As a result, we follow the example of commit d2081b2bf819 ("mm: khugepaged: make khugepaged_enter() void function") and make ksm_fork() a void function also.
We only expose the mm to these functions once we are done with them and only if no error occurred in the fork operation.
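A simplified sketch of the new placement at the end of dup_mmap() (error paths omitted; the duplication call is a placeholder, not the real loop):

    /* Only register the fully-built mm with the optimization frameworks;
     * both hooks are (now) void, so there is nothing to unwind on their
     * account if fork fails earlier. */
    retval = duplicate_all_vmas(mm, oldmm);     /* placeholder */
    if (!retval) {
        ksm_fork(mm, oldmm);
        khugepaged_fork(mm, oldmm);
    }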
Link: https://lkml.kernel.org/r/e0cb8b840c9d1d5a6e84d4f8eff5f3f2022aa10c.1729014377.git.lorenzo.stoakes@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Lorenzo Stoakes <[email protected]> Reported-by: Jann Horn <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Jann Horn <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jan Kara <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|