|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6 |
|
| #
7ab02bd3 |
| 03-Mar-2025 |
Nysal Jan K.A. <[email protected]> |
sched/membarrier: Fix redundant load of membarrier_state
On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE is not selected, sync_core_before_usermode() is a no-op. In membarrier_mm_sync_core_before_usermode() the compiler does not eliminate redundant branches and load of mm->membarrier_state for this case as the atomic_read() cannot be optimized away.
Here's a snippet of the code generated for finish_task_switch() on powerpc prior to this change:
1b786c:   ld      r26,2624(r30)     # mm = rq->prev_mm;
  .......
1b78c8:   cmpdi   cr7,r26,0
1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
1b78d0:   ld      r9,2312(r13)      # current
1b78d4:   ld      r9,1888(r9)       # current->mm
1b78d8:   cmpd    cr7,r26,r9
1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
1b78e0:   hwsync
1b78e4:   cmplwi  cr7,r27,128
  .......
1b7a70:   lwz     r9,176(r26)       # atomic_read(&mm->membarrier_state)
1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
This was found while analyzing "perf c2c" reports on kernels prior to commit c1753fd02a00 ("mm: move mm_count into its own cache line") where mm_count was false sharing with membarrier_state.
There is a minor improvement in the size of finish_task_switch(). The following are results from bloat-o-meter for ppc64le:
GCC 7.5.0
---------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                         old     new   delta
finish_task_switch               884     852     -32

GCC 12.2.1
----------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                         old     new   delta
finish_task_switch.isra          852     820     -32

LLVM 17.0.6
-----------
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
Function                         old     new   delta
rt_mutex_schedule                120     104     -16
finish_task_switch               792     772     -20
Results on aarch64:
GCC 14.1.1
----------
add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
Function                         old     new   delta
get_nohz_timer_target            352     356      +4
e843419@0b02_0000d7e7_408          8       -      -8
e843419@01bb_000021d2_868          8       -      -8
finish_task_switch.isra          592     548     -44
Signed-off-by: Nysal Jan K.A. <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Mathieu Desnoyers <[email protected]> Reviewed-by: Michael Ellerman <[email protected]> Reviewed-by: Segher Boessenkool <[email protected]> Link: https://lore.kernel.org/r/[email protected]
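A minimal sketch of the shape of the fix, assuming the usual membarrier_mm_sync_core_before_usermode() structure in <linux/sched/mm.h>; the idea is to bail out before the atomic_read() when the architecture has no sync_core_before_usermode() work to do, so the compiler can drop the branches and the load entirely:

static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
{
        /*
         * Nothing to sync on architectures without
         * ARCH_HAS_SYNC_CORE_BEFORE_USERMODE: return before touching
         * mm->membarrier_state so no code is generated for the load.
         */
        if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
                return;
        if (current->mm != mm)
                return;
        if (likely(!(atomic_read(&mm->membarrier_state) &
                     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
                return;
        sync_core_before_usermode();
}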
|
|
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1 |
|
| #
9a8da05d |
| 26-Sep-2024 |
Michal Hocko <[email protected]> |
Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"
This reverts commit eab0af905bfc3e9c05da2ca163d76a1513159aa4.
There is no existing user of those flags. PF_MEMALLOC_NOWARN is dange
Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"
This reverts commit eab0af905bfc3e9c05da2ca163d76a1513159aa4.
There is no existing user of those flags. PF_MEMALLOC_NORECLAIM is dangerous because a nested allocation context can use GFP_NOFAIL, which could cause unexpected failure. Such code would be hard to maintain because it could be deeper in the call chain.
PF_MEMALLOC_NORECLAIM has been added even though it was pointed out [1] that such an allocation context is inherently unsafe if the context doesn't fully control all allocations called from this context.
While PF_MEMALLOC_NOWARN is not dangerous the way PF_MEMALLOC_NORECLAIM is, it doesn't have any users, and as Matthew has pointed out we are running out of those flags, so better to reclaim it while it has no real users.
[1] https://lore.kernel.org/all/ZcM0xtlKbAOFjv5n@tiehlicka/
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: James Morris <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Paul Moore <[email protected]> Cc: Serge E. Hallyn <[email protected]> Cc: Yafang Shao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
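An illustrative sketch of the hazard described above (not code from the revert itself); memalloc_flags_save()/restore() are the generic scope helpers, and the problem is a callee that relies on __GFP_NOFAIL while an outer scope silently strips direct reclaim:

static void scope_example(size_t size)
{
        unsigned int pflags;
        void *buf;

        pflags = memalloc_flags_save(PF_MEMALLOC_NORECLAIM); /* scope: GFP_NOWAIT semantics */

        /* ... much deeper in the call chain, unaware of the scope above ... */
        buf = kmalloc(size, GFP_KERNEL | __GFP_NOFAIL);      /* "must not fail" ... */
        /*
         * ... but the scope strips __GFP_DIRECT_RECLAIM, so the allocator can
         * no longer honour __GFP_NOFAIL and may unexpectedly return NULL.
         */
        kfree(buf);

        memalloc_flags_restore(pflags);
}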
|
|
Revision tags: v6.11, v6.11-rc7 |
|
| #
540e00a7 |
| 04-Sep-2024 |
Mark Brown <[email protected]> |
mm: pass vm_flags to generic_get_unmapped_area()
In preparation for using vm_flags to ensure guard pages for shadow stacks, supply them as an argument to generic_get_unmapped_area(). The only user outside of the core code is the PowerPC book3s64 implementation, which trivially wraps the generic implementation in the radix_enabled() case.
No functional changes.
Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-2-a46b8b6dc0ed@kernel.org Signed-off-by: Mark Brown <[email protected]> Acked-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Acked-by: Michael Ellerman <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David S. Miller <[email protected]> Cc: "Edgecombe, Rick P" <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: James Bottomley <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Naveen N Rao <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: WANG Xuerui <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
25d4054c |
| 04-Sep-2024 |
Mark Brown <[email protected]> |
mm: make arch_get_unmapped_area() take vm_flags by default
Patch series "mm: Care about shadow stack guard gap when getting an unmapped area", v2.
As covered in the commit log for c44357c2e76b ("x86/mm: care about shadow stack guard gap during placement") our current mmap() implementation does not take care to ensure that a new mapping isn't placed with existing mappings inside its own guard gaps. This is particularly important for shadow stacks, since if two shadow stacks end up getting placed adjacent to each other then they can overflow into each other, which weakens the protection offered by the feature.
On x86 there is a custom arch_get_unmapped_area() which was updated by the above commit to cover this case by specifying a start_gap for allocations with VM_SHADOW_STACK. Both arm64 and RISC-V have equivalent features and use the generic implementation of arch_get_unmapped_area() so let's make the equivalent change there so they also don't get shadow stack pages placed without guard pages. The arm64 and RISC-V shadow stack implementations are currently on the list:
https://lore.kernel.org/r/20240829-arm64-gcs-v12-0-42fec94743
https://lore.kernel.org/lkml/[email protected]/
Given the addition of the use of vm_flags in the generic implementation, we also simplify the set of possibilities that have to be dealt with in the core code by making arch_get_unmapped_area() take vm_flags as standard. This is a bit invasive since the prototype change touches quite a few architectures, but since the parameter is ignored the change is straightforward, and the simplification for the generic code seems worth it.
This patch (of 3):
When we introduced arch_get_unmapped_area_vmflags() in 961148704acd ("mm: introduce arch_get_unmapped_area_vmflags()") we did so as part of properly supporting guard pages for shadow stacks on x86_64, which uses a custom arch_get_unmapped_area(). Equivalent features are also present on both arm64 and RISC-V, both of which use the generic implementation of arch_get_unmapped_area() and will require equivalent modification there. Rather than continue to deal with having two versions of the functions let's bite the bullet and have all implementations of arch_get_unmapped_area() take vm_flags as a parameter.
The new parameter is currently ignored by all implementations other than x86. The only caller that doesn't have vm_flags available is mm_get_unmapped_area(); as with the x86 implementation and the wrapper used on other architectures, it is modified to supply no flags.
No functional changes.
Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-0-a46b8b6dc0ed@kernel.org Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-1-a46b8b6dc0ed@kernel.org Signed-off-by: Mark Brown <[email protected]> Acked-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Acked-by: Helge Deller <[email protected]> [parisc] Cc: Alexander Gordeev <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David S. Miller <[email protected]> Cc: "Edgecombe, Rick P" <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: James Bottomley <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naveen N Rao <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: WANG Xuerui <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
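A hedged sketch of the standardized shape described above; the exact parameter names are illustrative:

/* All architectures now implement this signature; non-x86 ones ignore vm_flags. */
unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
                                     unsigned long len, unsigned long pgoff,
                                     unsigned long flags, vm_flags_t vm_flags);

/* Callers without a VMA in hand, such as mm_get_unmapped_area(), pass no flags: */
addr = arch_get_unmapped_area(filp, addr, len, pgoff, flags, 0);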
|
|
Revision tags: v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2 |
|
| #
96114870 |
| 26-Mar-2024 |
Rick Edgecombe <[email protected]> |
mm: introduce arch_get_unmapped_area_vmflags()
When memory is being placed, mmap() will take care to respect the guard gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap() needs to consider two things:
 1. That the new mapping isn't placed in any existing mapping's guard gaps.
 2. That the new mapping isn't placed such that any existing mappings are not in *its* guard gaps.
The longstanding behavior of mmap() is to ensure 1, but not take any care around 2. So for example, if there is a PAGE_SIZE free area, and a mmap() with a PAGE_SIZE size, and a type that has a guard gap is being placed, mmap() may place the shadow stack in the PAGE_SIZE free area. Then the mapping that is supposed to have a guard gap will not have a gap to the adjacent VMA.
In order to take the start gap into account, the maple tree search needs to know the size of start gap the new mapping will need. The call chain from do_mmap() to the actual maple tree search looks like this:
do_mmap(size, vm_flags, map_flags, ..)
        mm/mmap.c:get_unmapped_area(size, map_flags, ...)
                arch_get_unmapped_area(size, map_flags, ...)
                        vm_unmapped_area(struct vm_unmapped_area_info)
One option would be to add another MAP_ flag to mean a one page start gap (as is the case for shadow stack), but this consumes a flag unnecessarily. Another option could be to simply increase the size passed in do_mmap() by the start gap size, and adjust after the fact, but this will interfere with the alignment requirements passed in struct vm_unmapped_area_info, which are unknown to mmap.c. Instead, introduce variants of arch_get_unmapped_area/_topdown() that take vm_flags. In future changes, these variants can be used in mmap.c:get_unmapped_area() to allow the vm_flags to be passed through to vm_unmapped_area(), while preserving the normal arch_get_unmapped_area/_topdown() for the existing callers.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rick Edgecombe <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Deepak Gupta <[email protected]> Cc: Guo Ren <[email protected]> Cc: Helge Deller <[email protected]> Cc: H. Peter Anvin (Intel) <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Kees Cook <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Mark Brown <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
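A sketch of where the series is headed, assuming a start_gap field in struct vm_unmapped_area_info (added by later patches in the series) and a helper name, stack_guard_placement(), used here only for illustration:

static unsigned long stack_guard_placement(vm_flags_t vm_flags)
{
        if (vm_flags & VM_SHADOW_STACK)
                return PAGE_SIZE;       /* keep one page free in front of a shadow stack */
        return 0;
}

unsigned long
arch_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
                               unsigned long len, unsigned long pgoff,
                               unsigned long flags, vm_flags_t vm_flags)
{
        struct vm_unmapped_area_info info = {};

        info.length = len;
        info.start_gap = stack_guard_placement(vm_flags);
        /* low_limit/high_limit setup as in the non-vmflags version ... */
        return vm_unmapped_area(&info);
}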
|
| #
529ce23a |
| 26-Mar-2024 |
Rick Edgecombe <[email protected]> |
mm: switch mm->get_unmapped_area() to a flag
The mm_struct contains a function pointer *get_unmapped_area(), which is set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() during the initialization of the mm.
Since the function pointer only ever points to two functions that are named the same across all architectures, a function pointer is not really required. In addition, future changes will want to add versions of the functions that take additional arguments. So to save a pointer's worth of bytes in mm_struct, and to prevent adding additional function pointers to mm_struct in future changes, remove it and keep the information about which get_unmapped_area() to use in a flag.
Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing behavior mm->get_unmapped_area() would get copied to the new mm in dup_mm(), so not clobbering the flag preserves the existing behavior around inheriting the topdown-ness.
Introduce a helper, mm_get_unmapped_area(), to easily convert code that refers to the old function pointer to instead select and call either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the flag. Then drop the mm->get_unmapped_area() function pointer. Leave the get_unmapped_area() pointer in struct file_operations alone. The main purpose of this change is to reorganize in preparation for future changes, but it also converts the calls of mm->get_unmapped_area() from indirect branches into direct ones.
The stress-ng bigheap benchmark calls realloc a lot, which calls through get_unmapped_area() in the kernel. On x86, the change yielded a ~1% improvement there on a retpoline config.
In testing a few x86 configs, removing the pointer unfortunately didn't result in any actual size reductions in the compiled layout of mm_struct. But depending on compiler or arch alignment requirements, the change could shrink the size of mm_struct.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rick Edgecombe <[email protected]> Acked-by: Dave Hansen <[email protected]> Acked-by: Liam R. Howlett <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Cc: Dan Williams <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Deepak Gupta <[email protected]> Cc: Guo Ren <[email protected]> Cc: Helge Deller <[email protected]> Cc: H. Peter Anvin (Intel) <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Kees Cook <[email protected]> Cc: Mark Brown <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
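A minimal sketch of the helper described above; MMF_TOPDOWN is the flag name used here for illustration:

unsigned long
mm_get_unmapped_area(struct mm_struct *mm, struct file *filp,
                     unsigned long addr, unsigned long len,
                     unsigned long pgoff, unsigned long flags)
{
        /* One mm-wide flag replaces the old per-mm function pointer. */
        if (test_bit(MMF_TOPDOWN, &mm->flags))
                return arch_get_unmapped_area_topdown(filp, addr, len, pgoff, flags);
        return arch_get_unmapped_area(filp, addr, len, pgoff, flags);
}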
|
|
Revision tags: v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2 |
|
| #
eab0af90 |
| 26-Jan-2024 |
Kent Overstreet <[email protected]> |
mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN
Introduce PF_MEMALLOC_* equivalents of some GFP_ flags:
PF_MEMALLOC_NORECLAIM -> GFP_NOWAIT
PF_MEMALLOC_NOWARN    -> __GFP_NOWARN
Cc: Vlastimil Babka <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: [email protected] Signed-off-by: Kent Overstreet <[email protected]>
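A hedged sketch of the PF-to-GFP translation this (later reverted) change implies, following the pattern of the existing PF_MEMALLOC_NOIO/NOFS handling in current_gfp_context():

static inline gfp_t current_gfp_context(gfp_t flags)
{
        unsigned int pflags = READ_ONCE(current->flags);

        /* existing PF_MEMALLOC_NOIO / PF_MEMALLOC_NOFS handling elided */
        if (pflags & PF_MEMALLOC_NORECLAIM)
                flags &= ~__GFP_DIRECT_RECLAIM; /* behave like GFP_NOWAIT */
        if (pflags & PF_MEMALLOC_NOWARN)
                flags |= __GFP_NOWARN;

        return flags;
}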
|
| #
3f6d5e6a |
| 26-Jan-2024 |
Kent Overstreet <[email protected]> |
mm: introduce memalloc_flags_{save,restore}
Our proliferation of memalloc_*_{save,restore} APIs is getting a bit silly, this adds a generic version and converts the existing save/restore functions to wrappers.
Signed-off-by: Kent Overstreet <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: [email protected] Acked-by: Vlastimil Babka <[email protected]>
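A minimal sketch of the generic pair and of how an existing helper becomes a thin wrapper, assuming the usual current->flags scope convention:

static inline unsigned int memalloc_flags_save(unsigned int flags)
{
        /* remember which of the requested bits were not already set */
        unsigned int oldflags = ~current->flags & flags;

        current->flags |= flags;
        return oldflags;
}

static inline void memalloc_flags_restore(unsigned int flags)
{
        /* clear only the bits that the matching save actually set */
        current->flags &= ~flags;
}

/* existing helpers then become thin wrappers, e.g.: */
static inline unsigned int memalloc_noio_save(void)
{
        return memalloc_flags_save(PF_MEMALLOC_NOIO);
}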
|
| #
cfb837e8 |
| 12-Feb-2024 |
Vlastimil Babka <[email protected]> |
mm: document memalloc_noreclaim_save() and memalloc_pin_save()
The memalloc_noreclaim_save() function currently has no documentation comment, so the implications of its usage are not obvious. Namely that it not only prevents entering reclaim (as the name suggests), but also allows using all memory reserves and thus should be only used in contexts that are allocating memory to free memory. This may lead to new improper usages being added.
Thus add a documenting comment, based on the description of __GFP_MEMALLOC. While at it, also document memalloc_pin_save() so that all the memalloc_ scopes are documented. For those already documented, add missing Return: descriptions, and mark Context: description per kernel-docs style guide.
In the comments describing the relevant PF_MEMALLOC flags, refer to their scope setting functions.
[[email protected]: fix issues that Mike pointed out] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Pasha Tatashin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
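A usage sketch reflecting the documented constraint: only contexts that allocate in order to free memory should open this scope, since it also grants access to memory reserves. The callee name is hypothetical.

unsigned int noreclaim_flag;

noreclaim_flag = memalloc_noreclaim_save();
/*
 * Allocate only the small amounts needed to make forward progress in
 * freeing memory; reclaim recursion is prevented and reserves may be used.
 */
do_work_that_frees_memory();    /* hypothetical callee */
memalloc_noreclaim_restore(noreclaim_flag);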
|
|
Revision tags: v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7 |
|
| #
e86828e5 |
| 19-Oct-2023 |
Roman Gushchin <[email protected]> |
mm: kmem: scoped objcg protection
Switch to a scope-based protection of the objcg pointer on slab/kmem allocation paths. Instead of using the get_() semantics in the pre-allocation hook and putting the reference afterwards, let's rely on the fact that objcg is pinned by the scope.
It's possible because:
1) if the objcg is received from the current task struct, the task is keeping a reference to the objcg.
2) if the objcg is received from an active memcg (remote charging), the memcg is pinned by the scope and has a reference to the corresponding objcg.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin (Cruise) <[email protected]> Tested-by: Naresh Kamboju <[email protected]> Acked-by: Shakeel Butt <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3 |
|
| #
223baf9d |
| 20-Apr-2023 |
Mathieu Desnoyers <[email protected]> |
sched: Fix performance regression introduced by mm_cid
Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL sysbench regression reported by Aaron Lu.
Keep track of the currently allocated mm_cid for each mm/cpu rather than freeing them immediately on context switch. This eliminates most atomic operations when context switching back and forth between threads belonging to different memory spaces in multi-threaded scenarios (many processes, each with many threads). The per-mm/per-cpu mm_cid values are serialized by their respective runqueue locks.
Thread migration is handled by introducing an invocation of sched_mm_cid_migrate_to() (with the destination runqueue lock held) in activate_task() for migrating tasks. If the destination cpu's mm_cid is unset, and if the source runqueue is not actively using its mm_cid, then the source cpu's mm_cid is moved to the destination cpu on migration.
Introduce a task-work executed periodically, similarly to NUMA work, which delays reclaim of cid values when they are unused for a period of time.
Keep track of the allocation time for each per-cpu cid, and let the task work clear them when they are observed to be older than SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all mm_cids which are greater or equal to the Hamming weight of the mm cidmask to keep concurrency ids compact.
Because we want to ensure the mm_cid converges towards the smaller values as migrations happen, the prior optimization that was done when context switching between threads belonging to the same mm is removed, because it could delay the lazy release of the destination runqueue mm_cid after it has been replaced by a migration. Removing this prior optimization is not an issue performance-wise because the introduced per-mm/per-cpu mm_cid tracking also covers this more specific case.
Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID") Reported-by: Aaron Lu <[email protected]> Signed-off-by: Mathieu Desnoyers <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Aaron Lu <[email protected]> Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
|
|
Revision tags: v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4 |
|
| #
cd389115 |
| 22-Mar-2023 |
Jacob Pan <[email protected]> |
iommu/sva: Move PASID helpers to sva code
In preparation for removing the IOASID infrastructure, move PASID management under the SVA code and decouple the mm code from IOASID.
Reviewed-by: Jason Gunthorpe <[email protected]> Signed-off-by: Jacob Pan <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
|
|
Revision tags: v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7 |
|
| #
88e3009b |
| 03-Feb-2023 |
Nicholas Piggin <[email protected]> |
lazy tlb: allow lazy tlb mm refcounting to be configurable
Add CONFIG_MMU_LAZY_TLB_REFCOUNT which enables refcounting of the lazy tlb mm when it is context switched. This can be disabled by architectures that don't require this refcounting if they clean up lazy tlb mms when the last refcount is dropped. Currently this is always enabled, so the patch introduces no functional change.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nicholas Piggin <[email protected]> Acked-by: Linus Torvalds <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
aa464ba9 |
| 03-Feb-2023 |
Nicholas Piggin <[email protected]> |
lazy tlb: introduce lazy tlb mm refcount helper functions
Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. This makes the lazy tlb mm references more obvious, and allows the refcounting scheme to be modified in later changes. There is no functional change with this patch.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nicholas Piggin <[email protected]> Acked-by: Linus Torvalds <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
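A minimal sketch of the new annotations, which at this point are plain wrappers (no functional change), assuming the usual mmgrab()/mmdrop() pair:

/* Grab/drop a reference on an mm that is only used as the lazy tlb mm. */
static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
{
        mmgrab(mm);
}

static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
{
        mmdrop(mm);
}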
|
| #
400b9b93 |
| 12-Mar-2023 |
Kirill A. Shutemov <[email protected]> |
iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
The kernel has a few users of pasid_valid(), and all but one check whether the process has a PASID allocated. The helper takes ioasid_t as the input.
Replace the helper with mm_valid_pasid(), which takes mm_struct as the argument. The only call that checks a PASID not tied to an mm_struct is open-coded now.
This is a preparatory patch. It helps avoid ifdeffery: no need to dereference mm->pasid in generic code to check if the process has a PASID.
Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/all/20230312112612.31869-11-kirill.shutemov%40linux.intel.com
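A minimal sketch of the replacement helper, assuming the mm->pasid / INVALID_IOASID convention used at the time:

static inline bool mm_valid_pasid(struct mm_struct *mm)
{
        return mm->pasid != INVALID_IOASID;
}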
|
|
Revision tags: v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5 |
|
| #
ee65728e |
| 27-Jun-2022 |
Mike Rapoport <[email protected]> |
docs: rename Documentation/vm to Documentation/mm
so it will be consistent with the code's mm directory and with Documentation/admin-guide/mm, and won't be confused with virtual machines.
Signed-off-by: Mike Rapoport <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Tested-by: Ira Weiny <[email protected]> Acked-by: Jonathan Corbet <[email protected]> Acked-by: Wu XiangCheng <[email protected]>
|
|
Revision tags: v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2 |
|
| #
2cb4de08 |
| 09-Apr-2022 |
Christophe Leroy <[email protected]> |
mm: Add len and flags parameters to arch_get_mmap_end()
Powerpc needs flags and len to make a decision in arch_get_mmap_end().
So add them as parameters to arch_get_mmap_end().
Signed-off-by: Christophe Leroy <[email protected]> Acked-by: Catalin Marinas <[email protected]> Acked-by: Andrew Morton <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/b556daabe7d2bdb2361c4d6130280da7c1ba2c14.1649523076.git.christophe.leroy@csgroup.eu
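A sketch of the default definition with the new parameters (ignored by the generic fallback, consumed by the powerpc override):

#ifndef arch_get_mmap_end
#define arch_get_mmap_end(addr, len, flags)     (TASK_SIZE)
#endif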
|
| #
4b439e25 |
| 09-Apr-2022 |
Christophe Leroy <[email protected]> |
mm, hugetlbfs: Allow an arch to always use generic versions of get_unmapped_area functions
Unlike most architectures, powerpc can only define at runtime if it is going to use the generic arch_get_unmapped_area() or not.
Today, powerpc has a copy of the generic arch_get_unmapped_area() because when selecting HAVE_ARCH_UNMAPPED_AREA the generic arch_get_unmapped_area() is not available.
Rename it generic_get_unmapped_area() and make it independent of HAVE_ARCH_UNMAPPED_AREA.
Do the same for arch_get_unmapped_area_topdown() versus HAVE_ARCH_UNMAPPED_AREA_TOPDOWN.
Do the same for hugetlb_get_unmapped_area() versus HAVE_ARCH_HUGETLB_UNMAPPED_AREA.
Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Nicholas Piggin <[email protected]> Acked-by: Andrew Morton <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/77f9d3e592f1c8511df9381aa1c4e754651da4d1.1649523076.git.christophe.leroy@csgroup.eu
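A sketch of the resulting split, assuming the usual prototypes: the generic body is always built, and the arch_ name only falls back to it when the architecture does not provide its own, which lets powerpc call the generic version at runtime from its own implementation.

/* always available, regardless of HAVE_ARCH_UNMAPPED_AREA */
unsigned long
generic_get_unmapped_area(struct file *filp, unsigned long addr,
                          unsigned long len, unsigned long pgoff,
                          unsigned long flags);

#ifndef HAVE_ARCH_UNMAPPED_AREA
unsigned long
arch_get_unmapped_area(struct file *filp, unsigned long addr,
                       unsigned long len, unsigned long pgoff,
                       unsigned long flags)
{
        return generic_get_unmapped_area(filp, addr, len, pgoff, flags);
}
#endif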
|
| #
5f24d5a5 |
| 21-Apr-2022 |
Christophe Leroy <[email protected]> |
mm, hugetlb: allow for "high" userspace addresses
This is a fix for commit f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses") for hugetlb.
This patch adds support for "high" userspace addresses that are optionally supported on the system and have to be requested via a hint mechanism ("high" addr parameter to mmap).
Architectures such as powerpc and x86 achieve this by making changes to their architectural versions of hugetlb_get_unmapped_area() function. However, arm64 uses the generic version of that function.
So take into account arch_get_mmap_base() and arch_get_mmap_end() in hugetlb_get_unmapped_area(). To allow that, move those two macros out of mm/mmap.c into include/linux/sched/mm.h
If these macros are not defined in architectural code then they default to (TASK_SIZE) and (base) so should not introduce any behavioural changes to architectures that do not define them.
For the time being, only ARM64 is affected by this change.
Catalin (ARM64) said "We should have fixed hugetlb_get_unmapped_area() as well when we added support for 52-bit VA. The reason for commit f6795053dac8 was to prevent normal mmap() from returning addresses above 48-bit by default as some user-space had hard assumptions about this.
It's a slight ABI change if you do this for hugetlb_get_unmapped_area() but I doubt anyone would notice. It's more likely that the current behaviour would cause issues, so I'd rather have them consistent.
Basically when arm64 gained support for 52-bit addresses we did not want user-space calling mmap() to suddenly get such high addresses, otherwise we could have inadvertently broken some programs (similar behaviour to x86 here). Hence we added commit f6795053dac8. But we missed hugetlbfs, which could still get such high mmap() addresses. So in theory that's a potential regression that should have been addressed at the same time as commit f6795053dac8 (and before arm64 enabled 52-bit addresses)"
Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu Fixes: f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses") Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Cc: Steve Capper <[email protected]> Cc: Will Deacon <[email protected]> Cc: <[email protected]> [5.0.x] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|
|
Revision tags: v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4 |
|
| #
701fac40 |
| 07-Feb-2022 |
Fenghua Yu <[email protected]> |
iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit
PASIDs are process-wide. It was attempted to use refcounted PASIDs to free them when the last thread drops the refcount. This turned out to be complex and error prone. Given the fact that the PASID space is 20 bits, which allows up to 1M processes to have a PASID associated concurrently, PASID resource exhaustion is not a realistic concern.
Therefore, it was decided to simplify the approach and stick with lazy, on-demand PASID allocation, but drop the eager free approach and make an allocated PASID's lifetime bound to the lifetime of the process.
Get rid of the refcounting mechanisms and replace/rename the interfaces to reflect this new approach.
[ bp: Massage commit message. ]
Suggested-by: Dave Hansen <[email protected]> Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Reviewed-by: Jacob Pan <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Joerg Roedel <[email protected]> Link: https://lore.kernel.org/r/[email protected]
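A sketch of what the new lifetime looks like in code, assuming the pasid_valid()/ioasid_free() helpers of that era; the drop happens at mm teardown rather than when the last thread releases a refcount:

static inline void mm_pasid_drop(struct mm_struct *mm)
{
        /* called from mm teardown; frees the PASID the mm held for its lifetime */
        if (pasid_valid(mm->pasid)) {
                ioasid_free(mm->pasid);
                mm->pasid = INVALID_IOASID;
        }
}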
|
| #
a6cbd440 |
| 07-Feb-2022 |
Fenghua Yu <[email protected]> |
kernel/fork: Initialize mm's PASID
A new mm doesn't have a PASID yet when it's created. Initialize the mm's PASID on fork() or for init_mm to INVALID_IOASID (-1).
INIT_PASID (0) is reserved for the kernel legacy DMA PASID. It cannot be allocated to a user process. Initializing the process's PASID to 0 may cause confusion as to why the process appears to use the reserved kernel legacy DMA PASID. Initializing the PASID to INVALID_IOASID (-1) explicitly indicates that the process doesn't have a valid PASID yet.
Even though the only user of mm_pasid_init() is in fork.c, define it in <linux/sched/mm.h> as the first of three mm/pasid life cycle functions (init/set/drop) to keep these all together.
Suggested-by: Dave Hansen <[email protected]> Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
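A minimal sketch of the initializer placed in <linux/sched/mm.h> as described above:

/* Called on fork() and for init_mm: a new mm has no PASID yet. */
static inline void mm_pasid_init(struct mm_struct *mm)
{
        mm->pasid = INVALID_IOASID;
}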
|
|
Revision tags: v5.17-rc3, v5.17-rc2, v5.17-rc1 |
|
| #
4034247a |
| 14-Jan-2022 |
NeilBrown <[email protected]> |
mm: introduce memalloc_retry_wait()
Various places in the kernel - largely in filesystems - respond to a memory allocation failure by looping around and re-trying. Some of these cannot conveniently use __GFP_NOFAIL, for reasons such as:
 - a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
 - a need to check for the process being signalled between failures
 - the possibility that other recovery actions could be performed
 - the allocation is quite deep in support code, and passing down an extra flag to say if __GFP_NOFAIL is wanted would be clumsy.
Many of these currently use congestion_wait() which (in almost all cases) simply waits the given timeout - congestion isn't tracked for most devices.
It isn't clear what the best delay is for loops, but it is clear that the various filesystems shouldn't be responsible for choosing a timeout.
This patch introduces memalloc_retry_wait(), which takes on that responsibility. Code that wants to retry a memory allocation can call this function, passing the GFP flags that were used. It will wait for however long is appropriate.
For now, it only considers __GFP_NORETRY and whatever gfpflags_allow_blocking() tests. If blocking is allowed without __GFP_NORETRY, then alloc_page either made some reclaim progress, or waited for a while, before failing. So there is no need for much further waiting. memalloc_retry_wait() will wait until the current jiffie ends. If this condition is not met, then alloc_page() won't have waited much if at all. In that case memalloc_retry_wait() waits about 200ms. This is the delay that most current loops use.
linux/sched/mm.h needs to be included in some files now, but linux/backing-dev.h does not.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: NeilBrown <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Chao Yu <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Chuck Lever <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
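A minimal sketch of the policy described above; the branch structure and timings follow the description, while the exact constants and final helper body may differ:

void memalloc_retry_wait(gfp_t gfp_flags)
{
        gfp_flags = current_gfp_context(gfp_flags);

        if (gfpflags_allow_blocking(gfp_flags) &&
            !(gfp_flags & __GFP_NORETRY))
                /* the failed allocation already reclaimed or slept: just finish this jiffie */
                schedule_timeout_uninterruptible(1);
        else
                /* the allocation could not wait at all: back off for roughly 200ms */
                schedule_timeout_uninterruptible(HZ / 5);
}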
|
|
Revision tags: v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4 |
|
| #
8d491de6 |
| 28-Sep-2021 |
Thomas Gleixner <[email protected]> |
sched: Move mmdrop to RCU on RT
mmdrop() is invoked from finish_task_switch() by the incoming task to drop the mm which was handed over by the previous task. mmdrop() can be quite expensive, which prevents an incoming real-time task from getting useful work done.
Provide mmdrop_sched() which maps to mmdrop() on !RT kernels. On RT kernels it delegates the eventually required invocation of __mmdrop() to RCU.
Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
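A hedged sketch of the shape of the change; the rcu_head field (delayed_drop) and callback name are as recalled from the patch and shown for illustration only:

#ifdef CONFIG_PREEMPT_RT
/* RCU callback: do the expensive teardown outside the context-switch path. */
static void __mmdrop_delayed(struct rcu_head *rhp)
{
        struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop);

        __mmdrop(mm);
}

static inline void mmdrop_sched(struct mm_struct *mm)
{
        /* Provides a full memory barrier, see mmdrop(). */
        if (atomic_dec_and_test(&mm->mm_count))
                call_rcu(&mm->delayed_drop, __mmdrop_delayed);
}
#else
static inline void mmdrop_sched(struct mm_struct *mm)
{
        mmdrop(mm);
}
#endif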
|
|
Revision tags: v5.15-rc3, v5.15-rc2, v5.15-rc1 |
|
| #
55a68c82 |
| 02-Sep-2021 |
Vasily Averin <[email protected]> |
memcg: replace in_interrupt() by !in_task() in active_memcg()
set_active_memcg() uses in_interrupt() check to select proper storage for cgroup: pointer on task struct or per-cpu pointer.
It isn't fully correct: the obsolete in_interrupt() check is also true for tasks that merely have BH disabled. It's better to use '!in_task()' instead.
Link: https://lkml.org/lkml/2021/7/26/487 Link: https://lkml.kernel.org/r/[email protected] Fixes: 37d5985c003d ("mm: kmem: prepare remote memcg charging infra for interrupt contexts") Signed-off-by: Vasily Averin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
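A sketch of the storage selection after this change (the previous code tested in_interrupt() here):

static inline struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg)
{
        struct mem_cgroup *old;

        if (!in_task()) {                       /* real interrupt context */
                old = this_cpu_read(int_active_memcg);
                this_cpu_write(int_active_memcg, memcg);
        } else {                                /* task context, even with BH disabled */
                old = current->active_memcg;
                current->active_memcg = memcg;
        }

        return old;
}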
|
| #
4f3eaf45 |
| 02-Sep-2021 |
Matthew Wilcox (Oracle) <[email protected]> |
mm: report a more useful address for reclaim acquisition
A recent lockdep report included these lines:
[ 96.177910] 3 locks held by containerd/770:
[ 96.177934]  #0: ffff88810815ea28 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x115/0x770
[ 96.177999]  #1: ffffffff82915020 (rcu_read_lock){....}-{1:2}, at: get_swap_device+0x33/0x140
[ 96.178057]  #2: ffffffff82955ba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
While it was not useful to that bug report to know where the reclaim lock had been acquired, it might be useful under other circumstances. Allow the caller of __fs_reclaim_acquire to specify the instruction pointer to use.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Omar Sandoval <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Boqun Feng <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
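A hedged sketch of the interface change: __fs_reclaim_acquire() grows an ip argument that is handed to lockdep, with the existing wrapper passing its own return address; the gfp-mask checks in the wrapper are elided here.

void __fs_reclaim_acquire(unsigned long ip)
{
        lock_acquire_exclusive(&__fs_reclaim_map, 0, 0, NULL, ip);
}

void fs_reclaim_acquire(gfp_t gfp_mask)
{
        /* __GFP_DIRECT_RECLAIM / __GFP_FS checks elided */
        __fs_reclaim_acquire(_RET_IP_); /* default: report the immediate caller */
}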
|