|
Revision tags: v6.15, v6.15-rc7 |
|
| #
70d1eb03 |
| 15-May-2025 |
Kees Cook <[email protected]> |
mm: vmalloc: only zero-init on vrealloc shrink
The common case is to grow reallocations, and since init_on_alloc will have already zeroed the whole allocation, we only need to zero when shrinking the allocation.
Link: https://lkml.kernel.org/r/[email protected] Fixes: a0309faf1cb0 ("mm: vmalloc: support more granular vrealloc() sizing") Signed-off-by: Kees Cook <[email protected]> Tested-by: Pawan Gupta <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Eduard Zingerman <[email protected]> Cc: "Erhard F." <[email protected]> Cc: Shung-Hsi Yu <[email protected]> Cc: "Uladzislau Rezki (Sony)" <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
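In rough terms the change amounts to the following hedged sketch (variable names such as old_size/p are illustrative, not the literal upstream diff): when shrinking in place, zero the tail beyond the new size; when growing, skip the memset because init_on_alloc (or __GFP_ZERO) already zeroed the whole area at allocation time.

    /* hedged sketch, not the exact kernel code */
    if (want_init_on_alloc(flags) && size < old_size)
            memset((char *)p + size, 0, old_size - size);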
|
| #
f7a35a3c |
| 15-May-2025 |
Kees Cook <[email protected]> |
mm: vmalloc: actually use the in-place vrealloc region
Patch series "mm: vmalloc: Actually use the in-place vrealloc region".
This fixes a performance regression[1] with vrealloc().
The refactoring to not build a new vmalloc region only actually worked when shrinking. Actually return the resized area when it grows. Ugh.
Link: https://lkml.kernel.org/r/[email protected] Fixes: a0309faf1cb0 ("mm: vmalloc: support more granular vrealloc() sizing") Signed-off-by: Kees Cook <[email protected]> Reported-by: Shung-Hsi Yu <[email protected]> Closes: https://lore.kernel.org/all/20250515-bpf-verifier-slowdown-vwo2meju4cgp2su5ckj@6gi6ssxbnfqg [1] Tested-by: Eduard Zingerman <[email protected]> Tested-by: Pawan Gupta <[email protected]> Tested-by: Shung-Hsi Yu <[email protected]> Reviewed-by: "Uladzislau Rezki (Sony)" <[email protected]> Reviewed-by: Danilo Krummrich <[email protected]> Cc: "Erhard F." <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
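A hedged sketch of the grow-in-place path after the fix (names such as alloced_size are illustrative); the essential change is the return that hands back the existing, resized region instead of falling through to a fresh allocation:

    if (size <= alloced_size) {
            /* grow within the existing area */
            kasan_unpoison_vmalloc((char *)p + old_size, size - old_size,
                                   KASAN_VMALLOC_PROT_NORMAL);
            vm->requested_size = size;
            return (void *)p;       /* previously missing */
    }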
|
|
Revision tags: v6.15-rc6, v6.15-rc5, v6.15-rc4 |
|
| #
a0309faf |
| 26-Apr-2025 |
Kees Cook <[email protected]> |
mm: vmalloc: support more granular vrealloc() sizing
Introduce struct vm_struct::requested_size so that the requested (re)allocation size is retained separately from the allocated area size. This means that KASAN will correctly poison the requested spans of bytes. It also means we can grow the usable portion of an allocation when the existing area's underlying allocation can already accommodate it.
Link: https://lkml.kernel.org/r/[email protected] Fixes: 3ddc2fefe6f3 ("mm: vmalloc: implement vrealloc()") Signed-off-by: Kees Cook <[email protected]> Reported-by: Erhard Furtner <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Reviewed-by: Danilo Krummrich <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Uladzislau Rezki (Sony)" <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
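Conceptually (field name per the commit; the surrounding code is an illustrative sketch, not the full structure):

    struct vm_struct {
            /* ... existing fields ... */
            unsigned long   size;             /* size of the vmalloc area   */
            unsigned long   requested_size;   /* bytes the caller asked for */
    };

    /* KASAN can then poison exactly the unused tail of the area
     * (addr being the start of the area): */
    kasan_poison_vmalloc(addr + requested_size, size - requested_size);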
|
|
Revision tags: v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6 |
|
| #
f0e11a99 |
| 06-Mar-2025 |
Liu Ye <[email protected]> |
mm/vmalloc: refactor __vmalloc_node_range_noprof()
According to the code logic, the first parameter of the sub-function __get_vm_area_node() should be size instead of real_size.
Then in __get_vm_area_node(), the size will be aligned, so the redundant alignment operation is deleted.
The use of the real_size variable causes code redundancy, so it is removed to simplify the code.
The 'real' prefix generally indicates the adjusted value of a parameter, but according to the code logic this variable holds the original value, so renaming it to original_align is recommended.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liu Ye <[email protected]> Reviewed-by: "Uladzislau Rezki (Sony)" <[email protected]> Cc: Christop Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.14-rc5 |
|
| #
3685024e |
| 26-Feb-2025 |
Ryan Roberts <[email protected]> |
mm: don't skip arch_sync_kernel_mappings() in error paths
Fix callers that previously skipped calling arch_sync_kernel_mappings() if an error occurred during a pgtable update. The call is still required to sync any pgtable updates that may have occurred prior to hitting the error condition.
These are theoretical bugs discovered during code review.
Link: https://lkml.kernel.org/r/[email protected] Fixes: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified") Fixes: 0c95cba49255 ("mm: apply_to_pte_range warn and fail if a large pte is encountered") Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Cc: Christop Hellwig <[email protected]> Cc: "Uladzislau Rezki (Sony)" <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
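The pattern being fixed looks roughly like this (do_pgtable_updates() is a made-up placeholder for the vmap/apply_to_range helpers the patch touches):

    static int map_range_sketch(unsigned long start, unsigned long end)
    {
            pgtbl_mod_mask mask = 0;
            int err;

            err = do_pgtable_updates(start, end, &mask);  /* may fail part-way */

            /* sync the levels that were already modified, even when err != 0 */
            if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
                    arch_sync_kernel_mappings(start, end);

            return err;
    }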
|
|
Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5 |
|
| #
6bf9b5b4 |
| 23-Dec-2024 |
Luiz Capitulino <[email protected]> |
mm: alloc_pages_bulk: rename API
The previous commit removed the page_list argument from alloc_pages_bulk_noprof() along with the alloc_pages_bulk_list() function.
Now that only the *_array() flavour of the API remains, we can do the following renaming (along with the _noprof() ones):
  alloc_pages_bulk_array            -> alloc_pages_bulk
  alloc_pages_bulk_array_mempolicy  -> alloc_pages_bulk_mempolicy
  alloc_pages_bulk_array_node       -> alloc_pages_bulk_node
Link: https://lkml.kernel.org/r/275a3bbc0be20fbe9002297d60045e67ab3d4ada.1734991165.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Yunsheng Lin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.13-rc4, v6.13-rc3 |
|
| #
a2e740e2 |
| 11-Dec-2024 |
Matthew Wilcox (Oracle) <[email protected]> |
vmalloc: fix accounting with i915
If the caller of vmap() specifies VM_MAP_PUT_PAGES (currently only the i915 driver), we will decrement nr_vmalloc_pages and MEMCG_VMALLOC in vfree(). These counters are incremented by vmalloc() but not by vmap() so this will cause an underflow. Check the VM_MAP_PUT_PAGES flag before decrementing either counter.
Link: https://lkml.kernel.org/r/[email protected] Fixes: b944afc9d64d ("mm: add a VM_MAP_PUT_PAGES flag for vmap") Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Reviewed-by: Balbir Singh <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: "Uladzislau Rezki (Sony)" <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
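A hedged sketch of the shape of the fix in the vfree() path (simplified; the memcg side in the real code is handled per page):

    /* only undo accounting that vmalloc() actually performed */
    if (!(vm->flags & VM_MAP_PUT_PAGES)) {
            atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
            /* ... matching MEMCG_VMALLOC decrement for each page ... */
    }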
|
|
Revision tags: v6.13-rc2, v6.13-rc1 |
|
| #
d699440f |
| 26-Nov-2024 |
Andrii Nakryiko <[email protected]> |
mm: fix vrealloc()'s KASAN poisoning logic
When vrealloc() reuses already allocated vmap_area, we need to re-annotate poisoned and unpoisoned portions of underlying memory according to the new size.
Not doing so results in the KASAN splat recorded at [1], which is a KASAN mis-report: it flags a problem where there is none.
Note, hard-coding KASAN_VMALLOC_PROT_NORMAL might not be exactly correct, but KASAN flag logic is pretty involved and spread out throughout __vmalloc_node_range_noprof(), so I'm using the bare minimum flag here and leaving the rest to mm people to refactor this logic and reuse it here.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/bpf/[email protected]/ [1] Fixes: 3ddc2fefe6f3 ("mm: vmalloc: implement vrealloc()") Signed-off-by: Andrii Nakryiko <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Uladzislau Rezki (Sony) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5 |
|
| #
0f9b6856 |
| 23-Oct-2024 |
Suren Baghdasaryan <[email protected]> |
alloc_tag: populate memory for module tags as needed
The memory reserved for module tags does not need to be backed by physical pages until there are tags to store there. Change the way we reserve this memory to allocate only virtual area for the tags and populate it with physical pages as needed when we load a module.
[[email protected]: avoid execmem_vmap() when !MMU] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Pasha Tatashin <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Daniel Gomez <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Kees Cook <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport (Microsoft) <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Petr Pavlu <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sami Tolvanen <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Huth <[email protected]> Cc: Uladzislau Rezki (Sony) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Xiongwei Song <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
2e45474a |
| 23-Oct-2024 |
Mike Rapoport (Microsoft) <[email protected]> |
execmem: add support for cache of large ROX pages
Using large pages to map text areas reduces iTLB pressure and improves performance.
Extend execmem_alloc() with an ability to use huge pages with ROX permissions as a cache for smaller allocations.
To populate the cache, a writable large page is allocated from vmalloc with VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then remapped as ROX.
The direct map alias of that large page is excluded from the direct map.
Portions of that large page are handed out to execmem_alloc() callers without any changes to the permissions.
When the memory is freed with execmem_free() it is invalidated again so that it won't contain stale instructions.
An architecture has to implement execmem_fill_trapping_insns() callback and select ARCH_HAS_EXECMEM_ROX configuration option to be able to use the ROX cache.
The cache is enabled on per-range basis when an architecture sets EXECMEM_ROX_CACHE flag in definition of an execmem_range.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Tested-by: kdevops <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Guo Ren <[email protected]> Cc: Helge Deller <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Johannes Berg <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masami Hiramatsu (Google) <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Simek <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Song Liu <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Uladzislau Rezki (Sony) <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
c82be0be |
| 23-Oct-2024 |
Mike Rapoport (Microsoft) <[email protected]> |
mm: vmalloc: don't account for number of nodes for HUGE_VMAP allocations
vmalloc allocations with VM_ALLOW_HUGE_VMAP that do not explicitly specify node ID will use huge pages only if size_per_node is larger than a huge page.
Still, the actual allocated memory is not distributed between nodes and there is no advantage in such an approach. On the contrary, BPF allocates SZ_2M * num_possible_nodes() for each new bpf_prog_pack, while it could do with a single huge page per pack.
Don't account for number of nodes for VM_ALLOW_HUGE_VMAP with NUMA_NO_NODE and use huge pages whenever the requested allocation size is larger than a huge page.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Tested-by: kdevops <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Guo Ren <[email protected]> Cc: Helge Deller <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Johannes Berg <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masami Hiramatsu (Google) <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Simek <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Song Liu <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1 |
|
| #
9e9e085e |
| 26-Jul-2024 |
Adrian Huang <[email protected]> |
mm/vmalloc: combine all TLB flush operations of KASAN shadow virtual address into one operation
When compiling kernel source 'make -j $(nproc)' with the up-and-running KASAN-enabled kernel on a 256-core machine, the following soft lockup is shown:
watchdog: BUG: soft lockup - CPU#28 stuck for 22s! [kworker/28:1:1760] CPU: 28 PID: 1760 Comm: kworker/28:1 Kdump: loaded Not tainted 6.10.0-rc5 #95 Workqueue: events drain_vmap_area_work RIP: 0010:smp_call_function_many_cond+0x1d8/0xbb0 Code: 38 c8 7c 08 84 c9 0f 85 49 08 00 00 8b 45 08 a8 01 74 2e 48 89 f1 49 89 f7 48 c1 e9 03 41 83 e7 07 4c 01 e9 41 83 c7 03 f3 90 <0f> b6 01 41 38 c7 7c 08 84 c0 0f 85 d4 06 00 00 8b 45 08 a8 01 75 RSP: 0018:ffffc9000cb3fb60 EFLAGS: 00000202 RAX: 0000000000000011 RBX: ffff8883bc4469c0 RCX: ffffed10776e9949 RDX: 0000000000000002 RSI: ffff8883bb74ca48 RDI: ffffffff8434dc50 RBP: ffff8883bb74ca40 R08: ffff888103585dc0 R09: ffff8884533a1800 R10: 0000000000000004 R11: ffffffffffffffff R12: ffffed1077888d39 R13: dffffc0000000000 R14: ffffed1077888d38 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff8883bc400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005577b5c8d158 CR3: 0000000004850000 CR4: 0000000000350ef0 Call Trace: <IRQ> ? watchdog_timer_fn+0x2cd/0x390 ? __pfx_watchdog_timer_fn+0x10/0x10 ? __hrtimer_run_queues+0x300/0x6d0 ? sched_clock_cpu+0x69/0x4e0 ? __pfx___hrtimer_run_queues+0x10/0x10 ? srso_return_thunk+0x5/0x5f ? ktime_get_update_offsets_now+0x7f/0x2a0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? hrtimer_interrupt+0x2ca/0x760 ? __sysvec_apic_timer_interrupt+0x8c/0x2b0 ? sysvec_apic_timer_interrupt+0x6a/0x90 </IRQ> <TASK> ? asm_sysvec_apic_timer_interrupt+0x16/0x20 ? smp_call_function_many_cond+0x1d8/0xbb0 ? __pfx_do_kernel_range_flush+0x10/0x10 on_each_cpu_cond_mask+0x20/0x40 flush_tlb_kernel_range+0x19b/0x250 ? srso_return_thunk+0x5/0x5f ? kasan_release_vmalloc+0xa7/0xc0 purge_vmap_node+0x357/0x820 ? __pfx_purge_vmap_node+0x10/0x10 __purge_vmap_area_lazy+0x5b8/0xa10 drain_vmap_area_work+0x21/0x30 process_one_work+0x661/0x10b0 worker_thread+0x844/0x10e0 ? srso_return_thunk+0x5/0x5f ? __kthread_parkme+0x82/0x140 ? __pfx_worker_thread+0x10/0x10 kthread+0x2a5/0x370 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK>
Debugging Analysis:
1. The following ftrace log shows that the lockup CPU spends too much time iterating vmap_nodes and flushing TLB when purging vm_area structures. (Some info is trimmed).
  kworker: funcgraph_entry:               |  drain_vmap_area_work() {
  kworker: funcgraph_entry:               |    mutex_lock() {
  kworker: funcgraph_entry:    1.092 us   |      __cond_resched();
  kworker: funcgraph_exit:     3.306 us   |    }
  ... ...
  kworker: funcgraph_entry:               |    flush_tlb_kernel_range() {
  ... ...
  kworker: funcgraph_exit:  # 7533.649 us |    }
  ... ...
  kworker: funcgraph_entry:    2.344 us   |    mutex_unlock();
  kworker: funcgraph_exit:  $ 23871554 us |  }
The drain_vmap_area_work() spends over 23 seconds.
There are 2805 flush_tlb_kernel_range() calls in the ftrace log.
  * One is called in __purge_vmap_area_lazy().
  * Others are called by purge_vmap_node->kasan_release_vmalloc.
    purge_vmap_node() iteratively releases kasan vmalloc allocations and
    flushes TLB for each vmap_area.
    - [Rough calculation] Each flush_tlb_kernel_range() runs about 7.5ms.
      -- 2804 * 7.5ms = 21.03 seconds.
      -- That's why a soft lockup is triggered.
2. Extending the soft lockup time can work around the issue (For example, # echo 60 > /proc/sys/kernel/watchdog_thresh). This confirms the above-mentioned speculation: drain_vmap_area_work() spends too much time.
If we combine all TLB flush operations of the KASAN shadow virtual address into one operation in the call path 'purge_vmap_node()->kasan_release_vmalloc()', the running time of drain_vmap_area_work() can be saved greatly. The idea is from the flush_tlb_kernel_range() call in __purge_vmap_area_lazy(). And, the soft lockup won't be triggered.
Here is the test result based on 6.10:
[6.10 wo/ the patch]
1. ftrace latency profiling (record a trace if the latency > 20s).
   echo 20000000 > /sys/kernel/debug/tracing/tracing_thresh
   echo drain_vmap_area_work > /sys/kernel/debug/tracing/set_graph_function
   echo function_graph > /sys/kernel/debug/tracing/current_tracer
   echo 1 > /sys/kernel/debug/tracing/tracing_on
2. Run `make -j $(nproc)` to compile the kernel source
3. Once the soft lockup is reproduced, check the ftrace log:
   cat /sys/kernel/debug/tracing/trace
   # tracer: function_graph
   #
   # CPU  DURATION                  FUNCTION CALLS
   # |     |   |                     |   |   |   |
    76) $ 50412985 us |    } /* __purge_vmap_area_lazy */
    76) $ 50412997 us |  } /* drain_vmap_area_work */
    76) $ 29165911 us |    } /* __purge_vmap_area_lazy */
    76) $ 29165926 us |  } /* drain_vmap_area_work */
    91) $ 53629423 us |    } /* __purge_vmap_area_lazy */
    91) $ 53629434 us |  } /* drain_vmap_area_work */
    91) $ 28121014 us |    } /* __purge_vmap_area_lazy */
    91) $ 28121026 us |  } /* drain_vmap_area_work */
[6.10 w/ the patch]
1. Repeat step 1-2 in "[6.10 wo/ the patch]"
2. The soft lockup is not triggered and ftrace log is empty.
   cat /sys/kernel/debug/tracing/trace
   # tracer: function_graph
   #
   # CPU  DURATION                  FUNCTION CALLS
   # |     |   |                     |   |   |   |
3. Setting 'tracing_thresh' to 10/5 seconds does not get any ftrace log.
4. Setting 'tracing_thresh' to 1 second gets ftrace log.
   cat /sys/kernel/debug/tracing/trace
   # tracer: function_graph
   #
   # CPU  DURATION                  FUNCTION CALLS
   # |     |   |                     |   |   |   |
    23) $ 1074942 us |    } /* __purge_vmap_area_lazy */
    23) $ 1074950 us |  } /* drain_vmap_area_work */
The worst execution time of drain_vmap_area_work() is about 1 second.
Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 282631cb2447 ("mm: vmalloc: remove global purge_vmap_area_root rb-tree") Signed-off-by: Adrian Huang <[email protected]> Co-developed-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Tested-by: Jiwei Sun <[email protected]> Reviewed-by: Baoquan He <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
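The idea, as an illustrative sketch (the real patch threads a deferred-flush indication through kasan_release_vmalloc(); the loop below only shows the accumulate-then-flush-once structure):

    unsigned long start = ULONG_MAX, end = 0;
    struct vmap_area *va;

    list_for_each_entry(va, &purge_list, list) {
            start = min(start, (unsigned long)kasan_mem_to_shadow((void *)va->va_start));
            end   = max(end,   (unsigned long)kasan_mem_to_shadow((void *)va->va_end));
            /* release the shadow mappings here without flushing the TLB yet */
    }

    if (end > start)
            flush_tlb_kernel_range(start, end);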
|
| #
6004fe00 |
| 06-Sep-2024 |
Uladzislau Rezki (Sony) <[email protected]> |
mm/vmalloc.c: use "high-order" in description non 0-order pages
In many places, in the comments, we use both "higher-order" and "high-order" to describe the non 0-order pages. That is confusing, be
mm/vmalloc.c: use "high-order" in description non 0-order pages
In many places, in the comments, we use both "higher-order" and "high-order" to describe the non 0-order pages. That is confusing, because a "higher-order" statement does not reflect what it is compared with.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Suggested-by: Baoquan He <[email protected]> Reviewed-by: Baoquan He <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
b44f71e3 |
| 06-Sep-2024 |
ZhangPeng <[email protected]> |
mm/vmalloc.c: use helper function va_size()
Use helper function va_size() to improve code readability. No functional modification involved.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: ZhangPeng <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
7ae12a57 |
| 28-Aug-2024 |
Hongbo Li <[email protected]> |
mm/vmalloc.c: make use of the helper macro LIST_HEAD()
list_head can be initialized automatically with LIST_HEAD() instead of calling INIT_LIST_HEAD(). Here we can simplify the code.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hongbo Li <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
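The cleanup amounts to the following pattern (local variable name chosen for illustration):

    /* before: declare, then initialise at runtime */
    struct list_head local_list;
    INIT_LIST_HEAD(&local_list);

    /* after: LIST_HEAD() declares and initialises in one step */
    LIST_HEAD(local_list);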
|
| #
7de8728f |
| 27-Aug-2024 |
Uladzislau Rezki (Sony) <[email protected]> |
mm: vmalloc: refactor vm_area_alloc_pages() function
The aim is to simplify the vm_area_alloc_pages() function and make it less confusing, as it has become rather cluttered:
- eliminate the "bulk_gfp" variable and do not overwrite the gfp flag for
  the bulk allocator;
- drop the __GFP_NOFAIL flag for high-order-page requests on the upper
  layer; __GFP_NOFAIL handling becomes less spread out between levels;
- add a comment about the fallback path taken if a high-order attempt is
  unsuccessful, because __GFP_NOFAIL is dropped for such cases;
- fix a typo in a commit message.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Baoquan He <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
6963f008 |
| 13-Aug-2024 |
Miao Wang <[email protected]> |
mm: vmalloc: add optimization hint on page existence check
In commit 21e516b913c1 ("mm: vmalloc: dump page owner info if page is already mapped"), a BUG_ON macro was changed into an if statement, where the compiler optimization hint introduced in the BUG_ON macro was removed along with this change. This patch adds back the hint.
Link: https://lkml.kernel.org/r/[email protected] Fixes: 21e516b913c1 ("mm: vmalloc: dump page owner info if page is already mapped") Signed-off-by: Miao Wang <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Hariom Panthi <[email protected]> Cc: "Uladzislau Rezki (Sony)" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
3ddc2fef |
| 22-Jul-2024 |
Danilo Krummrich <[email protected]> |
mm: vmalloc: implement vrealloc()
Patch series "Align kvrealloc() with krealloc()", v2.
Besides the obvious (and desired) difference between krealloc() and kvrealloc(), there is some inconsistency in their function signatures and behavior:
- krealloc() frees the memory when the requested size is zero, whereas kvrealloc() simply returns a pointer to the existing allocation.
- krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas kvrealloc() does not accept a NULL pointer at all and, if passed, would fault instead.
- krealloc() is self-contained, whereas kvrealloc() relies on the caller to provide the size of the previous allocation.
Inconsistent behavior throughout allocation APIs is error prone, hence make kvrealloc() behave like krealloc(), which seems superior in all mentioned aspects.
In order to be able to get rid of kvrealloc()'s oldsize parameter, introduce vrealloc() and make use of it in kvrealloc().
Making use of vrealloc() in kvrealloc() also provides opportunities to grow (and shrink) allocations more efficiently. For instance, vrealloc() can be optimized to allocate and map additional pages to grow the allocation or unmap and free unused pages to shrink the allocation.
Besides the above, those functions are required by Rust's allocator abstractions [1] (rework based on this series in [2]). With `Vec` or `KVec` respectively, potentially growing (and shrinking) data structures are rather common.
[1] https://lore.kernel.org/lkml/[email protected]/ [2] https://git.kernel.org/pub/scm/linux/kernel/git/dakr/linux.git/log/?h=rust/mm
This patch (of 2):
Implement vrealloc() analogous to krealloc().
Currently, krealloc() requires the caller to pass the size of the previous memory allocation, which, instead, should be self-contained.
We attempt to fix this in a subsequent patch which, in order to do so, requires vrealloc().
Besides that, we need realloc() functions for kernel allocators in Rust too. With `Vec` or `KVec` respectively, potentially growing (and shrinking) data structures are rather common.
[[email protected]: fix missing nommu implementation] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: document concurrency restrictions] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: consider spare memory for __GFP_ZERO] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: properly document __GFP_ZERO behavior] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Danilo Krummrich <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Chandan Babu R <[email protected]> Cc: Christian König <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hyeonggon Yoo <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kees Cook <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Uladzislau Rezki <[email protected]> Cc: Wedson Almeida Filho <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
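A hedged usage sketch of the resulting interface, following the krealloc()-like semantics described above (error handling abbreviated):

    void *buf = vmalloc(PAGE_SIZE);
    if (!buf)
            return -ENOMEM;

    /* grow later without having to remember the old size */
    void *bigger = vrealloc(buf, 4 * PAGE_SIZE, GFP_KERNEL);
    if (!bigger) {
            vfree(buf);
            return -ENOMEM;
    }
    buf = bigger;
    /* ... use buf ... */
    vfree(buf);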
|
| #
409faf8c |
| 29-Aug-2024 |
Adrian Huang <[email protected]> |
mm: vmalloc: optimize vmap_lazy_nr arithmetic when purging each vmap_area
When running the vmalloc stress on a 448-core system, observe the average latency of purge_vmap_node() is about 2 seconds by using the eBPF/bcc 'funclatency.py' tool [1].
# /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
usecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 29 | | 4 -> 7 : 19 | | 8 -> 15 : 56 | | 16 -> 31 : 483 |**** | 32 -> 63 : 1548 |************ | 64 -> 127 : 2634 |********************* | 128 -> 255 : 2535 |********************* | 256 -> 511 : 1776 |************** | 512 -> 1023 : 1015 |******** | 1024 -> 2047 : 573 |**** | 2048 -> 4095 : 488 |**** | 4096 -> 8191 : 1091 |********* | 8192 -> 16383 : 3078 |************************* | 16384 -> 32767 : 4821 |****************************************| 32768 -> 65535 : 3318 |*************************** | 65536 -> 131071 : 1718 |************** | 131072 -> 262143 : 2220 |****************** | 262144 -> 524287 : 1147 |********* | 524288 -> 1048575 : 1179 |********* | 1048576 -> 2097151 : 822 |****** | 2097152 -> 4194303 : 906 |******* | 4194304 -> 8388607 : 2148 |***************** | 8388608 -> 16777215 : 4497 |************************************* | 16777216 -> 33554431 : 289 |** |
avg = 2041714 usecs, total: 78381401772 usecs, count: 38390
The worst case is over 16-33 seconds, so soft lockup is triggered [2].
[Root Cause]
1) Each purge_list is long. The following shows the number of vmap_area structures purged.
   crash> p vmap_nodes
   vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
   crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
     nr_purged = 663070
     ...
     nr_purged = 821670
     nr_purged = 692214
     nr_purged = 726808
     ...
2) atomic_long_sub() employs the 'lock' prefix to ensure the atomic operation when purging each vmap_area. However, the iteration is over 600000 vmap_area (See 'nr_purged' above).
Here is objdump output:
   $ objdump -D vmlinux

   ffffffff813e8c80 <purge_vmap_node>:
   ...
   ffffffff813e8d70: f0 48 29 2d 68 0c bb    lock sub %rbp,0x2bb0c68(%rip)
   ...
Quote from "Instruction tables" pdf file [3]: Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices, then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems.
That's why the latency of purge_vmap_node() dramatically increases on a many-core system: One core is busy on purging each vmap_area of the *long* purge_list and executing atomic_long_sub() for each vmap_area, while other cores free vmalloc allocations and execute atomic_long_add_return() in free_vmap_area_noflush().
[Solution] Employ a local variable to record the total purged pages, and execute atomic_long_sub() after the traversal of the purge_list is done. The experiment result shows the latency improvement is 99%.
[Experiment Result]
1) System Configuration: Three servers (with HT-enabled) are tested.
   * 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
   * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
   * 448-core server: AMD Zen 4 Processor*2
2) Kernel Config
   * CONFIG_KASAN is disabled
3) The data in column "w/o patch" and "w/ patch"
   * Unit: micro seconds (us)
   * Each data is the average of 3-time measurements
         System        w/o patch (us)   w/ patch (us)   Improvement (%)
     ---------------   --------------   -------------   ---------------
     72-core server          2194             14             99.36%
     192-core server       143799           1139             99.21%
     448-core server      1992122           6883             99.65%
[1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py [2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12 [3] https://www.agner.org/optimize/instruction_tables.pdf
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Adrian Huang <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
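The fix boils down to the following shape (illustrative sketch; vmap_lazy_nr and va_size() are the existing names in mm/vmalloc.c, the rest is assumed):

    unsigned long nr_purged_pages = 0;
    struct vmap_area *va, *n;

    list_for_each_entry_safe(va, n, &purge_list, list) {
            nr_purged_pages += va_size(va) >> PAGE_SHIFT;
            /* ... detach and free the vmap_area, no atomic op here ... */
    }

    atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);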
|
| #
3e3de794 |
| 12-Aug-2024 |
Will Deacon <[email protected]> |
mm: vmalloc: ensure vmap_block is initialised before adding to queue
Commit 8c61291fd850 ("mm: fix incorrect vbq reference in purge_fragmented_block") extended the 'vmap_block' structure to contain a 'cpu' field which is set at allocation time to the id of the initialising CPU.
When a new 'vmap_block' is being instantiated by new_vmap_block(), the partially initialised structure is added to the local 'vmap_block_queue' xarray before the 'cpu' field has been initialised. If another CPU is concurrently walking the xarray (e.g. via vm_unmap_aliases()), then it may perform an out-of-bounds access to the remote queue thanks to an uninitialised index.
This has been observed as UBSAN errors in Android:
| Internal error: UBSAN: array index out of bounds: 00000000f2005512 [#1] PREEMPT SMP | | Call trace: | purge_fragmented_block+0x204/0x21c | _vm_unmap_aliases+0x170/0x378 | vm_unmap_aliases+0x1c/0x28 | change_memory_common+0x1dc/0x26c | set_memory_ro+0x18/0x24 | module_enable_ro+0x98/0x238 | do_init_module+0x1b0/0x310
Move the initialisation of 'vb->cpu' in new_vmap_block() ahead of the addition to the xarray.
Link: https://lkml.kernel.org/r/[email protected] Fixes: 8c61291fd850 ("mm: fix incorrect vbq reference in purge_fragmented_block") Signed-off-by: Will Deacon <[email protected]> Reviewed-by: Baoquan He <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Cc: Zhaoyang Huang <[email protected]> Cc: Hailong.Liu <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
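In sketch form (illustrative, not the literal diff), the ordering fix is simply to complete initialisation before publication:

    vb->cpu = raw_smp_processor_id();     /* was previously set after the insert */
    /* ... initialise the remaining vmap_block fields ... */

    err = xa_insert(xa, vb_idx, vb, gfp_mask);   /* now visible to concurrent walkers */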
|
| #
61ebe5a7 |
| 08-Aug-2024 |
Hailong Liu <[email protected]> |
mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0
The __vmap_pages_range_noflush() assumes its argument pages** contains pages with the same page shift. However, since commit e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations"), if gfp_flags includes __GFP_NOFAIL with high order in vm_area_alloc_pages() and page allocation failed for high order, the pages** may contain two different page shifts (high order and order-0). This could lead __vmap_pages_range_noflush() to perform incorrect mappings, potentially resulting in memory corruption.
Users might encounter this as follows (vmap_allow_huge = true, 2M is for PMD_SIZE):
kvmalloc(2M, __GFP_NOFAIL|GFP_X)
    __vmalloc_node_range_noprof(vm_flags=VM_ALLOW_HUGE_VMAP)
        vm_area_alloc_pages(order=9) ---> order-9 allocation failed and fallback to order-0
        vmap_pages_range()
            vmap_pages_range_noflush()
                __vmap_pages_range_noflush(page_shift = 21) ----> wrong mapping happens
If a high-order allocation fails, __vmalloc_node_range_noprof() will already retry with order-0, so falling back to order-0 inside vm_area_alloc_pages() is unnecessary. Fix this by removing the fallback code.
Link: https://lkml.kernel.org/r/[email protected] Fixes: e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations") Signed-off-by: Hailong Liu <[email protected]> Reported-by: Tangquan Zheng <[email protected]> Reviewed-by: Baoquan He <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Acked-by: Barry Song <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.10, v6.10-rc7, v6.10-rc6 |
|
| #
a34acf30 |
| 26-Jun-2024 |
Uladzislau Rezki (Sony) <[email protected]> |
mm: vmalloc: check if a hash-index is in cpu_possible_mask
The problem is that there are systems where cpu_possible_mask has gaps between set CPUs, for example SPARC. In this scenario the addr_to_vb_xa() hash function can return an index that, via the per_cpu() macro, accesses the per-CPU area of a not-possible and therefore never set-up CPU. This results in an oops on SPARC.
A per-cpu vmap_block_queue is also used as hash table, incorrectly assuming the cpu_possible_mask has no gaps. Fix it by adjusting an index to a next possible CPU.
Link: https://lkml.kernel.org/r/[email protected] Fixes: 062eacf57ad9 ("mm: vmalloc: remove a global vmap_blocks xarray") Reported-by: Nick Bowler <[email protected]> Closes: https://lore.kernel.org/linux-kernel/ZntjIE6msJbF8zTa@MiWiFi-R3L-srv/T/ Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Baoquan He <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Hailong.Liu <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
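The adjustment can be pictured as follows (hedged sketch; index/addr_hash are illustrative names, vmap_block_queue is the existing per-CPU variable):

    index = addr_hash % num_possible_cpus();
    if (!cpu_possible(index))
            index = cpumask_next(index, cpu_possible_mask);

    vbq = &per_cpu(vmap_block_queue, index);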
|
|
Revision tags: v6.10-rc5, v6.10-rc4 |
|
| #
55ccad6f |
| 10-Jun-2024 |
Shubhang Kaushik OS <[email protected]> |
vmalloc: modify the alloc_vmap_area() error message for better diagnostics
'vmap allocation for size %lu failed: use vmalloc=<size> to increase size' The above warning is printed when allocation from the restricted virtual memory range fails because the range is exhausted.
This message is misleading because 'vmalloc=' is supported on arm32 and x86 platforms and is not a valid kernel parameter on a number of other platforms (in particular it's not supported on arm64, alpha, loongarch, arc, csky, hexagon, microblaze, mips, nios2, openrisc, parisc, m68k, powerpc, riscv, sh, um, xtensa, s390, sparc). With the update, the output gets modified to include the function parameters along with the start and end of the virtual memory range allowed.
The warning message after fix on kernel version 6.10.0-rc1+:
vmalloc_node_range for size 33619968 failed: Address range restricted between 0xffff800082640000 - 0xffff800084650000
Backtrace with the misleading error message:
vmap allocation for size 33619968 failed: use vmalloc=<size> to increase size insmod: vmalloc error: size 33554432, vm_struct allocation failed, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0 CPU: 46 PID: 1977 Comm: insmod Tainted: G E 6.10.0-rc1+ #79 Hardware name: INGRASYS Yushan Server iSystem TEMP-S000141176+10/Yushan Motherboard, BIOS 2.10.20230517 (SCP: xxx) yyyy/mm/dd Call trace: dump_backtrace+0xa0/0x128 show_stack+0x20/0x38 dump_stack_lvl+0x78/0x90 dump_stack+0x18/0x28 warn_alloc+0x12c/0x1b8 __vmalloc_node_range_noprof+0x28c/0x7e0 custom_init+0xb4/0xfff8 [test_driver] do_one_initcall+0x60/0x290 do_init_module+0x68/0x250 load_module+0x236c/0x2428 init_module_from_file+0x8c/0xd8 __arm64_sys_finit_module+0x1b4/0x388 invoke_syscall+0x78/0x108 el0_svc_common.constprop.0+0x48/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x3c/0x130 el0t_64_sync_handler+0x100/0x130 el0t_64_sync+0x190/0x198
[[email protected]: v5] Link: https://lkml.kernel.org/r/CH2PR01MB5894B0182EA0B28DF2EFB916F5C72@CH2PR01MB5894.prod.exchangelabs.com Link: https://lkml.kernel.org/r/MN2PR01MB59025CC02D1D29516527A693F5C62@MN2PR01MB5902.prod.exchangelabs.com Signed-off-by: Shubhang Kaushik <[email protected]> Reviewed-by: Christoph Lameter (Ampere) <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Guo Ren <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Uladzislau Rezki (Sony) <[email protected]> Cc: Xiongwei Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.10-rc3, v6.10-rc2 |
|
| #
f56810c9 |
| 28-May-2024 |
Uros Bizjak <[email protected]> |
mm/vmalloc: use __this_cpu_try_cmpxchg() in preload_this_cpu_lock()
Use __this_cpu_try_cmpxchg() instead of __this_cpu_cmpxchg (*ptr, old, new) == old in preload_this_cpu_lock(). x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg.
The generated code improves from:
  4bb6: 48 85 f6              test    %rsi,%rsi
  4bb9: 0f 84 10 fa ff ff     je      45cf <...>
  4bbf: 4c 89 e8              mov     %r13,%rax
  4bc2: 65 48 0f b1 35 00 00  cmpxchg %rsi,%gs:0x0(%rip)
  4bc9: 00 00
  4bcb: 48 85 c0              test    %rax,%rax
  4bce: 0f 84 fb f9 ff ff     je      45cf <...>
to:
  4bb6: 48 85 f6              test    %rsi,%rsi
  4bb9: 0f 84 10 fa ff ff     je      45cf <...>
  4bbf: 4c 89 e8              mov     %r13,%rax
  4bc2: 65 48 0f b1 35 00 00  cmpxchg %rsi,%gs:0x0(%rip)
  4bc9: 00 00
  4bcb: 0f 84 fe f9 ff ff     je      45cf <...>
No functional change intended.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uros Bizjak <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
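The two forms compare roughly as follows (sketch; ne_fit_preload_node is the per-CPU slot used by preload_this_cpu_lock(), the rest is illustrative):

    /* before: compare the returned old value explicitly */
    claimed = __this_cpu_cmpxchg(ne_fit_preload_node, NULL, va) == NULL;

    /* after: the try_cmpxchg form lets the compiler test the flags directly */
    struct vmap_area *expected = NULL;

    claimed = __this_cpu_try_cmpxchg(ne_fit_preload_node, &expected, va);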
|
| #
8c61291f |
| 07-Jun-2024 |
Zhaoyang Huang <[email protected]> |
mm: fix incorrect vbq reference in purge_fragmented_block
xa_for_each() in _vm_unmap_aliases() loops through all vbs. However, since commit 062eacf57ad9 ("mm: vmalloc: remove a global vmap_blocks xarray") the vb from xarray may not be on the corresponding CPU vmap_block_queue. Consequently, purge_fragmented_block() might use the wrong vbq->lock to protect the free list, leading to vbq->free breakage.
Incorrect lock protection can exhaust all vmalloc space as follows: CPU0 CPU1 +--------------------------------------------+ | +--------------------+ +-----+ | +--> | |---->| |------+ | CPU1:vbq free_list | | vb1 | +--- | |<----| |<-----+ | +--------------------+ +-----+ | +--------------------------------------------+
_vm_unmap_aliases() vb_alloc() new_vmap_block() xa_for_each(&vbq->vmap_blocks, idx, vb) --> vb in CPU1:vbq->freelist
purge_fragmented_block(vb) spin_lock(&vbq->lock) spin_lock(&vbq->lock) --> use CPU0:vbq->lock --> use CPU1:vbq->lock
list_del_rcu(&vb->free_list) list_add_tail_rcu(&vb->free_list, &vbq->free) __list_del(vb->prev, vb->next) next->prev = prev +--------------------+ | | | CPU1:vbq free_list | +---| |<--+ | +--------------------+ | +----------------------------+ __list_add(new, head->prev, head) +--------------------------------------------+ | +--------------------+ +-----+ | +--> | |---->| |------+ | CPU1:vbq free_list | | vb2 | +--- | |<----| |<-----+ | +--------------------+ +-----+ | +--------------------------------------------+
prev->next = next +--------------------------------------------+ |----------------------------+ | | +--------------------+ | +-----+ | +--> | |--+ | |------+ | CPU1:vbq free_list | | vb2 | +--- | |<----| |<-----+ | +--------------------+ +-----+ | +--------------------------------------------+ Here’s a list breakdown. All vbs, which were to be added to ‘prev’, cannot be used by list_for_each_entry_rcu(vb, &vbq->free, free_list) in vb_alloc(). Thus, vmalloc space is exhausted.
This issue affects both erofs and f2fs, the stacktrace is as follows: erofs: [<ffffffd4ffb93ad4>] __switch_to+0x174 [<ffffffd4ffb942f0>] __schedule+0x624 [<ffffffd4ffb946f4>] schedule+0x7c [<ffffffd4ffb947cc>] schedule_preempt_disabled+0x24 [<ffffffd4ffb962ec>] __mutex_lock+0x374 [<ffffffd4ffb95998>] __mutex_lock_slowpath+0x14 [<ffffffd4ffb95954>] mutex_lock+0x24 [<ffffffd4fef2900c>] reclaim_and_purge_vmap_areas+0x44 [<ffffffd4fef25908>] alloc_vmap_area+0x2e0 [<ffffffd4fef24ea0>] vm_map_ram+0x1b0 [<ffffffd4ff1b46f4>] z_erofs_lz4_decompress+0x278 [<ffffffd4ff1b8ac4>] z_erofs_decompress_queue+0x650 [<ffffffd4ff1b8328>] z_erofs_runqueue+0x7f4 [<ffffffd4ff1b66a8>] z_erofs_read_folio+0x104 [<ffffffd4feeb6fec>] filemap_read_folio+0x6c [<ffffffd4feeb68c4>] filemap_fault+0x300 [<ffffffd4fef0ecac>] __do_fault+0xc8 [<ffffffd4fef0c908>] handle_mm_fault+0xb38 [<ffffffd4ffb9f008>] do_page_fault+0x288 [<ffffffd4ffb9ed64>] do_translation_fault[jt]+0x40 [<ffffffd4fec39c78>] do_mem_abort+0x58 [<ffffffd4ffb8c3e4>] el0_ia+0x70 [<ffffffd4ffb8c260>] el0t_64_sync_handler[jt]+0xb0 [<ffffffd4fec11588>] ret_to_user[jt]+0x0
f2fs: [<ffffffd4ffb93ad4>] __switch_to+0x174 [<ffffffd4ffb942f0>] __schedule+0x624 [<ffffffd4ffb946f4>] schedule+0x7c [<ffffffd4ffb947cc>] schedule_preempt_disabled+0x24 [<ffffffd4ffb962ec>] __mutex_lock+0x374 [<ffffffd4ffb95998>] __mutex_lock_slowpath+0x14 [<ffffffd4ffb95954>] mutex_lock+0x24 [<ffffffd4fef2900c>] reclaim_and_purge_vmap_areas+0x44 [<ffffffd4fef25908>] alloc_vmap_area+0x2e0 [<ffffffd4fef24ea0>] vm_map_ram+0x1b0 [<ffffffd4ff1a3b60>] f2fs_prepare_decomp_mem+0x144 [<ffffffd4ff1a6c24>] f2fs_alloc_dic+0x264 [<ffffffd4ff175468>] f2fs_read_multi_pages+0x428 [<ffffffd4ff17b46c>] f2fs_mpage_readpages+0x314 [<ffffffd4ff1785c4>] f2fs_readahead+0x50 [<ffffffd4feec3384>] read_pages+0x80 [<ffffffd4feec32c0>] page_cache_ra_unbounded+0x1a0 [<ffffffd4feec39e8>] page_cache_ra_order+0x274 [<ffffffd4feeb6cec>] do_sync_mmap_readahead+0x11c [<ffffffd4feeb6764>] filemap_fault+0x1a0 [<ffffffd4ff1423bc>] f2fs_filemap_fault+0x28 [<ffffffd4fef0ecac>] __do_fault+0xc8 [<ffffffd4fef0c908>] handle_mm_fault+0xb38 [<ffffffd4ffb9f008>] do_page_fault+0x288 [<ffffffd4ffb9ed64>] do_translation_fault[jt]+0x40 [<ffffffd4fec39c78>] do_mem_abort+0x58 [<ffffffd4ffb8c3e4>] el0_ia+0x70 [<ffffffd4ffb8c260>] el0t_64_sync_handler[jt]+0xb0 [<ffffffd4fec11588>] ret_to_user[jt]+0x0
To fix this, introduce a 'cpu' field within vmap_block to record which CPU's vmap_block_queue this vb belongs to.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: fc1e0d980037 ("mm/vmalloc: prevent stale TLBs in fully utilized blocks") Signed-off-by: Zhaoyang Huang <[email protected]> Suggested-by: Hailong.Liu <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Cc: Baoquan He <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|