|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3 |
|
| #
e05741fb |
| 16-Apr-2025 |
Tianyang Zhang <[email protected]> |
mm/page_alloc.c: avoid infinite retries caused by cpuset race
__alloc_pages_slowpath has no change detection for ac->nodemask in the retry path, while cpuset can modify it in parallel. For processes that set their mempolicy to MPOL_BIND, this results in ac->nodemask changing under them: should_reclaim_retry then judges against the latest nodemask and jumps to retry, while get_page_from_freelist only traverses the zonelist from ac->preferred_zoneref, which was selected by the now-expired nodemask. In some cases this causes infinite retries:
cpu 64:
__alloc_pages_slowpath {
        /* ..... */
retry:
        /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
        if (alloc_flags & ALLOC_KSWAPD)
                wake_all_kswapds(order, gfp_mask, ac);

        /*
         * cpu 1:
         * cpuset_write_resmask
         *   update_nodemask
         *     update_nodemasks_hier
         *       update_tasks_nodemask
         *         mpol_rebind_task
         *           mpol_rebind_policy
         *             mpol_rebind_nodemask
         * // mempolicy->nodes has been modified,
         * // which ac->nodemask points to
         */

        /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                 did_some_progress > 0, &no_progress_loops))
                goto retry;
}
Simultaneously starting multiple cpuset01 instances from LTP can quickly reproduce this issue on a multi-node server when maximum memory pressure is reached and swap is enabled.
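As a condensed sketch of the kind of fix this implies (hypothetical placement; names follow mm/page_alloc.c), the retry path can re-derive the preferred zoneref from the nodemask it is about to retry with:

        restart:
                /*
                 * cpuset may have rebound the task's mempolicy since the
                 * last pass; recomputing preferred_zoneref from the current
                 * ac->nodemask keeps get_page_from_freelist() consistent
                 * with what should_reclaim_retry() just judged against.
                 */
                ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                                             ac->highest_zoneidx,
                                                             ac->nodemask);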
Link: https://lkml.kernel.org/r/[email protected] Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice") Signed-off-by: Tianyang Zhang <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Brendan Jackman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Zi Yan <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
fefc0751 |
| 06-May-2025 |
Kirill A. Shutemov <[email protected]> |
mm/page_alloc: fix race condition in unaccepted memory handling
The page allocator tracks the number of zones that have unaccepted memory using static_branch_inc/dec() and uses that static branch in hot paths to determine if it needs to deal with unaccepted memory.
Borislav and Thomas pointed out that the tracking is racy: operations on static_branch are not serialized against adding/removing unaccepted pages to/from the zone.
Sanity checks inside the static_branch machinery detect it:
WARNING: CPU: 0 PID: 10 at kernel/jump_label.c:276 __static_key_slow_dec_cpuslocked+0x8e/0xa0
The comment around the WARN() explains the problem:
        /*
         * Warn about the '-1' case though; since that means a
         * decrement is concurrent with a first (0->1) increment. IOW
         * people are trying to disable something that wasn't yet fully
         * enabled. This suggests an ordering problem on the user side.
         */
The effect of this static_branch optimization is only visible in microbenchmarks.
Instead of adding more complexity around it, remove it altogether.
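A condensed, hypothetical sketch of the race (the static key name follows the unaccepted-memory code): the 0<->1 transitions of the zone count are not ordered against concurrent list updates, so the inc and dec can interleave as 0 -> -1 -> 0 and trip the WARN above.

        /* CPU A: accepting the last unaccepted page of a zone */
        list_del(&page->lru);
        if (list_empty(&zone->unaccepted_pages))
                static_branch_dec(&zones_with_unaccepted_pages);

        /* CPU B: concurrently adding unaccepted memory to that zone */
        if (list_empty(&zone->unaccepted_pages))        /* stale view */
                static_branch_inc(&zones_with_unaccepted_pages);
        list_add(&page->lru, &zone->unaccepted_pages);

        /* nothing serializes the dec against the inc: 0 -> -1 is possible */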
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory") Link: https://lore.kernel.org/all/20250506092445.GBaBnVXXyvnazly6iF@fat_crate.local Reported-by: Borislav Petkov <[email protected]> Tested-by: Borislav Petkov (AMD) <[email protected]> Reported-by: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Brendan Jackman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: <[email protected]> [6.5+] Signed-off-by: Andrew Morton <[email protected]>
|
| #
23fa022a |
| 06-May-2025 |
Kirill A. Shutemov <[email protected]> |
mm/page_alloc: ensure try_alloc_pages() plays well with unaccepted memory
try_alloc_pages() will not attempt to allocate memory if the system has *any* unaccepted memory. Since memory is accepted only as needed and can remain in the system indefinitely, this causes the interface to always fail.
Rather than immediately giving up, attempt to use already accepted memory on free lists.
Pass 'alloc_flags' to cond_accept_memory() and do not accept new memory for ALLOC_TRYLOCK requests.
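A condensed sketch of that change (hypothetical signature details; the real function takes more context):

        static bool cond_accept_memory(struct zone *zone, unsigned int order,
                                       int alloc_flags)
        {
                /*
                 * Accepting memory can block; opportunistic callers such as
                 * try_alloc_pages() must not trigger it. They can still be
                 * served from memory that was already accepted and freed.
                 */
                if (alloc_flags & ALLOC_TRYLOCK)
                        return false;

                /* ... accept a batch of unaccepted pages into the zone ... */
                return true;
        }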
Found via code inspection - only BPF uses this at present and the runtime effects are unclear.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation") Cc: Alexei Starovoitov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Brendan Jackman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
0ae0227f |
| 05-May-2025 |
David Wang <[email protected]> |
mm/codetag: move tag retrieval back upfront in __free_pages()
Commit 51ff4d7486f0 ("mm: avoid extra mem_alloc_profiling_enabled() checks") introduces a possible use-after-free scenario: when the page is non-compound, page[0] can be released by another thread right after put_page_testzero fails in the current thread, and pgalloc_tag_sub_pages afterwards manipulates an invalid page when accounting the remaining pages:
        [timeline]     [thread1]                    [thread2]
            |          alloc_page non-compound
            V
            |                                       get_page, refcount inc
            V
            |          in ___free_pages,
            |          put_page_testzero fails
            V
            |                                       put_page, page released
            V
            |          in ___free_pages,
            |          pgalloc_tag_sub_pages
            |          manipulates an invalid page
            V
Restore __free_pages() to its previous behavior: retrieve the alloc tag up front, while the page is still guaranteed to be valid.
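A sketch of the restored ordering (simplified; the real __free_pages() is close to this shape):

        void __free_pages(struct page *page, unsigned int order)
        {
                /* read both before the refcount can hit zero elsewhere */
                int head = PageHead(page);
                struct alloc_tag *tag = pgalloc_tag_get(page);

                if (put_page_testzero(page))
                        free_frozen_pages(page, order);
                else if (!head) {
                        /*
                         * page[0] may now be freed by another thread at
                         * any time; account the remaining pages against
                         * the saved tag instead of touching the page.
                         */
                        pgalloc_tag_sub_pages(tag, (1 << order) - 1);
                        while (order-- > 0)
                                free_frozen_pages(page + (1 << order), order);
                }
        }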
Link: https://lkml.kernel.org/r/[email protected] Fixes: 51ff4d7486f0 ("mm: avoid extra mem_alloc_profiling_enabled() checks") Signed-off-by: David Wang <[email protected]> Acked-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Brendan Jackman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
38448181 |
| 16-Apr-2025 |
Johannes Weiner <[email protected]> |
mm: vmscan: restore high-cpu watermark safety in kswapd
Vlastimil points out that commit a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks") switched kswapd from zone_watermark_ok_safe() to the standard, percpu-cached version of reading free pages, thus dropping the watermark safety precautions for systems with high CPU counts (e.g. >212 cpus on 64G). Restore them.
Since zone_watermark_ok_safe() is no longer the right interface, and this was the last caller of the function anyway, open-code the zone_page_state_snapshot() conditional and delete the function.
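A sketch of the open-coded conditional (condensed; it mirrors the zone_watermark_ok_safe() logic it replaces):

        long free_pages = zone_page_state(zone, NR_FREE_PAGES);

        /*
         * Per-cpu counters can collectively hide a large error on
         * high-CPU machines; take the exact (slow) snapshot whenever
         * the cached value is within the drift margin.
         */
        if (zone->percpu_drift_mark && free_pages < zone->percpu_drift_mark)
                free_pages = zone_page_state_snapshot(zone, NR_FREE_PAGES);

        if (!__zone_watermark_ok(zone, order, mark, highest_zoneidx,
                                 0, free_pages))
                return false;   /* not balanced */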
Link: https://lkml.kernel.org/r/[email protected] Fixes: a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks") Signed-off-by: Johannes Weiner <[email protected]> Reported-by: Vlastimil Babka <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Brendan Jackman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.15-rc2, v6.15-rc1 |
|
| #
4067196a |
| 29-Mar-2025 |
Kirill A. Shutemov <[email protected]> |
mm/page_alloc: fix deadlock on cpu_hotplug_lock in __accept_page()
When the last page in the zone is accepted, __accept_page() calls static_branch_dec(). This function takes cpu_hotplug_lock, which can lead to a deadlock if the allocation occurs during the CPU bringup path, as _cpu_up() also takes the lock.
To prevent this deadlock, defer static_branch_dec() to a workqueue.
Call static_branch_dec() directly only when workqueues are not yet initialized. Workqueues are initialized before CPU bringup, so this will not conflict with the first scenario.
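A sketch of the deferral (hypothetical, condensed; the real code must track one pending decrement per zone):

        static void unaccepted_dec_work_fn(struct work_struct *work)
        {
                static_branch_dec(&zones_with_unaccepted_pages);
        }
        static DECLARE_WORK(unaccepted_dec_work, unaccepted_dec_work_fn);

        /* called when the last page of a zone has been accepted */
        if (system_wq)
                schedule_work(&unaccepted_dec_work);    /* no cpu_hotplug_lock here */
        else
                static_branch_dec(&zones_with_unaccepted_pages); /* pre-workqueue boot */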
Link: https://lkml.kernel.org/r/[email protected] Fixes: 55ad43e8ba0f ("mm: add a helper to accept page") Signed-off-by: Kirill A. Shutemov <[email protected]> Reported-by: Srikanth Aithal <[email protected]> Tested-by: Srikanth Aithal <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ashish Kalra <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "Edgecombe, Rick P" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Mike Rapoport (IBM)" <[email protected]> Cc: Thomas Lendacky <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
90abee6d |
| 07-Apr-2025 |
Johannes Weiner <[email protected]> |
mm: page_alloc: speed up fallbacks in rmqueue_bulk()
The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy") as the root cause of a 56.4% regression in vm-scalability::lru-file-mmap-read.
Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion"), as the root cause for a regression in worst-case zone->lock+irqoff hold times.
Both of these patches modify the page allocator's fallback path to be less greedy in an effort to stave off fragmentation. The flip side of this is that fallbacks are also less productive each time around, which means the fallback search can run much more frequently.
Carlos' traces point to rmqueue_bulk() specifically, which tries to refill the percpu cache by allocating a large batch of pages in a loop. It highlights how once the native freelists are exhausted, the fallback code first scans orders top-down for whole blocks to claim, then falls back to a bottom-up search for the smallest buddy to steal. For the next batch page, it goes through the same thing again.
This can be made more efficient. Since rmqueue_bulk() holds the zone->lock over the entire batch, the freelists are not subject to outside changes; when the search for a block to claim has already failed, there is no point in trying again for the next page.
Modify __rmqueue() to remember the last successful fallback mode, and restart directly from there on the next rmqueue_bulk() iteration.
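A condensed sketch of that approach (hypothetical helper names; the real __rmqueue() carries more state):

        enum rmqueue_mode {
                RMQUEUE_NORMAL, /* native freelists */
                RMQUEUE_CLAIM,  /* claim whole blocks, top-down */
                RMQUEUE_STEAL,  /* steal smallest buddy, bottom-up */
        };

        static struct page *__rmqueue(struct zone *zone, unsigned int order,
                                      int migratetype, unsigned int alloc_flags,
                                      enum rmqueue_mode *mode)
        {
                struct page *page;

                /*
                 * zone->lock is held across the whole rmqueue_bulk() batch,
                 * so a search stage that failed once cannot succeed later;
                 * resume from the last mode that produced a page.
                 */
                switch (*mode) {
                case RMQUEUE_NORMAL:
                        page = __rmqueue_smallest(zone, order, migratetype);
                        if (page)
                                return page;
                        *mode = RMQUEUE_CLAIM;
                        fallthrough;
                case RMQUEUE_CLAIM:
                        page = __rmqueue_claim(zone, order, migratetype, alloc_flags);
                        if (page) {
                                /* a claimed block becomes native: reset */
                                *mode = RMQUEUE_NORMAL;
                                return page;
                        }
                        *mode = RMQUEUE_STEAL;
                        fallthrough;
                case RMQUEUE_STEAL:
                        return __rmqueue_steal(zone, order, migratetype);
                }
                return NULL;
        }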
Oliver confirms that this improves beyond the regression that the test robot reported against c2f6ea38fc1b:
commit:
  f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap")
  c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy")
  acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
  2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()")  <--- your patch

f3b92176f4f7100f  c2f6ea38fc1b640aa7a2e155cc1  acc4d5ff0b61eb1715c498b6536  2c847f27c37da65a93d23c237c5
----------------  ---------------------------  ---------------------------  ---------------------------
     %stddev           %change                       %change                      %change
25525364 ± 3%         -56.4%  11135467              -57.8%  10779336             +31.6%  33581409      vm-scalability.throughput
Carlos confirms that worst-case times are almost fully recovered compared to before the earlier culprit patch:
2dd482ba627d (before freelist hygiene):    1ms
c0cd6f557b90 (after freelist hygiene):    90ms
next-20250319 (steal smallest buddy):    280ms
this patch:                                8ms
[[email protected]: comment updates] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: reset rmqueue_mode in rmqueue_buddy() error loop, per Yunsheng Lin] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy") Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Brendan Jackman <[email protected]> Reported-by: kernel test robot <[email protected]> Reported-by: Carlos Song <[email protected]> Tested-by: Carlos Song <[email protected]> Tested-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-lkp/[email protected] Reviewed-by: Brendan Jackman <[email protected]> Tested-by: Shivank Garg <[email protected]> Acked-by: Zi Yan <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: <[email protected]> [6.10+] Signed-off-by: Andrew Morton <[email protected]>
|
| #
c5bb27e2 |
| 31-Mar-2025 |
Alexei Starovoitov <[email protected]> |
mm/page_alloc: avoid second trylock of zone->lock
spin_trylock followed by spin_lock causes an extra write cache access. If the lock is contended, this may cause unnecessary cache line bouncing and execute a redundant irq restore/save pair. Therefore, check alloc/fpi_flags first and use spin_trylock or spin_lock accordingly.
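A sketch of the before/after (condensed):

        /* Before: a contended trylock is followed by a full lock anyway. */
        if (!spin_trylock_irqsave(&zone->lock, flags)) {
                if (alloc_flags & ALLOC_TRYLOCK)
                        return NULL;
                spin_lock_irqsave(&zone->lock, flags);  /* second write to the lock line */
        }

        /* After: branch on the flag first; one lock operation either way. */
        if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
                if (!spin_trylock_irqsave(&zone->lock, flags))
                        return NULL;
        } else {
                spin_lock_irqsave(&zone->lock, flags);
        }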
Link: https://lkml.kernel.org/r/[email protected] Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation") Signed-off-by: Alexei Starovoitov <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Reviewed-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Harry Yoo <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Andrii Nakryiko <[email protected]> Cc: Daniel Borkman <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
2985dae1 |
| 01-Apr-2025 |
Alexei Starovoitov <[email protected]> |
mm/page_alloc: Fix try_alloc_pages
Fix an obvious bug. try_alloc_pages() should set_page_refcounted.
[ Not so obvious: it was probably correct at the time it was written but was at some point then rebased on top of v6.14-rc1.
And at that point there was a semantic conflict with commit efabfe1420f5 ("mm/page_alloc: move set_page_refcounted() to callers of get_page_from_freelist()") and became buggy. - Linus ]
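The fix in a condensed sketch (after efabfe1420f5, get_page_from_freelist() hands back frozen pages and the caller takes the reference):

        page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
        /* unlike alloc_pages(), there is no slowpath fallback here */
        if (page)
                set_page_refcounted(page);      /* page arrives frozen (refcount 0) */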
Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation") Signed-off-by: Alexei Starovoitov <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Harry Yoo <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|
|
Revision tags: v6.14 |
|
| #
bd145bdd |
| 20-Mar-2025 |
Ye Liu <[email protected]> |
mm/page_alloc: replace flag check with PageHWPoison() in check_new_page_bad()
This patch replaces the direct check for the __PG_HWPOISON flag with the PageHWPoison() macro, improving code readability and maintaining consistency with other parts of the memory management code.
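The change, illustratively:

        /* before: open-coded flag test */
        if (unlikely(page->flags & __PG_HWPOISON))
                return; /* don't complain about hwpoisoned pages */

        /* after: the common predicate used elsewhere in mm */
        if (unlikely(PageHWPoison(page)))
                return;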
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ye Liu <[email protected]> Reviewed-by: Sidhartha Kumar <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
7a95a05f |
| 22-Mar-2025 |
Johannes Weiner <[email protected]> |
mm: page_alloc: fix defrag_mode's retry & OOM path
Brendan points out that defrag_mode doesn't properly clear ALLOC_NOFRAGMENT on its last-ditch attempt to allocate. But looking closer, the problem is actually more severe: it doesn't actually *check* whether it's already retried, and keeps looping. This means the OOM path is never taken, and the thread can loop indefinitely.
This is verified with an intentional OOM test on defrag_mode=1, which results in the machine hanging. After this patch, it triggers the OOM kill reliably and recovers.
Clear ALLOC_NOFRAGMENT properly, and only retry once.
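A condensed sketch of the corrected logic (the surrounding slowpath context is elided):

        /* last-ditch: drop the placement restriction, but only once */
        if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT)) {
                alloc_flags &= ~ALLOC_NOFRAGMENT;       /* was: &= ALLOC_NOFRAGMENT */
                goto retry;
        }
        /* second time around the flag is clear: proceed to the OOM path */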
Link: https://lkml.kernel.org/r/[email protected] Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode") Signed-off-by: Johannes Weiner <[email protected]> Reported-by: Brendan Jackman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
0a1e082b |
| 19-Mar-2025 |
Liu Ye <[email protected]> |
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
The `movable` variable is always used when `CONFIG_TRANSPARENT_HUGEPAGE` is enabled, so the `__maybe_unused` attribute is not necessary. This patch removes it and keeps the variable declaration within the `#ifdef` block for better clarity.
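For reference, a sketch of the resulting shape (condensed from the description; details follow mm/page_alloc.c as of this series):

        static inline unsigned int order_to_pindex(int migratetype, int order)
        {
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                bool movable;   /* always used below: no __maybe_unused needed */

                if (order > PAGE_ALLOC_COSTLY_ORDER) {
                        VM_BUG_ON(order != HPAGE_PMD_ORDER);
                        movable = migratetype == MIGRATE_MOVABLE;
                        return NR_LOWORDER_PCP_LISTS + movable;
                }
        #else
                VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
        #endif
                return (MIGRATE_PCPTYPES * order) + migratetype;
        }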
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liu Ye <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.14-rc7 |
|
| #
1506c255 |
| 14-Mar-2025 |
Matthew Wilcox (Oracle) <[email protected]> |
mm: simplify split_page_memcg()
The last argument to split_page_memcg() is now always 0, so remove it, effectively reverting commit b8791381d7ed.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Zi Yan <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
a211c655 |
| 13-Mar-2025 |
Johannes Weiner <[email protected]> |
mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
The previous patch added pageblock_order reclaim to kswapd/kcompactd, which helps, but produces only one block at a time. Allocation stalls and THP failure rates are still higher than they could be.
To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change the watermarking for kswapd & kcompactd: instead of targeting the high watermark in order-0 pages and checking for one suitable block, simply require that the high watermark is entirely met in pageblocks.
To this end, track the number of free pages within contiguous pageblocks, then change pgdat_balanced() and compact_finished() to check watermarks against this new value.
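A sketch of the new balance check (the counter name follows the description; placement is illustrative):

        /* pgdat_balanced(), defrag_mode: require the watermark in whole blocks */
        unsigned long free_block_pages =
                zone_page_state(zone, NR_FREE_PAGES_BLOCKS);

        if (!__zone_watermark_ok(zone, 0, high_wmark_pages(zone),
                                 highest_zoneidx, 0, free_block_pages))
                return false;   /* keep reclaiming/compacting */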
This further reduces THP latencies and allocation stalls, and improves THP success rates against the previous patch:
                                    DEFRAGMODE-ASYNC    DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean               34300.36 ( +0.00%)       28904.00 ( -15.73%)
Hugealloc Time stddev             36390.42 ( +0.00%)       33464.37 (  -8.04%)
Kbuild Real time                    196.13 ( +0.00%)         196.59 (  +0.23%)
Kbuild User time                   1234.74 ( +0.00%)        1231.67 (  -0.25%)
Kbuild System time                   62.62 ( +0.00%)          59.10 (  -5.54%)
THP fault alloc                   57054.53 ( +0.00%)       63223.67 ( +10.81%)
THP fault fallback                11581.40 ( +0.00%)        5412.47 ( -53.26%)
Direct compact fail                 107.80 ( +0.00%)          59.07 ( -44.79%)
Direct compact success                4.53 ( +0.00%)           2.80 ( -31.33%)
Direct compact success rate %         3.20 ( +0.00%)           3.99 ( +18.66%)
Compact daemon scanned migrate  5461033.93 ( +0.00%)     2267500.33 ( -58.48%)
Compact daemon scanned free     5824897.93 ( +0.00%)     2339773.00 ( -59.83%)
Compact direct scanned migrate    58336.93 ( +0.00%)       47659.93 ( -18.30%)
Compact direct scanned free       32791.87 ( +0.00%)       40729.67 ( +24.21%)
Compact total migrate scanned   5519370.87 ( +0.00%)     2315160.27 ( -58.05%)
Compact total free scanned      5857689.80 ( +0.00%)     2380502.67 ( -59.36%)
Alloc stall                        2424.60 ( +0.00%)         638.87 ( -73.62%)
Pages kswapd scanned            2657018.33 ( +0.00%)     4002186.33 ( +50.63%)
Pages kswapd reclaimed           559583.07 ( +0.00%)      718577.80 ( +28.41%)
Pages direct scanned             722094.07 ( +0.00%)      355172.73 ( -50.81%)
Pages direct reclaimed           107257.80 ( +0.00%)       31162.80 ( -70.95%)
Pages total scanned             3379112.40 ( +0.00%)     4357359.07 ( +28.95%)
Pages total reclaimed            666840.87 ( +0.00%)      749740.60 ( +12.43%)
Swap out                          77238.20 ( +0.00%)      110084.33 ( +42.53%)
Swap in                           11712.80 ( +0.00%)       24457.00 (+108.80%)
File refaults                    143438.80 ( +0.00%)      188226.93 ( +31.22%)
Also of note is that compaction work overall is reduced. The reason for this is that when free pageblocks are more readily available, allocations are also much more likely to get physically placed in LRU order, instead of being forced to scavenge free space here and there. This means that reclaim by itself has better chances of freeing up whole blocks, and the system relies less on compaction.
Comparing all changes to the vanilla kernel:
                                             VANILLA    DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean               52739.45 ( +0.00%)       28904.00 ( -45.19%)
Hugealloc Time stddev             56541.26 ( +0.00%)       33464.37 ( -40.81%)
Kbuild Real time                    197.47 ( +0.00%)         196.59 (  -0.44%)
Kbuild User time                   1240.49 ( +0.00%)        1231.67 (  -0.71%)
Kbuild System time                   70.08 ( +0.00%)          59.10 ( -15.45%)
THP fault alloc                   46727.07 ( +0.00%)       63223.67 ( +35.30%)
THP fault fallback                21910.60 ( +0.00%)        5412.47 ( -75.29%)
Direct compact fail                 195.80 ( +0.00%)          59.07 ( -69.48%)
Direct compact success                7.93 ( +0.00%)           2.80 ( -57.46%)
Direct compact success rate %         3.51 ( +0.00%)           3.99 ( +10.49%)
Compact daemon scanned migrate  3369601.27 ( +0.00%)     2267500.33 ( -32.71%)
Compact daemon scanned free     5075474.47 ( +0.00%)     2339773.00 ( -53.90%)
Compact direct scanned migrate   161787.27 ( +0.00%)       47659.93 ( -70.54%)
Compact direct scanned free      163467.53 ( +0.00%)       40729.67 ( -75.08%)
Compact total migrate scanned   3531388.53 ( +0.00%)     2315160.27 ( -34.44%)
Compact total free scanned      5238942.00 ( +0.00%)     2380502.67 ( -54.56%)
Alloc stall                        2371.07 ( +0.00%)         638.87 ( -73.02%)
Pages kswapd scanned            2160926.73 ( +0.00%)     4002186.33 ( +85.21%)
Pages kswapd reclaimed           533191.07 ( +0.00%)      718577.80 ( +34.77%)
Pages direct scanned             400450.33 ( +0.00%)      355172.73 ( -11.31%)
Pages direct reclaimed            94441.73 ( +0.00%)       31162.80 ( -67.00%)
Pages total scanned             2561377.07 ( +0.00%)     4357359.07 ( +70.12%)
Pages total reclaimed            627632.80 ( +0.00%)      749740.60 ( +19.46%)
Swap out                          47959.53 ( +0.00%)      110084.33 (+129.53%)
Swap in                            7276.00 ( +0.00%)       24457.00 (+236.10%)
File refaults                    138043.00 ( +0.00%)      188226.93 ( +36.35%)
THP allocation latencies and %sys time are down dramatically.
THP allocation failures are down from nearly 50% to 8.5%. And to recall previous data points, the success rates are steady and reliable without the cumulative deterioration of fragmentation events.
Compaction work is down overall. Direct compaction work especially is drastically reduced. As an aside, its success rate of 4% indicates there is room for improvement. For now it's good to rely on it less.
Reclaim work is up overall, however direct reclaim work is down. Part of the increase can be attributed to a higher use of THPs, which due to internal fragmentation increase the memory footprint. This is not necessarily an unexpected side-effect for users of THP.
However, taken both points together, there may well be some opportunities for fine tuning in the reclaim/compaction coordination.
[[email protected]: fix squawks from rebasing] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
101f9d66 |
| 13-Mar-2025 |
Johannes Weiner <[email protected]> |
mm: page_alloc: defrag_mode kswapd/kcompactd assistance
When defrag_mode is enabled, allocation fallbacks strongly prefer whole block conversions instead of polluting or stealing partially used blocks. This means there is a demand for pageblocks even from sub-block requests. Let kswapd/kcompactd help produce them.
By the time kswapd gets woken up, normal rmqueue and block conversion fallbacks have been attempted and failed. So always wake kswapd with the block order; it will take care of producing a suitable compaction gap and then chain-wake kcompactd with the block order when its done.
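A sketch of the wakeup change (condensed; function and flag names are illustrative where not given above):

        if (alloc_flags & ALLOC_KSWAPD) {
                unsigned int wake_order = order;

                /*
                 * defrag_mode creates demand for whole pageblocks even
                 * from sub-block requests: have kswapd target the block
                 * order; it chain-wakes kcompactd with the same order.
                 */
                if (defrag_mode)
                        wake_order = pageblock_order;

                wakeup_kswapd(zone, gfp_mask, wake_order, highest_zoneidx);
        }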
                                             VANILLA           DEFRAGMODE-ASYNC
Hugealloc Time mean               52739.45 ( +0.00%)       34300.36 ( -34.96%)
Hugealloc Time stddev             56541.26 ( +0.00%)       36390.42 ( -35.64%)
Kbuild Real time                    197.47 ( +0.00%)         196.13 (  -0.67%)
Kbuild User time                   1240.49 ( +0.00%)        1234.74 (  -0.46%)
Kbuild System time                   70.08 ( +0.00%)          62.62 ( -10.50%)
THP fault alloc                   46727.07 ( +0.00%)       57054.53 ( +22.10%)
THP fault fallback                21910.60 ( +0.00%)       11581.40 ( -47.14%)
Direct compact fail                 195.80 ( +0.00%)         107.80 ( -44.72%)
Direct compact success                7.93 ( +0.00%)           4.53 ( -38.06%)
Direct compact success rate %         3.51 ( +0.00%)           3.20 (  -6.89%)
Compact daemon scanned migrate  3369601.27 ( +0.00%)     5461033.93 ( +62.07%)
Compact daemon scanned free     5075474.47 ( +0.00%)     5824897.93 ( +14.77%)
Compact direct scanned migrate   161787.27 ( +0.00%)       58336.93 ( -63.94%)
Compact direct scanned free      163467.53 ( +0.00%)       32791.87 ( -79.94%)
Compact total migrate scanned   3531388.53 ( +0.00%)     5519370.87 ( +56.29%)
Compact total free scanned      5238942.00 ( +0.00%)     5857689.80 ( +11.81%)
Alloc stall                        2371.07 ( +0.00%)        2424.60 (  +2.26%)
Pages kswapd scanned            2160926.73 ( +0.00%)     2657018.33 ( +22.96%)
Pages kswapd reclaimed           533191.07 ( +0.00%)      559583.07 (  +4.95%)
Pages direct scanned             400450.33 ( +0.00%)      722094.07 ( +80.32%)
Pages direct reclaimed            94441.73 ( +0.00%)      107257.80 ( +13.57%)
Pages total scanned             2561377.07 ( +0.00%)     3379112.40 ( +31.93%)
Pages total reclaimed            627632.80 ( +0.00%)      666840.87 (  +6.25%)
Swap out                          47959.53 ( +0.00%)       77238.20 ( +61.05%)
Swap in                            7276.00 ( +0.00%)       11712.80 ( +60.97%)
File refaults                    138043.00 ( +0.00%)      143438.80 (  +3.91%)
With this patch, defrag_mode=1 beats the vanilla kernel in THP success rates and allocation latencies. The trend holds over time:
thp_fault_alloc
     VANILLA   DEFRAGMODE-ASYNC
       61988              52066
       56474              58844
       57258              58233
       50187              58476
       52388              54516
       55409              59938
       52925              57204
       47648              60238
       43669              55733
       40621              56211
       36077              59861
       41721              57771
       36685              58579
       34641              51868
       33215              56280
DEFRAGMODE-ASYNC also wins on %sys as ~3/4 of the direct compaction work is shifted to kcompactd.
Reclaim activity is higher. Part of that is simply due to the increased memory footprint from higher THP use. The other aspect is that *direct* reclaim/compaction are still going for requested orders rather than targeting the page blocks required for fallbacks, which is less efficient than it could be. However, this is already a useful tradeoff to make, as in many environments peak periods are short and retaining the ability to produce THP through them is more important.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
e3aa7df3 |
| 13-Mar-2025 |
Johannes Weiner <[email protected]> |
mm: page_alloc: defrag_mode
The page allocator groups requests by migratetype to stave off fragmentation. However, in practice this is routinely defeated by the fact that it gives up *before* invoking reclaim and compaction - which may well produce suitable pages. As a result, fragmentation of physical memory is a common ongoing process in many load scenarios.
Fragmentation deteriorates compaction's ability to produce huge pages. Depending on the lifetime of the fragmenting allocations, those effects can be long-lasting or even permanent, requiring drastic measures like forcible idle states or even reboots as the only reliable ways to recover the address space for THP production.
In a kernel build test with supplemental THP pressure, the THP allocation rate steadily declines over 15 runs:
thp_fault_alloc
      61988
      56474
      57258
      50187
      52388
      55409
      52925
      47648
      43669
      40621
      36077
      41721
      36685
      34641
      33215
This is a hurdle in adopting THP in any environment where hosts are shared between multiple overlapping workloads (cloud environments), and rarely experience true idle periods. To make THP a reliable and predictable optimization, there needs to be a stronger guarantee to avoid such fragmentation.
Introduce defrag_mode. When enabled, reclaim/compaction is invoked to its full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT is enforced on the allocator fastpath and the reclaiming slowpath.
For now, fallbacks are permitted to avert OOMs. There is a plan to add defrag_mode=2 to prefer OOMs over fragmentation, but this requires additional prep work in compaction and the reserve management to make it ready for all possible allocation contexts.
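A sketch of the knob and its main enforcement point (hypothetical; the sysctl name follows the description):

        /* /proc/sys/vm/defrag_mode: 0 = off, 1 = fall back only after reclaim */
        static unsigned int defrag_mode;

        static unsigned int alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
        {
                /*
                 * Insist on whole-block placement for every request;
                 * fragmenting fallbacks are deferred until reclaim and
                 * compaction have run in full (and, for now, are still
                 * permitted as a last resort to avert OOMs).
                 */
                if (defrag_mode)
                        return ALLOC_NOFRAGMENT;

                /* default behavior: ALLOC_NOFRAGMENT only in select cases */
                return 0;
        }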
The following test results are from a kernel build with periodic bursts of THP allocations, over 15 runs:
                                         vanilla   defrag_mode=1
@claimer[unmovable]:                         189             103
@claimer[movable]:                            92             103
@claimer[reclaimable]:                       207              61
@pollute[unmovable from movable]:             25               0
@pollute[unmovable from reclaimable]:         28               0
@pollute[movable from unmovable]:          38835               0
@pollute[movable from reclaimable]:       147136               0
@pollute[reclaimable from unmovable]:        178               0
@pollute[reclaimable from movable]:           33               0
@steal[unmovable from movable]:               11               0
@steal[unmovable from reclaimable]:            5               0
@steal[reclaimable from unmovable]:          107               0
@steal[reclaimable from movable]:             90               0
@steal[movable from reclaimable]:            354               0
@steal[movable from unmovable]:              130               0
Both types of polluting fallbacks are eliminated in this workload.
Interestingly, whole block conversions are reduced as well. This is because once a block is claimed for a type, its empty space remains available for future allocations, instead of being padded with fallbacks; this allows the native type to group up instead of spreading out to new blocks. The assumption in the allocator has been that pollution from movable allocations is less harmful than from other types, since they can be reclaimed or migrated out should the space be needed. However, since fallbacks occur *before* reclaim/compaction is invoked, movable pollution will still cause non-movable allocations to spread out and claim more blocks.
Without fragmentation, THP rates hold steady with defrag_mode=1:
thp_fault_alloc
      32478
      20725
      45045
      32130
      14018
      21711
      40791
      29134
      34458
      45381
      28305
      17265
      22584
      28454
      30850
While the downward trend is eliminated, the keen reader will of course notice that the baseline rate is much smaller than the vanilla kernel's to begin with. This is due to deficiencies in how reclaim and compaction are currently driven: ALLOC_NOFRAGMENT increases the extent to which smaller allocations are competing with THPs for pageblocks, while making no effort themselves to reclaim or compact beyond their own request size. This effect already exists with the current usage of ALLOC_NOFRAGMENT, but is amplified by defrag_mode insisting on whole block stealing much more strongly.
Subsequent patches will address defrag_mode reclaim strategy to raise the THP success baseline above the vanilla kernel.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
f46012c0 |
| 13-Mar-2025 |
Johannes Weiner <[email protected]> |
mm: page_alloc: trace type pollution from compaction capturing
When the page allocator places pages of a certain migratetype into blocks of another type, it has lasting effects on the ability to compact and defragment down the line. For improving placement and compaction, visibility into such events is crucial.
The most common case, allocator fallbacks, is already annotated, but compaction capturing is also allowed to grab pages of a different type. Extend the tracepoint to cover this case.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Zi Yan <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.14-rc6 |
|
| #
15766485 |
| 08-Mar-2025 |
Martin Liu <[email protected]> |
mm/page_alloc: add trace event for totalreserve_pages calculation
This commit introduces a new trace event, `mm_calculate_totalreserve_pages`, which reports the new reserve value at the exact time when it takes effect.
The `totalreserve_pages` value represents the total amount of memory reserved across all zones and nodes in the system. This reserved memory is crucial for ensuring that critical kernel operations have access to sufficient memory, even under memory pressure.
By tracing the `totalreserve_pages` value, developers can gain insight into how the total reserved memory changes over time.
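A sketch of where such an event fires (condensed; the tracepoint name is given above, the surrounding function is illustrative):

        static void calculate_totalreserve_pages(void)
        {
                unsigned long reserve_pages = 0;

                /* ... sum max lowmem reserve + high watermark per zone ... */

                totalreserve_pages = reserve_pages;

                /* report the new value at the moment it takes effect */
                trace_mm_calculate_totalreserve_pages(totalreserve_pages);
        }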
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Martin Liu <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
a293aba4 |
| 08-Mar-2025 |
Martin Liu <[email protected]> |
mm/page_alloc: add trace event for per-zone lowmem reserve setup
This commit introduces the `mm_setup_per_zone_lowmem_reserve` trace event, which provides detailed insights into the kernel's per-zone lowmem reserve configuration.
The trace event provides precise timestamps, allowing developers to:

1. Correlate lowmem reserve changes with specific kernel events, and diagnose unexpected kswapd or direct reclaim behavior triggered by dynamic changes in the lowmem reserve.

2. Understand memory allocation failures that occur due to insufficient lowmem reserve, by precisely correlating allocation attempts with reserve adjustments.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Martin Liu <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
8c02048d |
| 08-Mar-2025 |
Martin Liu <[email protected]> |
mm/page_alloc: add trace event for per-zone watermark setup
Patch series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages", v2.
This patchset introduces tracepoints to track changes in the lowmem reserves, watermarks and totalreserve_pages. This helps to track the exact timing of such changes and understand their relation to reclaim activities.
The tracepoints added are:
mm_setup_per_zone_lowmem_reserve
mm_setup_per_zone_wmarks
mm_calculate_totalreserve_pages
This patch (of 3):
This commit introduces the `mm_setup_per_zone_wmarks` trace event, which provides detailed insights into the kernel's per-zone watermark configuration, offering precise timing and the ability to correlate watermark changes with specific kernel events.
While `/proc/zoneinfo` provides some information about zone watermarks, this trace event offers:
1. The ability to link watermark changes to specific kernel events and logic.
2. The ability to capture rapid or short-lived changes in watermarks that may be missed by user-space polling.
3. Diagnosing unexpected kswapd activity or excessive direct reclaim triggered by rapidly changing watermarks.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Martin Liu <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Martin Liu <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
74949222 |
| 03-Mar-2025 |
David Hildenbrand <[email protected]> |
mm: stop maintaining the per-page mapcount of large folios (CONFIG_NO_PAGE_MAPCOUNT)
Everything is in place to stop using the per-page mapcounts in large folios: the mapcount of tail pages will always be logically 0 (-1 value), just like it currently is for hugetlb folios already, and the page mapcount of the head page is either 0 (-1 value) or contains a page type (e.g., hugetlb).
Maintaining _nr_pages_mapped without per-page mapcounts is impossible, so that one also has to go with CONFIG_NO_PAGE_MAPCOUNT.
There are two remaining implications:
(1) Per-node, per-cgroup and per-lruvec stats of "NR_ANON_MAPPED" ("mapped anonymous memory") and "NR_FILE_MAPPED" ("mapped file memory"):
As soon as any page of the folio is mapped -- folio_mapped() -- we now account the complete folio as mapped. Once the last page is unmapped -- !folio_mapped() -- we account the complete folio as unmapped.
This implies that ...
* "AnonPages" and "Mapped" in /proc/meminfo and /sys/devices/system/node/*/meminfo * cgroup v2: "anon" and "file_mapped" in "memory.stat" and "memory.numa_stat" * cgroup v1: "rss" and "mapped_file" in "memory.stat" and "memory.numa_stat
... can now appear higher than before. But note that these folios do consume that memory, simply not all pages are actually currently mapped.
It's worth noting that other accounting in the kernel (esp. cgroup charging on allocation) is not affected by this change.
[why oh why is "anon" called "rss" in cgroup v1]
(2) Detecting partial mappings
Detecting whether anon THPs are partially mapped gets a bit more unreliable. As long as a single MM maps such a large folio ("exclusively mapped"), we can reliably detect it. Especially before fork() / after a short-lived child process quit, we will detect partial mappings reliably, which is the common case.
In essence, if the average per-page mapcount in an anon THP is < 1, we know for sure that we have a partial mapping.
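As a sketch, that heuristic in code (simplified; valid for the exclusively-mapped case described above):

        /*
         * With a total (large) mapcount below the number of pages,
         * some page of the folio is mapped nowhere: this is certainly
         * a partial mapping.
         */
        static bool anon_thp_partially_mapped(struct folio *folio)
        {
                return folio_large_mapcount(folio) < folio_nr_pages(folio);
        }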
However, as soon as multiple MMs are involved, we might miss detecting partial mappings: this might be relevant with long-lived child processes. If we have a fully-mapped anon folio before fork(), once our child processes and our parent all unmap (zap/COW) the same pages (but not the complete folio), we might not detect the partial mapping. However, once the child processes quit we would detect the partial mapping.
How relevant this case is in practice remains to be seen. Swapout/migration will likely mitigate this.
In the future, RMAP walkers could check for that case (e.g., when collecting access bits during reclaim) and simply flag them for deferred splitting.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
6af8cb80 |
| 03-Mar-2025 |
David Hildenbrand <[email protected]> |
mm/rmap: basic MM owner tracking for large folios (!hugetlb)
For small folios, we traditionally use the mapcount to decide whether it was "certainly mapped exclusively" by a single MM (mapcount == 1) or whether it "maybe mapped shared" by multiple MMs (mapcount > 1). For PMD-sized folios that were PMD-mapped, we were able to use a similar mechanism (single PMD mapping), but for PTE-mapped folios and in the future folios that span multiple PMDs, this does not work.
So we need a different mechanism to handle large folios. Let's add a new mechanism to detect whether a large folio is "certainly mapped exclusively", or whether it is "maybe mapped shared".
We'll use this information next to optimize CoW reuse for PTE-mapped anonymous THP, and to convert folio_likely_mapped_shared() to folio_maybe_mapped_shared(), independent of per-page mapcounts.
For each large folio, we'll have two slots, whereby a slot stores:

 (1) an MM id: unique id assigned to each MM
 (2) a per-MM mapcount
If a slot is unoccupied, it can be taken by the next MM that maps a folio page.
In addition, we'll remember the current state -- "mapped exclusively" vs. "maybe mapped shared" -- and use a bit spinlock to sync on updates and to reduce the total number of atomic accesses on updates. In the future, it might be possible to squeeze a proper spinlock into "struct folio". For now, keep it simple, as we require the whole thing with THP only, that is incompatible with RT.
As we have to squeeze this information into the "struct folio" of even folios of order-1 (2 pages), and we generally want to reduce the required metadata, we'll assign each MM a unique ID that can fit into an int. In total, we can squeeze everything into 4x int (2x long) on 64bit.
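A structural sketch under those constraints (hypothetical field names; the real encoding packs the ids, the lock bit and the shared bit into existing folio words):

        #define FOLIO_MM_SLOTS  2

        struct folio_mm_tracking {
                int mm_id[FOLIO_MM_SLOTS];       /* unique MM id, or unused */
                int mm_mapcount[FOLIO_MM_SLOTS]; /* pages mapped by that MM */
                /* plus: one bit spinlock + one "maybe mapped shared" bit */
        };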
32bit support is a bit challenging, because we only have 2x long == 2x int in order-1 folios. But we can make it work for now, because we neither expect many MMs nor very large folios on 32bit.
We will reliably detect folios as "mapped exclusively" vs. "mapped shared" as long as only two MMs map pages of a folio at one point in time -- for example with fork() and short-lived child processes, or with apps that hand over state from one instance to another.
As soon as three MMs are involved at the same time, we might detect "maybe mapped shared" although the folio is "mapped exclusively".
Example 1:
 (1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0
 (2) App2 faults in a folio page -> Tracked as MM1
 (4) App1 unmaps all folio pages
-> We will detect "mapped exclusively".
Example 2:
 (1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0
 (2) App2 faults in a folio page -> Tracked as MM1
 (3) App3 faults in a folio page -> No slot available, tracked as "unknown"
 (4) App1 and App2 unmap all folio pages
-> We will detect "maybe mapped shared".
Make use of __always_inline to keep possible performance degradation when (un)mapping large folios to a minimum.
Note: by squeezing the two flags into the "unsigned long" that stores the MM ids, we can use non-atomic __bit_spin_unlock() and non-atomic setting/clearing of the "maybe mapped shared" bit, effectively not adding any new atomics on the hot path when updating the large mapcount + new metadata, which further helps reduce the runtime overhead in micro-benchmarks.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
845d2be6 |
| 03-Mar-2025 |
David Hildenbrand <[email protected]> |
mm: move _entire_mapcount in folio to page[2] on 32bit
Let's free up some space on 32bit in page[1] by moving the _entire_mapcount to page[2].
Ordinary folios only use the entire mapcount with PMD mappings, which order-1 folios cannot have. Similarly, hugetlb folios are always larger than order-1, leaving the entire mapcount essentially unused for all order-1 folios. Moving it out of page[1] therefore changes nothing for them.
On 32bit, simply check in folio_entire_mapcount() whether we have an order-1 folio, and return 0 in that case.
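The 32bit special case, as a sketch (close to the shape described; order-1 folios have no page[2] to hold the field):

        static inline int folio_entire_mapcount(const struct folio *folio)
        {
                VM_WARN_ON_FOLIO(!folio_test_large(folio), folio);
                /* on 32bit, order-1 folios never carry an entire mapcount */
                if (!IS_ENABLED(CONFIG_64BIT) && folio_order(folio) == 1)
                        return 0;
                return atomic_read(&folio->_entire_mapcount) + 1;
        }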
Note that THPs on 32bit are not particularly common (and we don't care too much about performance), but we want to keep it working reliably, because likely we want to use large folios there as well in the future, independent of PMD leaf support.
Once we dynamically allocate "struct folio", the 32bit specifics will go away again; even small folios could then have a pincount.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
31a31da8 |
| 03-Mar-2025 |
David Hildenbrand <[email protected]> |
mm: move _pincount in folio to page[2] on 32bit
Let's free up some space on 32bit in page[1] by moving the _pincount to page[2].
For order-1 folios (never anon folios!) on 32bit, we will now also use the GUP_PIN_COUNTING_BIAS approach. A fully-mapped order-1 folio requires 2 references. With GUP_PIN_COUNTING_BIAS being 1024, we'd detect such folios as "maybe pinned" with 512 full mappings, instead of 1024 for order-0. As anon folios are out of the picture (which are the most relevant users of checking for pinnings on *mapped* pages) and we are talking about 32bit, this is not expected to cause any trouble.
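The arithmetic from the paragraph above, as a sketch (GUP_PIN_COUNTING_BIAS is the existing constant; the helper name is illustrative):

        /*
         * Each pin biases the refcount by GUP_PIN_COUNTING_BIAS (1024);
         * each mapped page adds 1. A fully-mapped order-1 folio holds 2
         * mapping references, so ~512 full mappings are needed before a
         * folio is falsely reported "maybe pinned" (vs 1024 for order-0).
         */
        static inline bool folio_maybe_pinned_sketch(const struct folio *folio)
        {
                return folio_ref_count(folio) >= GUP_PIN_COUNTING_BIAS;
        }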
In __dump_page(), copy one additional folio page if we detect a folio with an order > 1, so we can dump the pincount on order > 1 folios reliably.
Note that THPs on 32bit are not particularly common (and we don't care too much about performance), but we want to keep it working reliably, because likely we want to use large folios there as well in the future, independent of PMD leaf support.
Once we dynamically allocate "struct folio", fortunately the 32bit specifics will likely go away again; even small folios could then have a pincount and folio_has_pincount() would essentially always return "true".
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
4eeec8c8 |
| 03-Mar-2025 |
David Hildenbrand <[email protected]> |
mm: move hugetlb specific things in folio to page[3]
Let's just move the hugetlb specific stuff to a separate page, and stop letting it overlay other fields for now.
This frees up some space in page[2], which we will use on 32bit to free up some space in page[1]. While we could move these things to page[3] instead, it's cleaner to just move the hugetlb specific things out of the way and pack the core-folio stuff as tight as possible. ... and we can minimize the work required in dump_folio.
We can now avoid re-initializing &folio->_deferred_list in hugetlb code.
Hopefully dynamically allocating "struct folio" in the future will further clean this up.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|