Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1
a84edd52 | 01-Apr-2025 | Vishal Moola (Oracle) <[email protected]>
mm/compaction: fix bug in hugetlb handling pathway
The compaction code doesn't take references on pages until we're certain we should attempt to handle them.
In the hugetlb case, isolate_or_dissolve_huge_page() may return -EBUSY without taking a reference to the folio associated with our pfn. If our folio's refcount drops to 0, compound_nr() becomes unpredictable, making low_pfn and nr_scanned unreliable. The user-visible effect is minimal - this should rarely happen (if ever).
Fix this by storing the folio statistics earlier on the stack (just like the THP and Buddy cases).
Also revert commit 66fe1cf7f581 ("mm: compaction: use helper compound_nr in isolate_migratepages_block") to make backporting easier.
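As an illustration, here is a minimal sketch of the pattern the fix describes, paraphrased from this changelog rather than quoting the actual diff: the folio's span is stashed on the stack before calling into the hugetlb path, which can fail without holding a reference.

```c
/*
 * Sketch of the fixed pattern in isolate_migratepages_block(); paraphrased
 * from the changelog above, not the verbatim kernel diff.
 */
if (PageHuge(page)) {
	/*
	 * Stash the folio's span on the stack first, like the THP and
	 * Buddy cases do: if isolate_or_dissolve_huge_page() fails with
	 * -EBUSY without pinning the folio, its refcount may hit zero
	 * and compound_nr() would no longer be trustworthy.
	 */
	const unsigned long nr_pages = compound_nr(page);

	ret = isolate_or_dissolve_huge_page(page, &cc->migratepages);
	if (ret == -EBUSY) {
		/* advance the scan using the stashed, still-valid size */
		low_pfn += nr_pages - 1;
		nr_scanned += nr_pages - 1;
		goto isolate_fail;
	}
}
```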
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 369fa227c219 ("mm: make alloc_contig_range handle free hugetlb pages")
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Acked-by: Oscar Salvador <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.14, v6.14-rc7
a211c655 | 13-Mar-2025 | Johannes Weiner <[email protected]>
mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
The previous patch added pageblock_order reclaim to kswapd/kcompactd, which helps, but produces only one block at a time. Allocation stalls and THP failure rates are still higher than they could be.
To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change the watermarking for kswapd & kcompactd: instead of targeting the high watermark in order-0 pages and checking for one suitable block, simply require that the high watermark is entirely met in pageblocks.
To this end, track the number of free pages within contiguous pageblocks, then change pgdat_balanced() and compact_finished() to check watermarks against this new value.
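A hedged sketch of the new balance test, assuming the free-pages-in-whole-pageblocks counter introduced here is named NR_FREE_PAGES_BLOCKS as in this changelog; the helper name is made up for illustration:

```c
/*
 * Illustrative only: balance kswapd/kcompactd against pages that sit in
 * fully free pageblocks rather than the order-0 free count. The counter
 * name follows the changelog; the helper name is hypothetical.
 */
static bool zone_high_wmark_in_blocks(struct zone *zone)
{
	unsigned long free_block_pages =
		zone_page_state(zone, NR_FREE_PAGES_BLOCKS);

	/* balanced only when whole pageblocks alone meet the high mark */
	return free_block_pages >= high_wmark_pages(zone);
}
```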
This further reduces THP latencies and allocation stalls, and improves THP success rates against the previous patch:
                                   DEFRAGMODE-ASYNC       DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean              34300.36 (  +0.00%)      28904.00 ( -15.73%)
Hugealloc Time stddev            36390.42 (  +0.00%)      33464.37 (  -8.04%)
Kbuild Real time                   196.13 (  +0.00%)        196.59 (  +0.23%)
Kbuild User time                  1234.74 (  +0.00%)       1231.67 (  -0.25%)
Kbuild System time                  62.62 (  +0.00%)         59.10 (  -5.54%)
THP fault alloc                  57054.53 (  +0.00%)      63223.67 ( +10.81%)
THP fault fallback               11581.40 (  +0.00%)       5412.47 ( -53.26%)
Direct compact fail                107.80 (  +0.00%)         59.07 ( -44.79%)
Direct compact success               4.53 (  +0.00%)          2.80 ( -31.33%)
Direct compact success rate %        3.20 (  +0.00%)          3.99 ( +18.66%)
Compact daemon scanned migrate 5461033.93 (  +0.00%)    2267500.33 ( -58.48%)
Compact daemon scanned free    5824897.93 (  +0.00%)    2339773.00 ( -59.83%)
Compact direct scanned migrate   58336.93 (  +0.00%)      47659.93 ( -18.30%)
Compact direct scanned free      32791.87 (  +0.00%)      40729.67 ( +24.21%)
Compact total migrate scanned  5519370.87 (  +0.00%)    2315160.27 ( -58.05%)
Compact total free scanned     5857689.80 (  +0.00%)    2380502.67 ( -59.36%)
Alloc stall                       2424.60 (  +0.00%)        638.87 ( -73.62%)
Pages kswapd scanned           2657018.33 (  +0.00%)    4002186.33 ( +50.63%)
Pages kswapd reclaimed          559583.07 (  +0.00%)     718577.80 ( +28.41%)
Pages direct scanned            722094.07 (  +0.00%)     355172.73 ( -50.81%)
Pages direct reclaimed          107257.80 (  +0.00%)      31162.80 ( -70.95%)
Pages total scanned            3379112.40 (  +0.00%)    4357359.07 ( +28.95%)
Pages total reclaimed           666840.87 (  +0.00%)     749740.60 ( +12.43%)
Swap out                         77238.20 (  +0.00%)     110084.33 ( +42.53%)
Swap in                          11712.80 (  +0.00%)      24457.00 (+108.80%)
File refaults                   143438.80 (  +0.00%)     188226.93 ( +31.22%)
Also of note is that compaction work overall is reduced. The reason for this is that when free pageblocks are more readily available, allocations are also much more likely to get physically placed in LRU order, instead of being forced to scavenge free space here and there. This means that reclaim by itself has better chances of freeing up whole blocks, and the system relies less on compaction.
Comparing all changes to the vanilla kernel:
                                   VANILLA                DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean              52739.45 (  +0.00%)      28904.00 ( -45.19%)
Hugealloc Time stddev            56541.26 (  +0.00%)      33464.37 ( -40.81%)
Kbuild Real time                   197.47 (  +0.00%)        196.59 (  -0.44%)
Kbuild User time                  1240.49 (  +0.00%)       1231.67 (  -0.71%)
Kbuild System time                  70.08 (  +0.00%)         59.10 ( -15.45%)
THP fault alloc                  46727.07 (  +0.00%)      63223.67 ( +35.30%)
THP fault fallback               21910.60 (  +0.00%)       5412.47 ( -75.29%)
Direct compact fail                195.80 (  +0.00%)         59.07 ( -69.48%)
Direct compact success               7.93 (  +0.00%)          2.80 ( -57.46%)
Direct compact success rate %        3.51 (  +0.00%)          3.99 ( +10.49%)
Compact daemon scanned migrate 3369601.27 (  +0.00%)    2267500.33 ( -32.71%)
Compact daemon scanned free    5075474.47 (  +0.00%)    2339773.00 ( -53.90%)
Compact direct scanned migrate  161787.27 (  +0.00%)      47659.93 ( -70.54%)
Compact direct scanned free     163467.53 (  +0.00%)      40729.67 ( -75.08%)
Compact total migrate scanned  3531388.53 (  +0.00%)    2315160.27 ( -34.44%)
Compact total free scanned     5238942.00 (  +0.00%)    2380502.67 ( -54.56%)
Alloc stall                       2371.07 (  +0.00%)        638.87 ( -73.02%)
Pages kswapd scanned           2160926.73 (  +0.00%)    4002186.33 ( +85.21%)
Pages kswapd reclaimed          533191.07 (  +0.00%)     718577.80 ( +34.77%)
Pages direct scanned            400450.33 (  +0.00%)     355172.73 ( -11.31%)
Pages direct reclaimed           94441.73 (  +0.00%)      31162.80 ( -67.00%)
Pages total scanned            2561377.07 (  +0.00%)    4357359.07 ( +70.12%)
Pages total reclaimed           627632.80 (  +0.00%)     749740.60 ( +19.46%)
Swap out                         47959.53 (  +0.00%)     110084.33 (+129.53%)
Swap in                           7276.00 (  +0.00%)      24457.00 (+236.10%)
File refaults                   138043.00 (  +0.00%)     188226.93 ( +36.35%)
THP allocation latencies and %sys time are down dramatically.
THP allocation failures are down from nearly 50% to 8.5%. And to recall previous data points, the success rates are steady and reliable without the cumulative deterioration of fragmentation events.
Compaction work is down overall. Direct compaction work especially is drastically reduced. As an aside, its success rate of 4% indicates there is room for improvement. For now it's good to rely on it less.
Reclaim work is up overall, however direct reclaim work is down. Part of the increase can be attributed to a higher use of THPs, which due to internal fragmentation increase the memory footprint. This is not necessarily an unexpected side-effect for users of THP.
However, taken both points together, there may well be some opportunities for fine tuning in the reclaim/compaction coordination.
[[email protected]: fix squawks from rebasing] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
67914ac0 | 13-Mar-2025 | Johannes Weiner <[email protected]>
mm: compaction: push watermark into compaction_suitable() callers
Patch series "mm: reliable huge page allocator".
This series makes changes to the allocator and reclaim/compaction code to try harder to avoid fragmentation. As a result, this makes huge page allocations cheaper, more reliable and more sustainable.
It's a subset of the huge page allocator RFC initially proposed here:
https://lore.kernel.org/lkml/[email protected]/
The following results are from a kernel build test, with additional concurrent bursts of THP allocations on a memory-constrained system. Comparing before and after the changes over 15 runs:
                                   before                 after
Hugealloc Time mean              52739.45 (  +0.00%)      28904.00 ( -45.19%)
Hugealloc Time stddev            56541.26 (  +0.00%)      33464.37 ( -40.81%)
Kbuild Real time                   197.47 (  +0.00%)        196.59 (  -0.44%)
Kbuild User time                  1240.49 (  +0.00%)       1231.67 (  -0.71%)
Kbuild System time                  70.08 (  +0.00%)         59.10 ( -15.45%)
THP fault alloc                  46727.07 (  +0.00%)      63223.67 ( +35.30%)
THP fault fallback               21910.60 (  +0.00%)       5412.47 ( -75.29%)
Direct compact fail                195.80 (  +0.00%)         59.07 ( -69.48%)
Direct compact success               7.93 (  +0.00%)          2.80 ( -57.46%)
Direct compact success rate %        3.51 (  +0.00%)          3.99 ( +10.49%)
Compact daemon scanned migrate 3369601.27 (  +0.00%)    2267500.33 ( -32.71%)
Compact daemon scanned free    5075474.47 (  +0.00%)    2339773.00 ( -53.90%)
Compact direct scanned migrate  161787.27 (  +0.00%)      47659.93 ( -70.54%)
Compact direct scanned free     163467.53 (  +0.00%)      40729.67 ( -75.08%)
Compact total migrate scanned  3531388.53 (  +0.00%)    2315160.27 ( -34.44%)
Compact total free scanned     5238942.00 (  +0.00%)    2380502.67 ( -54.56%)
Alloc stall                       2371.07 (  +0.00%)        638.87 ( -73.02%)
Pages kswapd scanned           2160926.73 (  +0.00%)    4002186.33 ( +85.21%)
Pages kswapd reclaimed          533191.07 (  +0.00%)     718577.80 ( +34.77%)
Pages direct scanned            400450.33 (  +0.00%)     355172.73 ( -11.31%)
Pages direct reclaimed           94441.73 (  +0.00%)      31162.80 ( -67.00%)
Pages total scanned            2561377.07 (  +0.00%)    4357359.07 ( +70.12%)
Pages total reclaimed           627632.80 (  +0.00%)     749740.60 ( +19.46%)
Swap out                         47959.53 (  +0.00%)     110084.33 (+129.53%)
Swap in                           7276.00 (  +0.00%)      24457.00 (+236.10%)
File refaults                   138043.00 (  +0.00%)     188226.93 ( +36.35%)
THP latencies are cut in half, and failure rates are cut by 75%. These metrics also hold up over time, while the vanilla kernel sees a steady downward trend in success rates with each subsequent run, owed to the cumulative effects of fragmentation.
A more detailed discussion of results is in the patch changelogs.
The patches first introduce a vm.defrag_mode sysctl, which enforces the existing ALLOC_NOFRAGMENT alloc flag until after reclaim and compaction have run. They then change kswapd and kcompactd to target pageblocks, which boosts success in the ALLOC_NOFRAGMENT hotpaths.
Patches #1 and #2 are somewhat unrelated cleanups, but touch the same code and so are included here to avoid conflicts from re-ordering.
This patch (of 5):
compaction_suitable() hardcodes the min watermark, with a boost to the low watermark for costly orders. However, compaction_ready() requires order-0 at the high watermark. It currently checks the marks twice.
Make the watermark a parameter to compaction_suitable() and have the callers pass in what they require:
- compaction_zonelist_suitable() is used by the direct reclaim path, so use the min watermark.
- compaction_suit_allocation_order() has a watermark in its context derived from cc->alloc_flags.
The only quirk is that kcompactd doesn't initialize cc->alloc_flags explicitly. There is a direct check in kcompactd_do_work() that passes ALLOC_WMARK_MIN, but there is another check downstack in compact_zone() that ends up passing the unset alloc_flags. Since they default to 0, and that coincides with ALLOC_WMARK_MIN, it is correct. But it's subtle. Set cc->alloc_flags explicitly.
- should_continue_reclaim() is direct reclaim, use the min watermark.
- Finally, consolidate the two checks in compaction_ready() to a single compaction_suitable() call passing the high watermark.
There is a tiny change in behavior: before, compaction_suitable() would check order-0 against min or low, depending on costly order. Then there'd be another high watermark check.
Now, the high watermark is passed to compaction_suitable(), and the costly order-boost (low - min) is added on top. This means compaction_ready() sets a marginally higher target for free pages.
In a kernel build + THP pressure test, though, this didn't show any measurable negative effects on memory pressure or reclaim rates. As the comment above the check says, reclaim is usually stopped short by should_continue_reclaim(), and this just defines the worst-case reclaim cutoff in case compaction is not making any headway.
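Sketched from the list above, the reworked interface and two representative call sites; signatures are paraphrased from the changelog, not quoted from the diff:

```c
/* The watermark is now the caller's choice instead of being hardcoded. */
bool compaction_suitable(struct zone *zone, int order,
			 unsigned long watermark, int highest_zoneidx);

/* Direct reclaim paths pass the bare minimum: */
suitable = compaction_suitable(zone, sc->order,
			       min_wmark_pages(zone), sc->reclaim_idx);

/* compaction_ready() collapses its two checks into one call at the high mark: */
suitable = compaction_suitable(zone, sc->order,
			       high_wmark_pages(zone), sc->reclaim_idx);
```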
[[email protected]: stop oops on out-of-range highest_zoneidx] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Zi Yan <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.14-rc6, v6.14-rc5
e47f1f56 | 28-Feb-2025 | Brendan Jackman <[email protected]>
mm/page_alloc: clarify terminology in migratetype fallback code
Patch series "mm/page_alloc: Some clarifications for migratetype fallback", v4.
A couple of patches to try and make the code easier to follow.
This patch (of 2):
This code is rather confusing because:
1. "Steal" is sometimes used to refer to the general concept of allocating from a from a block of a fallback migratetype (steal_suitable_fallback()) but sometimes it refers specifically to converting a whole block's migratetype (can_steal_fallback()).
2. can_steal_fallback() sounds as though it's answering the question "am I functionally permitted to allocate from that other type" but in fact it is encoding a heuristic preference.
3. The same piece of data has different names in different places: can_steal vs whole_block. This reinforces point 2 because it looks like the different names reflect a shift in intent from "am I allowed to steal" to "do I want to steal", but no such shift exists.
Fix 1. by avoiding the term "steal" in ambiguous contexts. Start using the term "claim" to refer to the special case of stealing the entire block.
Fix 2. by using "should" instead of "can", and also rename its parameters and add some commentary to make it more explicit what they mean.
Fix 3. by adopting the new "claim" terminology universally for this set of variables.
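A plausible before/after of the rename, with simplified signatures; the exact new identifiers are not quoted in this changelog, so treat the "after" name as illustrative:

```c
/* before: "can" reads as permission, and "steal" is ambiguous */
static bool can_steal_fallback(unsigned int order, int start_mt);

/* after (illustrative name): "should" marks this as a heuristic
 * preference, and "claim" means converting the whole block's
 * migratetype rather than just allocating from it */
static bool should_try_claim_block(unsigned int order, int start_mt);
```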
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Brendan Jackman <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
ce6d9c1c | 25-Feb-2025 | Mike Snitzer <[email protected]>
NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback
Add PF_KCOMPACTD flag and current_is_kcompactd() helper to check for it so nfs_release_folio() can skip calling nfs_wb_folio() from kcompactd.
Otherwise NFS can deadlock waiting for kcompactd-induced writeback which recurses back into NFS (triggering writeback to NFSD via an NFS loopback mount on the same host; NFSD blocks waiting on XFS's call to __filemap_get_folio):
[ 6070.550357] INFO: task kcompactd0:58 blocked for more than 4435 seconds.
{---
[58] "kcompactd0"
[<0>] folio_wait_bit+0xe8/0x200
[<0>] folio_wait_writeback+0x2b/0x80
[<0>] nfs_wb_folio+0x80/0x1b0 [nfs]
[<0>] nfs_release_folio+0x68/0x130 [nfs]
[<0>] split_huge_page_to_list_to_order+0x362/0x840
[<0>] migrate_pages_batch+0x43d/0xb90
[<0>] migrate_pages_sync+0x9a/0x240
[<0>] migrate_pages+0x93c/0x9f0
[<0>] compact_zone+0x8e2/0x1030
[<0>] compact_node+0xdb/0x120
[<0>] kcompactd+0x121/0x2e0
[<0>] kthread+0xcf/0x100
[<0>] ret_from_fork+0x31/0x40
[<0>] ret_from_fork_asm+0x1a/0x30
---}
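A hedged sketch of the helper and a guard at the NFS call site; the helper name matches the changelog, but the surrounding NFS logic is simplified, not the exact patch:

```c
/* PF_KCOMPACTD is the new per-task flag set while kcompactd runs. */
static inline bool current_is_kcompactd(void)
{
	return current->flags & PF_KCOMPACTD;
}

/* In nfs_release_folio() (simplified): skip the nfs_wb_folio() writeback
 * wait when kcompactd is the caller, since that wait can recurse into NFS
 * via a loopback mount on the same host and deadlock. */
if (folio_test_dirty(folio) && current_is_kcompactd())
	return false;	/* refuse to release rather than risk deadlock */
```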
[[email protected]: fix build] Link: https://lkml.kernel.org/r/[email protected] Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page") Signed-off-by: Mike Snitzer <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1
6268f0a1 | 25-Jan-2025 | yangge <[email protected]>
mm: compaction: use the proper flag to determine watermarks
There are 4 NUMA nodes on my machine, and each NUMA node has 32GB of memory. I have configured 16GB of CMA memory on each NUMA node, and starting a 32GB virtual machine with device passthrough is extremely slow, taking almost an hour.
Long term GUP cannot allocate memory from CMA area, so a maximum of 16 GB of no-CMA memory on a NUMA node can be used as virtual machine memory. There is 16GB of free CMA memory on a NUMA node, which is sufficient to pass the order-0 watermark check, causing the __compaction_suitable() function to consistently return true.
For costly allocations, if the __compaction_suitable() function always returns true, it causes the __alloc_pages_slowpath() function to fail to exit at the appropriate point. This prevents timely fallback to allocating memory on other nodes, ultimately resulting in excessively long virtual machine startup times.
Call trace:
__alloc_pages_slowpath
    if (compact_result == COMPACT_SKIPPED ||
        compact_result == COMPACT_DEFERRED)
        goto nopage; // should exit __alloc_pages_slowpath() from here
We could use the real unmovable allocation context to have __zone_watermark_unusable_free() subtract CMA pages, and thus we won't pass the order-0 check anymore once the non-CMA part is exhausted. There is some risk that in some different scenario the compaction could in fact migrate pages from the exhausted non-CMA part of the zone to the CMA part and succeed, and we'll skip it instead. But only __GFP_NORETRY allocations should be affected in the immediate "goto nopage" when compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't fail without trying to compact-migrate the non-CMA pageblocks into CMA pageblocks first, so it should be fine.
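A minimal sketch of the idea, simplified from this changelog rather than the exact patch: do the order-0 suitability check with the real allocation context, so ALLOC_CMA is absent for unmovable allocations and __zone_watermark_unusable_free() subtracts the free CMA pages:

```c
/* In the compaction-suitability order-0 check (sketch): */
unsigned long watermark = low_wmark_pages(zone) + compact_gap(cc->order);

/* cc->alloc_flags carries the caller's real context; no implicit
 * ALLOC_CMA, so an exhausted non-CMA part of the zone now fails here */
if (!zone_watermark_ok(zone, 0, watermark, cc->highest_zoneidx,
		       cc->alloc_flags))
	return COMPACT_SKIPPED;	/* lets __alloc_pages_slowpath() bail and
				 * fall back to other nodes promptly */
```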
After this fix, it only takes a few tens of seconds to start a 32GB virtual machine with device passthrough functionality.
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: yangge <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Barry Song <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
1751f872 | 28-Jan-2025 | Joel Granados <[email protected]>
treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/inifiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers.
Created this by running an spatch followed by a sed command:

Spatch:

```
virtual patch

@
depends on !(file in "net")
disable optional_qualifier
@

identifier table_name != {
  watchdog_hardlockup_sysctl,
  iwcm_ctl_table,
  ucma_ctl_table,
  memory_allocation_profiling_sysctls,
  loadpin_sysctl_table
};
@@

+ const struct ctl_table table_name [] = { ... };
```

sed:

```
sed --in-place \
    -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
    kernel/utsname_sysctl.c
```
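For a picture of the resulting shape, a sketch using the compaction sysctl table from this area of the tree; entries are abbreviated and field details are from memory, so treat it as illustrative:

```c
/* After the treewide change the array can live in .rodata, so its
 * proc_handler pointers can no longer be modified at run time. */
static const struct ctl_table vm_compaction[] = {
	{
		.procname	= "compact_memory",
		.data		= &sysctl_compact_memory,
		.maxlen		= sizeof(int),
		.mode		= 0200,
		.proc_handler	= sysctl_compaction_handler,
	},
	/* further entries elided */
};
```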
Reviewed-by: Song Liu <[email protected]>
Acked-by: Steven Rostedt (Google) <[email protected]> # for kernel/trace/
Reviewed-by: Martin K. Petersen <[email protected]> # SCSI
Reviewed-by: Darrick J. Wong <[email protected]> # xfs
Acked-by: Jani Nikula <[email protected]>
Acked-by: Corey Minyard <[email protected]>
Acked-by: Wei Liu <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Reviewed-by: Bill O'Donnell <[email protected]>
Acked-by: Baoquan He <[email protected]>
Acked-by: Ashutosh Dixit <[email protected]>
Acked-by: Anna Schumaker <[email protected]>
Signed-off-by: Joel Granados <[email protected]>
d1366e74 | 23-Jan-2025 | Liu Shixin <[email protected]>
mm/compaction: fix UBSAN shift-out-of-bounds warning
syzkaller reported a UBSAN shift-out-of-bounds warning of (1UL << order) in isolate_freepages_block(). The bogus compound_order can be any value because it is in a union with flags. Add back the MAX_PAGE_ORDER check to fix the warning.
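A sketch of the restored check, close to the fix as described but simplified: the order read from a racing page can be arbitrary, so it must be bounded before being used in a shift.

```c
/* In isolate_freepages_block() (sketch): */
if (PageCompound(page)) {
	const unsigned int order = compound_order(page);

	/* the order field unions with page flags, so under a race it
	 * can be garbage; bound it before "1UL << order" or UBSAN
	 * reports shift-out-of-bounds */
	if (order <= MAX_PAGE_ORDER) {
		blockpfn += (1UL << order) - 1;
		page += (1UL << order) - 1;
	}
	goto isolate_fail;
}
```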
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 3da0272a4c7d ("mm/compaction: correctly return failure with bogus compound_order in strict mode")
Signed-off-by: Liu Shixin <[email protected]>
Reviewed-by: Kemeng Shi <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kemeng Shi <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1
8fd10a89 | 25-Nov-2024 | Matthew Wilcox (Oracle) <[email protected]>
mm/page_alloc: move set_page_refcounted() to callers of post_alloc_hook()
In preparation for allocating frozen pages, stop initialising the page refcount in post_alloc_hook().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: William Kucharski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1
54880b5a | 26-Sep-2024 | Frederic Weisbecker <[email protected]>
mm: Create/affine kcompactd to its preferred node
Kcompactd is dedicated to a specific node. As such it wants to be preferably affine to it, memory- and CPU-wise.
Use the proper kthread API to achieve that. As a bonus it takes care of CPU-hotplug events and CPU-isolation on its behalf.
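A hedged sketch, paraphrasing the changelog: let the kthread core keep kcompactd on its node's CPUs instead of open-coding cpumask handling. The affinity call is illustrative of the kthread preferred-affinity interface from this series, not the exact diff:

```c
pgdat->kcompactd = kthread_create_on_node(kcompactd, pgdat, pgdat->node_id,
					  "kcompactd%d", pgdat->node_id);
if (!IS_ERR(pgdat->kcompactd)) {
	/* the kthread core now tracks hotplug/isolation on our behalf */
	kthread_affine_preferred(pgdat->kcompactd,
				 cpumask_of_node(pgdat->node_id));
	wake_up_process(pgdat->kcompactd);
}
```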
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Revision tags: v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5
435b3894 | 22-Aug-2024 | Zhongkun He <[email protected]>
mm:page_alloc: fix the NULL ac->nodemask in __alloc_pages_slowpath()
should_reclaim_retry() is not ALLOC_CPUSET aware and that means that it considers reclaimability of NUMA nodes which are outside of the cpuset. If other nodes have a lot of reclaimable memory then should_reclaim_retry would instruct the page allocator to retry even though there is no memory reclaimable on the cpuset nodemask. This is not really a huge problem because the number of retries without any reclaim progress is bounded, but it could certainly be improved. This is a cold path so it shouldn't really have a measurable impact on performance on most workloads.
1. Test steps and the machine.
------------
root@vm:/sys/fs/cgroup/test# numactl -H | grep size
node 0 size: 9477 MB
node 1 size: 10079 MB
node 2 size: 10079 MB
node 3 size: 10078 MB

root@vm:/sys/fs/cgroup/test# cat cpuset.mems
2

root@vm:/sys/fs/cgroup/test# stress --vm 1 --vm-bytes 12g --vm-keep
stress: info: [33430] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [33430] (425) <-- worker 33431 got signal 9
stress: WARN: [33430] (427) now reaping child worker processes
stress: FAIL: [33430] (461) failed run completed in 2s
2. reclaim_retry_zone info:
We can only alloc pages from node=2, but the reclaim_retry_zone is node=0 and return true.
root@vm:/sys/kernel/debug/tracing# cat trace
stress-33431 [001] ..... 13223.617311: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=1 wmark_check=1
stress-33431 [001] ..... 13223.617682: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=2 wmark_check=1
stress-33431 [001] ..... 13223.618103: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=3 wmark_check=1
stress-33431 [001] ..... 13223.618454: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=4 wmark_check=1
stress-33431 [001] ..... 13223.618770: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=5 wmark_check=1
stress-33431 [001] ..... 13223.619150: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=6 wmark_check=1
stress-33431 [001] ..... 13223.619510: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=7 wmark_check=1
stress-33431 [001] ..... 13223.619850: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=8 wmark_check=1
stress-33431 [001] ..... 13223.620171: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=9 wmark_check=1
stress-33431 [001] ..... 13223.620533: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=10 wmark_check=1
stress-33431 [001] ..... 13223.620894: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=11 wmark_check=1
stress-33431 [001] ..... 13223.621224: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=12 wmark_check=1
stress-33431 [001] ..... 13223.621551: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=13 wmark_check=1
stress-33431 [001] ..... 13223.621847: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=14 wmark_check=1
stress-33431 [001] ..... 13223.622200: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=15 wmark_check=1
stress-33431 [001] ..... 13223.622580: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=16 wmark_check=1
With this patch, we check the right node and get fewer retries in __alloc_pages_slowpath() because there is nothing to do.
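A minimal sketch of the fix as described, simplified: when the allocation is cpuset-constrained and no explicit nodemask was passed, make the slowpath (and thus should_reclaim_retry()) consider only the cpuset's nodes:

```c
/* In __alloc_pages_slowpath() (sketch): */
if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL) && !ac->nodemask)
	ac->nodemask = &cpuset_current_mems_allowed;
```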
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zhongkun He <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.11-rc4
e98337d1 | 14-Aug-2024 | Yu Zhao <[email protected]>
mm/contig_alloc: support __GFP_COMP
Patch series "mm/hugetlb: alloc/free gigantic folios", v2.
Using __GFP_COMP for gigantic folios can greatly reduce not only the amount of code but also the allocation and free time.
Approximate LOC to mm/hugetlb.c: +60, -240
Allocate and free 500 1GB hugeTLB memory without HVO by:
    time echo 500 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
    time echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

           Before  After
    Alloc  ~13s    ~10s
    Free   ~15s    <1s
The above magnitude generally holds for multiple x86 and arm64 CPU models.
Perf profile before:

  Alloc
    - 99.99% alloc_pool_huge_folio
       - __alloc_fresh_hugetlb_folio
          - 83.23% alloc_contig_pages_noprof
             - 47.46% alloc_contig_range_noprof
                - 20.96% isolate_freepages_range
                  16.10% split_page
                - 14.10% start_isolate_page_range
                - 12.02% undo_isolate_page_range

  Free
    - update_and_free_pages_bulk
       - 87.71% free_contig_range
          - 76.02% free_unref_page
             - 41.30% free_unref_page_commit
                - 32.58% free_pcppages_bulk
                   - 24.75% __free_one_page
         13.96% _raw_spin_trylock
       12.27% __update_and_free_hugetlb_folio

Perf profile after:

  Alloc
    - 99.99% alloc_pool_huge_folio
         alloc_gigantic_folio
       - alloc_contig_pages_noprof
          - 59.15% alloc_contig_range_noprof
             - 20.72% start_isolate_page_range
               20.64% prep_new_page
             - 17.13% undo_isolate_page_range

  Free
    - update_and_free_pages_bulk
       - __folio_put
          - __free_pages_ok
               7.46% free_tail_page_prepare
             - 1.97% free_one_page
                  1.86% __free_one_page
This patch (of 3):
Support __GFP_COMP in alloc_contig_range(). When the flag is set, upon success the function returns a large folio prepared by prep_new_page(), rather than a range of order-0 pages prepared by split_free_pages() (which is renamed from split_map_pages()).
alloc_contig_range() can be used to allocate folios larger than MAX_PAGE_ORDER, e.g., gigantic hugeTLB folios. So on the free path, free_one_page() needs to handle that by split_large_buddy().
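An illustrative use per this changelog, simplified: one large folio from the contig allocator instead of a run of order-0 pages.

```c
/* nid/nodemask/order come from the caller's context (illustrative): */
struct page *page = alloc_contig_pages(1UL << order,
				       GFP_KERNEL | __GFP_COMP,
				       nid, nodemask);
if (page) {
	/* prepared by prep_new_page() rather than split_free_pages() */
	struct folio *folio = page_folio(page);

	/* ... use the folio ... */

	/* the free path splits > MAX_PAGE_ORDER folios via
	 * split_large_buddy() in free_one_page() */
	folio_put(folio);
}
```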
[[email protected]: fix folio_alloc_gigantic_noprof() WARN expression, per Yu Liao] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yu Zhao <[email protected]> Acked-by: Zi Yan <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muchun Song <[email protected]> Cc: Frank van der Linden <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.11-rc3, v6.11-rc2, v6.11-rc1
78eb4ea2 | 24-Jul-2024 | Joel Granados <[email protected]>
sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified.
This patch has been generated by the following coccinelle script:
```
virtual patch

@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int write, void *buffer, size_t *lenp, loff_t *ppos);

@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }

@r3@
identifier func;
@@

int func(
- struct ctl_table *
+ const struct ctl_table *
  ,int , void *, size_t *, loff_t *);

@r4@
identifier func, ctl;
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int , void *, size_t *, loff_t *);

@r5@
identifier func, write, buffer, lenp, ppos;
@@

int func(
- struct ctl_table *
+ const struct ctl_table *
  ,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration.
Co-developed-by: Thomas Weißschuh <[email protected]>
Signed-off-by: Thomas Weißschuh <[email protected]>
Co-developed-by: Joel Granados <[email protected]>
Signed-off-by: Joel Granados <[email protected]>
Revision tags: v6.10
27e6a24a | 11-Jul-2024 | Paolo Bonzini <[email protected]>
mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE
The flags AS_UNMOVABLE and AS_INACCESSIBLE were both added just for guest_memfd; AS_UNMOVABLE is already in existing versions of Linux, while AS_INACCESSIBLE was acked for inclusion in 6.11.
But really, they are the same thing: only guest_memfd uses them, at least for now, and guest_memfd pages are unmovable because they should not be accessed by the CPU.
So merge them into one; use the AS_INACCESSIBLE name which is more comprehensive. At the same time, this fixes an embarrassing bug where AS_INACCESSIBLE was used as a bit mask, despite it being just a bit index.
The bug was mostly benign, because AS_INACCESSIBLE's bit representation (1010) corresponded to setting AS_UNEVICTABLE (which is already set) and AS_ENOSPC (except no async writes can happen on the guest_memfd). So the AS_INACCESSIBLE flag simply had no effect.
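The bug pattern, shown illustratively: AS_* values are bit indexes for set_bit()/test_bit(), not masks.

```c
/* buggy: ORs in the index value; for index 10 (binary 1010) this sets
 * bits 1 and 3 instead, i.e. unrelated AS_* flags */
mapping->flags |= AS_INACCESSIBLE;

/* correct: set the single bit at that index */
set_bit(AS_INACCESSIBLE, &mapping->flags);
```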
Fixes: 1d23040caa8b ("KVM: guest_memfd: Use AS_INACCESSIBLE when creating guest_memfd inode")
Fixes: c72ceafbd12c ("mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory")
Cc: [email protected]
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Tested-by: Michael Roth <[email protected]>
Reviewed-by: Michael Roth <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Revision tags: v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4
34a023dc | 14-Jun-2024 | Suren Baghdasaryan <[email protected]>
mm: handle profiling for fake memory allocations during compaction
During compaction isolated free pages are marked allocated so that they can be split and/or freed. For that, post_alloc_hook() is used inside split_map_pages() and release_free_list(). split_map_pages() marks free pages allocated, splits the pages and then lets alloc_contig_range_noprof() free those pages. release_free_list() marks free pages and immediately frees them. This usage of post_alloc_hook() affects memory allocation profiling because these functions might not be called from an instrumented allocator; therefore current->alloc_tag is NULL and, when debugging is enabled (CONFIG_MEM_ALLOC_PROFILING_DEBUG=y), that causes warnings. To avoid that, wrap such post_alloc_hook() calls into an instrumented function which acts as an allocator and will be charged for these fake allocations. Note that these allocations are very short lived until they are freed, therefore the associated counters should usually read 0.
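A hedged sketch of the wrapping idea; the helper name here is hypothetical. The _noprof/alloc_hooks() pairing makes each call site an "allocator" with its own tag, so the fake allocation has an owner to charge:

```c
/* hypothetical helper name, for illustration only */
static void compaction_post_alloc_noprof(struct page *page,
					 unsigned int order)
{
	post_alloc_hook(page, order, __GFP_MOVABLE);
}

/* alloc_hooks() installs this call site's tag around the call, so
 * current->alloc_tag is no longer NULL during the fake allocation */
#define compaction_post_alloc(page, order) \
	alloc_hooks(compaction_post_alloc_noprof(page, order))
```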
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Sourav Panda <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2
7998df0b | 28-Mar-2024 | Joel Granados <[email protected]>
memory: remove the now superfluous sentinel element from ctl_table array
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels), which reduces the overall build-time size of the kernel and run-time memory bloat by ~64 bytes per sentinel (further information: https://lore.kernel.org/all/ZO5Yx5JFogGi%[email protected]/).
Remove sentinel from all files under mm/ that register a sysctl table.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joel Granados <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.9-rc1
b951aaff | 21-Mar-2024 | Suren Baghdasaryan <[email protected]>
mm: enable page allocation tagging
Redefine page allocators to record allocation tags upon their invocation. Instrument post_alloc_hook and free_pages_prepare to modify current allocation tag.
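The general shape of the change, shown illustratively: each allocator gains a _noprof variant, and the public name becomes an alloc_hooks() wrapper that records the call site's tag.

```c
/* the raw allocator, untagged: */
struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order);

/* the public name now charges the call site's allocation tag: */
#define alloc_pages(gfp, order) \
	alloc_hooks(alloc_pages_noprof(gfp, order))
```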
[[email protected]: undo _noprof additions in the documentation] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Co-developed-by: Kent Overstreet <[email protected]> Signed-off-by: Kent Overstreet <[email protected]> Reviewed-by: Kees Cook <[email protected]> Tested-by: Kees Cook <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alex Gaynor <[email protected]> Cc: Alice Ryhl <[email protected]> Cc: Andreas Hindborg <[email protected]> Cc: Benno Lossin <[email protected]> Cc: "Björn Roy Baron" <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Gary Guo <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wedson Almeida Filho <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.8, v6.8-rc7, v6.8-rc6
803de900 | 21-Feb-2024 | Vlastimil Babka <[email protected]>
mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL allocations
Sven reports an infinite loop in __alloc_pages_slowpath() for costly order __GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such combination can happen in a suspend/resume context where a GFP_KERNEL allocation can have __GFP_IO masked out via gfp_allowed_mask.
Quoting Sven:
1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER) with __GFP_RETRY_MAYFAIL set.
2. page alloc's __alloc_pages_slowpath tries to get a page from the freelist. This fails because there is nothing free of that costly order.
3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim, which bails out because a zone is ready to be compacted; it pretends to have made a single page of progress.
4. page alloc tries to compact, but this always bails out early because __GFP_IO is not set (it's not passed by the snd allocator, and even if it were, we are suspending so the __GFP_IO flag would be cleared anyway).
5. page alloc believes reclaim progress was made (because of the pretense in item 3) and so it checks whether it should retry compaction. The compaction retry logic thinks it should try again, because: a) reclaim is needed because of the early bail-out in item 4 b) a zonelist is suitable for compaction
6. goto 2. indefinite stall.
(end quote)
The immediate root cause is confusing the COMPACT_SKIPPED returned from __alloc_pages_direct_compact() (step 4) due to lack of __GFP_IO to be indicating a lack of order-0 pages, and in step 5 evaluating that in should_compact_retry() as a reason to retry, before incrementing and limiting the number of retries. There are however other places that wrongly assume that compaction can happen while we lack __GFP_IO.
To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO evaluation and switch the open-coded test in try_to_compact_pages() to use it.
Also use the new helper in:

- compaction_ready(), which will make reclaim not bail out in step 3, so there's at least one attempt to actually reclaim, even if chances are small for a costly order
- in_reclaim_compaction(), which will make should_continue_reclaim() return false so we don't over-reclaim unnecessarily
- __alloc_pages_slowpath(), to set a local variable can_compact, which is then used to avoid retrying reclaim/compaction for costly allocations (step 5) if we can't compact, and also to skip the early compaction attempt that we do in some cases
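The helper as described; this matches the shape of the fix, but treat it as a sketch rather than the verbatim diff:

```c
/* compaction requires __GFP_IO: abstract the open-coded test from
 * try_to_compact_pages() so all callers agree */
static inline bool gfp_compaction_allowed(gfp_t gfp_mask)
{
	return IS_ENABLED(CONFIG_COMPACTION) && (gfp_mask & __GFP_IO);
}
```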
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 3250845d0526 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"")
Signed-off-by: Vlastimil Babka <[email protected]>
Reported-by: Sven van Ashbrook <[email protected]>
Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzPUVOZF%[email protected]/
Tested-by: Karthikeyan Ramasubramanian <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Curtis Malainey <[email protected]>
Cc: Jaroslav Kysela <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
73318e2c | 20-Feb-2024 | Zi Yan <[email protected]>
mm/compaction: optimize >0 order folio compaction with free page split.
During migration in a memory compaction, free pages are placed in an array of page lists based on their order. But the desired free page order (i.e., the order of a source page) might not always be present, thus leading to migration failures and premature compaction termination. Split a high-order free page when the source migration page has a lower order to increase the migration success rate.
Note: merging free pages when a migration fails and a lower order free page is returned via compaction_free() is possible, but it would be too much work. Since the free pages are not buddy pages, it is hard to identify them using the existing PFN-based page merging algorithm.
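A hedged sketch of the split step, assuming the per-order free list array (cc->freepages[]) described in the companion patch below; names and shape are paraphrased, not the exact diff:

```c
/* hypothetical helper shape: fetch a free page of "order", splitting a
 * larger one buddy-style when none of that order is queued */
static struct page *take_isolated_free_page(struct compact_control *cc,
					    int order)
{
	int cur;

	for (cur = order; cur <= MAX_PAGE_ORDER; cur++) {
		struct page *page =
			list_first_entry_or_null(&cc->freepages[cur],
						 struct page, lru);
		if (!page)
			continue;
		list_del(&page->lru);
		while (cur > order) {		/* split down to "order" */
			cur--;
			list_add(&page[1 << cur].lru, &cc->freepages[cur]);
			set_page_private(&page[1 << cur], cur); /* stash order */
		}
		return page;
	}
	return NULL;	/* nothing large enough: migration sees -ENOMEM */
}
```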
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Baolin Wang <[email protected]>
Tested-by: Yu Zhao <[email protected]>
Cc: Adam Manzanares <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kemeng Shi <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yin Fengwei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
733aea0b | 20-Feb-2024 | Zi Yan <[email protected]>
mm/compaction: add support for >0 order folio memory compaction.
Before the last commit, memory compaction only migrated order-0 folios and skipped >0 order folios. The last commit splits all >0 order folios during compaction. This commit migrates >0 order folios during compaction by keeping isolated free pages at their original size, without splitting them into order-0 pages, and using them directly during the migration process.
What is different from the prior implementation:
1. All isolated free pages are kept in a NR_PAGE_ORDERS array of page lists, where each page list stores free pages of the same order.
2. The free pages are neither post_alloc_hook() processed nor buddy pages, although their orders are stored in the first page's private field like buddy pages.
3. During migration, at new page allocation time (i.e., in compaction_alloc()), free pages are then processed by post_alloc_hook(). When migration fails and a new page is returned (i.e., in compaction_free()), free pages are restored by reversing the post_alloc_hook() operations using the newly added free_pages_prepare_fpi_none().
Step 3 is done for a latter optimization that splitting and/or merging free pages during compaction becomes easier.
Note: without splitting free pages, compaction can end prematurely due to migration will return -ENOMEM even if there is free pages. This happens when no order-0 free page exist and compaction_alloc() return NULL.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Baolin Wang <[email protected]>
Tested-by: Yu Zhao <[email protected]>
Cc: Adam Manzanares <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kemeng Shi <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yin Fengwei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
ee6f62fd | 20-Feb-2024 | Zi Yan <[email protected]>
mm/compaction: enable compacting >0 order folios.
migrate_pages() supports >0 order folio migration, and during compaction, even if compaction_alloc() cannot provide >0 order free pages, migrate_pages() can split the source page and try to migrate the base pages from the split. This can be a baseline and starting point for adding support for compacting >0 order folios.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Suggested-by: Huang Ying <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Baolin Wang <[email protected]>
Tested-by: Yu Zhao <[email protected]>
Cc: Adam Manzanares <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kemeng Shi <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yin Fengwei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.8-rc5, v6.8-rc4
f6f3f275 | 08-Feb-2024 | Kefeng Wang <[email protected]>
mm: compaction: early termination in compact_nodes()
No need to continue trying to compact memory if a fatal signal is pending; allow earlier loop termination in compact_nodes().
The existing fatal_signal_pending() check does make compact_zone() break out of the while loop, but it still enters the next zone/next nid, and some unnecessary functions (e.g., lru_add_drain) are called. There was no observable benefit from the new test; it was just found from code inspection when refactoring compact_node().
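A simplified sketch of the earlier termination; the compact_node() signature is abbreviated here, not the exact post-refactor one:

```c
static int compact_nodes(void)
{
	int nid;

	for_each_online_node(nid) {
		/* stop before touching the next node at all, instead of
		 * relying on compact_zone() to notice the signal later */
		if (fatal_signal_pending(current))
			return -EINTR;
		compact_node(nid);
	}
	return 0;
}
```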
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.8-rc3, v6.8-rc2
1883e8ac | 22-Jan-2024 | Baolin Wang <[email protected]>
mm: compaction: limit the suitable target page order to be less than cc->order
It cannot improve fragmentation if we isolate target free pages exceeding cc->order, especially when cc->order is less than pageblock_order. For example, suppose pageblock_order is MAX_ORDER (size 4M) and cc->order is the 2M THP size; we should not isolate other 2M free pages to be the migration target, as that cannot improve fragmentation.
Moreover this is also applicable for large folio compaction.
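A minimal sketch of the added constraint in the migration-target check, simplified from this changelog:

```c
/* In suitable_migration_target() (sketch): a free page at or above
 * cc->order is already the thing compaction is trying to produce, so
 * don't consume it as a migration target */
if (cc->order != -1 && order >= (unsigned int)cc->order)
	return false;
```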
Link: https://lkml.kernel.org/r/afcd9377351c259df7a25a388a4a0d5862b986f4.1705928395.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
3e40b3f4 | 08-Feb-2024 | Kefeng Wang <[email protected]>
mm: compaction: refactor compact_node()
Refactor compact_node() to handle both proactive and synchronous memory compaction, which cleans up the code a bit.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
ab755bf4 | 20-Feb-2024 | Baolin Wang <[email protected]>
mm: compaction: update the cc->nr_migratepages when allocating or freeing the freepages
Currently we use the 'cc->nr_freepages >= cc->nr_migratepages' comparison to ensure that enough freepages are isolated in isolate_freepages(); however, compaction_alloc() just decreases cc->nr_freepages without updating cc->nr_migratepages, which wastes CPU cycles and causes too many freepages to be isolated.
So we should also update the cc->nr_migratepages when allocating or freeing the freepages to avoid isolating excess freepages. And I can see fewer free pages are scanned and isolated when running thpcompact on my Arm64 server:
                                         k6.7           k6.7_patched
Ops Compaction pages isolated    120692036.00          118160797.00
Ops Compaction migrate scanned   131210329.00          154093268.00
Ops Compaction free scanned     1090587971.00         1080632536.00
Ops Compact scan efficiency             12.03                 14.26
Moreover, I did not see an obvious latency improvement; this is likely because isolating freepages is not the bottleneck in the thpcompact test case.
                              k6.7                  k6.7_patched
Amean fault-both-1         1089.76 (   0.00%)     1080.16 *   0.88%*
Amean fault-both-3         1616.48 (   0.00%)     1636.65 *  -1.25%*
Amean fault-both-5         2266.66 (   0.00%)     2219.20 *   2.09%*
Amean fault-both-7         2909.84 (   0.00%)     2801.90 *   3.71%*
Amean fault-both-12        4861.26 (   0.00%)     4733.25 *   2.63%*
Amean fault-both-18        7351.11 (   0.00%)     6950.51 *   5.45%*
Amean fault-both-24        9059.30 (   0.00%)     9159.99 *  -1.11%*
Amean fault-both-30       10685.68 (   0.00%)    11399.02 *  -6.68%*
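A sketch of the accounting fix, simplified from this changelog: compaction_alloc() now also debits cc->nr_migratepages when it hands out a free page, and compaction_free() credits it back, so the "enough freepages?" test in isolate_freepages() stops isolating as soon as the remaining migration work is covered:

```c
/* in compaction_alloc() (sketch): */
cc->nr_freepages -= 1 << order;
cc->nr_migratepages -= 1 << order;	/* this migration is now funded */

/* in compaction_free() (sketch): the page came back, undo both */
cc->nr_freepages += 1 << order;
cc->nr_migratepages += 1 << order;
```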
Link: https://lkml.kernel.org/r/6440493f18da82298152b6305d6b41c2962a3ce6.1708409245.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>