Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7

# b487a2da | 13-Mar-2025 | Kairui Song <[email protected]>
mm, swap: simplify folio swap allocation
With the slot cache gone, clean up the allocation helpers even more. folio_alloc_swap() will be the only entry point for allocating and adding a folio to the swap cache (except for suspend), making it the opposite of folio_free_swap().
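As a rough sketch, the resulting pairing looks like this (signatures simplified from the description above, not quoted from the patch):

  /* The single entry: allocate swap and add the folio to the swap cache. */
  int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);

  /* Its inverse: drop the swap entries if the folio is the only user. */
  bool folio_free_swap(struct folio *folio);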
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Baoquan He <[email protected]> Cc: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# 0ff67f99 | 13-Mar-2025 | Kairui Song <[email protected]>

mm, swap: remove swap slot cache
The slot cache is no longer needed now; remove it and all related code.
- vm-scalability with: `usemem --init-time -O -y -x -R -31 1G`, 12G memory cgroup using simulated pmem as SWAP (32G pmem, 32 CPUs), 16 test runs for each case, measuring the total throughput:
                       Before (KB/s) (stdev)     After (KB/s) (stdev)
Random (4K):           424907.60 (24410.78)      414745.92 (34554.78)
Random (64K):          163308.82 (11635.72)      167314.50 (18434.99)
Sequential (4K, !-R):  6150056.79 (103205.90)    6321469.06 (115878.16)
The performance changes are below noise level.
- Building the Linux kernel with make -j96, using 4K folios with a 1.5G memory cgroup limit and 64K folios with a 2G memory cgroup limit, on top of tmpfs, 12 test runs, measuring the system time:
                  Before (s) (stdev)    After (s) (stdev)
make -j96 (4K):   6445.69 (61.95)       6408.80 (69.46)
make -j96 (64K):  6841.71 (409.04)      6437.99 (435.55)
Similar to the above; the 64K mTHP case showed a slight improvement.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Reviewed-by: Baoquan He <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# 1b7e9002 | 13-Mar-2025 | Kairui Song <[email protected]>

mm, swap: use percpu cluster as allocation fast path
The current allocation workflow first traverses the plist with a global lock held; after choosing a device, it uses the percpu cluster on that swap device. This commit moves the percpu cluster variable out of being tied to individual swap devices, making it a global percpu variable that is used directly for allocation as a fast path.
The global percpu cluster variable will never point to a HDD device, and allocations on a HDD device are still globally serialized.
This improves allocator performance and prepares for the removal of the slot cache in later commits. There shouldn't be much observable behavior change, except for one thing: this changes how swap device allocation rotation works.
Currently, each allocation rotates the plist, and because of the slot cache (one order 0 allocation usually returns 64 entries), swap devices of the same priority are rotated for every 64 order 0 entries consumed. High order allocations are different: they bypass the slot cache, so the swap device is rotated for every 16K, 32K, or up to 2M allocated.
The rotation rule was never clearly defined or documented; it was changed several times without being mentioned.
After this commit, and once the slot cache is gone in later commits, swap device rotation will happen for every consumed cluster. Ideally, non-HDD devices will be rotated once 2M of space has been consumed for each order. Fragmented clusters will rotate the device faster, which seems OK. HDD devices are rotated for every allocation regardless of the allocation order, which should also be OK and is trivial.
This commit also slightly changes allocation behaviour for the slot cache. The newly added cluster allocation fast path may allocate entries from a different device into the slot cache; this is not observable from user space, impacts performance only very slightly, and the slot cache is removed in the next commit anyway, so this can be ignored.
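A minimal sketch of the fast-path idea, assuming illustrative names (percpu_swap_cluster, swap_alloc_fast, alloc_swap_scan_cluster) rather than the exact symbols in the patch:

  /* One global percpu cache of the last used device and cluster offsets. */
  struct percpu_swap_cluster {
          struct swap_info_struct *si;            /* never an HDD device */
          unsigned long offset[SWAP_NR_ORDERS];   /* next slot per order */
          local_lock_t lock;
  };

  static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  /* Fast path: retry the cluster this CPU used last, skipping the plist. */
  static bool swap_alloc_fast(swp_entry_t *entry, int order)
  {
          struct percpu_swap_cluster *pcp;
          bool found = false;

          local_lock(&percpu_swap_cluster.lock);
          pcp = this_cpu_ptr(&percpu_swap_cluster);
          if (pcp->si && pcp->offset[order])
                  found = alloc_swap_scan_cluster(pcp->si, entry, order,
                                                  pcp->offset[order]);
          local_unlock(&percpu_swap_cluster.lock);
          return found;   /* on failure, fall back to the plist slow path */
  }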
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Baoquan He <[email protected]> Cc: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

Revision tags: v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2

# c523aa89 | 05-Feb-2025 | Baoquan He <[email protected]>

mm/swap: rename swap_swapcount() to swap_entry_swapped()
The new function name reflects the real behaviour of the function more clearly and accurately, and the renaming avoids confusion between swap_swapcount() and swp_swapcount().
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Cc: Chris Li <[email protected]> Cc: Kairui Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# 0eb7d2c3 | 05-Feb-2025 | Baoquan He <[email protected]>

mm/swap: remove SWAP_FLAG_PRIO_SHIFT
It doesn't make sense to have a shift value of zero. Remove it to avoid confusion.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Cc: Chris Li <[email protected]> Cc: Kairui Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# c25465eb | 10-Feb-2025 | David Hildenbrand <[email protected]>

mm: use single SWP_DEVICE_EXCLUSIVE entry type
There is no need for the distinction anymore; let's merge the readable and writable device-exclusive entries into a single device-exclusive entry type.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Simona Vetter <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Tested-by: Alistair Popple <[email protected]> Cc: Alex Shi <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Dave Airlie <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Karol Herbst <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Lyude <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yanteng Si <[email protected]> Cc: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

Revision tags: v6.14-rc1

# 89ce924f | 24-Jan-2025 | Johannes Weiner <[email protected]>

mm: memcontrol: move memsw charge callbacks to v1
The interweaving of two entirely different swap accounting strategies has been one of the more confusing parts of the memcg code. Split out the v1 code to clarify the implementation and a handful of callsites, and to avoid building the v1 bits when !CONFIG_MEMCG_V1.
   text    data    bss     dec     hex  filename
  39253    6446   4160   49859    c2c3  mm/memcontrol.o.old
  38877    6382   4160   49419    c10b  mm/memcontrol.o
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Balbir Singh <[email protected]> Acked-by: Shakeel Butt <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

Revision tags: v6.13, v6.13-rc7

# 538d5baa | 11-Jan-2025 | Kaixiong Yu <[email protected]>

mm: vmscan: move vmscan sysctls to mm/vmscan.c
As part of the kernel/sysctl.c cleanup, this moves vm_swappiness and zone_reclaim_mode to mm/vmscan.c, and also moves some external variable declarations and function declarations from include/linux/swap.h into mm/internal.h.
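As a sketch of what such a move looks like with the generic sysctl API (the table contents here are illustrative, not quoted from the patch):

  static struct ctl_table vmscan_sysctl_table[] = {
          {
                  .procname     = "swappiness",
                  .data         = &vm_swappiness,
                  .maxlen       = sizeof(vm_swappiness),
                  .mode         = 0644,
                  .proc_handler = proc_dointvec_minmax,
                  .extra1       = SYSCTL_ZERO,
                  .extra2       = SYSCTL_TWO_HUNDRED,
          },
  };

  static int __init vmscan_sysctl_init(void)
  {
          /* Registers under /proc/sys/vm/, next to the other vm knobs. */
          register_sysctl_init("vm", vmscan_sysctl_table);
          return 0;
  }
  late_initcall(vmscan_sysctl_init);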
Signed-off-by: Kaixiong Yu <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Joel Granados <[email protected]>

Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1

# 1c7b17cf | 19-Nov-2024 | liuye <[email protected]>

mm/vmscan: fix hard LOCKUP in function isolate_lru_folios
This fixes the following hard lockup in isolate_lru_folios() during memory reclaim. If the LRU mostly contains ineligible folios, this may trigger the watchdog.
watchdog: Watchdog detected hard LOCKUP on cpu 173
RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0
Call Trace:
  _raw_spin_lock_irqsave+0x31/0x40
  folio_lruvec_lock_irqsave+0x5f/0x90
  folio_batch_move_lru+0x91/0x150
  lru_add_drain_per_cpu+0x1c/0x40
  process_one_work+0x17d/0x350
  worker_thread+0x27b/0x3a0
  kthread+0xe8/0x120
  ret_from_fork+0x34/0x50
  ret_from_fork_asm+0x1b/0x30
lruvec->lru_lock owner:
PID: 2865  TASK: ffff888139214d40  CPU: 40  COMMAND: "kswapd0"
 #0 [fffffe0000945e60] crash_nmi_callback at ffffffffa567a555
 #1 [fffffe0000945e68] nmi_handle at ffffffffa563b171
 #2 [fffffe0000945eb0] default_do_nmi at ffffffffa6575920
 #3 [fffffe0000945ed0] exc_nmi at ffffffffa6575af4
 #4 [fffffe0000945ef0] end_repeat_nmi at ffffffffa6601dde
    [exception RIP: isolate_lru_folios+403]
    RIP: ffffffffa597df53  RSP: ffffc90006fb7c28  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffffc90006fb7c60  RCX: ffffea04a2196f88
    RDX: ffffc90006fb7c60  RSI: ffffc90006fb7c60  RDI: ffffea04a2197048
    RBP: ffff88812cbd3010  R8:  ffffea04a2197008  R9:  0000000000000001
    R10: 0000000000000000  R11: 0000000000000001  R12: ffffea04a2197008
    R13: ffffea04a2197048  R14: ffffc90006fb7de8  R15: 0000000003e3e937
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    <NMI exception stack>
 #5 [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
 #6 [ffffc90006fb7cf8] shrink_active_list at ffffffffa597f788
 #7 [ffffc90006fb7da8] balance_pgdat at ffffffffa5986db0
 #8 [ffffc90006fb7ec0] kswapd at ffffffffa5987354
 #9 [ffffc90006fb7ef8] kthread at ffffffffa5748238
crash>
Scenario: user processes request a large amount of memory and keep the pages active. Then a module continuously requests memory from the ZONE_DMA32 area. Memory reclaim is triggered because the ZONE_DMA32 watermark is reached. However, the pages in the LRU (active_anon) list are mostly from the ZONE_NORMAL area.
Reproduce:

Terminal 1: continuously increase active (anon) pages:
  mkdir /tmp/memory
  mount -t tmpfs -o size=1024000M tmpfs /tmp/memory
  dd if=/dev/zero of=/tmp/memory/block bs=4M
  tail /tmp/memory/block
Terminal 2: vmstat -a 1, active will increase:
procs ---memory--- ---swap-- ---io---- -system-- ---cpu--- ...
 r b swpd free       inact    active    si so bi bo
 1 0 0    1445623076 45898836  83646008  0  0  0
 1 0 0    1445623076 43450228  86094616  0  0  0
 1 0 0    1445623076 41003480  88541364  0  0  0
 1 0 0    1445623076 38557088  90987756  0  0  0
 1 0 0    1445623076 36109688  93435156  0  0  0
 1 0 0    1445619552 33663256  95881632  0  0  0
 1 0 0    1445619804 31217140  98327792  0  0  0
 1 0 0    1445619804 28769988 100774944  0  0  0
 1 0 0    1445619804 26322348 103222584  0  0  0
 1 0 0    1445619804 23875592 105669340  0  0  0
cat /proc/meminfo | head   (Active(anon) increases)
MemTotal:       1579941036 kB
MemFree:        1445618500 kB
MemAvailable:   1453013224 kB
Buffers:              6516 kB
Cached:          128653956 kB
SwapCached:              0 kB
Active:          118110812 kB
Inactive:         11436620 kB
Active(anon):    115345744 kB
Inactive(anon):     945292 kB
When Active(anon) reaches 115345744 kB, inserting the module triggers the ZONE_DMA32 watermark.
perf record -e vmscan:mm_vmscan_lru_isolate -aR
perf script
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=2 nr_skipped=2 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=0 nr_skipped=0 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=28835844 nr_skipped=28835844 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=28835844 nr_skipped=28835844 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=29 nr_skipped=29 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=0 nr_skipped=0 nr_taken=0 lru=active_anon
See nr_scanned=28835844: 28835844 * 4K = 115343376 KB, approximately equal to 115345744 kB.
If Active(anon) is increased to 1000G and inserting the module then triggers the ZONE_DMA32 watermark, a hard lockup will occur.
On my device, nr_scanned = 0000000003e3e937 when the hard lockup occurred. Converted to a memory size: 0x0000000003e3e937 * 4KB = 261072092 KB.
[ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
ffffc90006fb7c30: 0000000000000020 0000000000000000
ffffc90006fb7c40: ffffc90006fb7d40 ffff88812cbd3000
ffffc90006fb7c50: ffffc90006fb7d30 0000000106fb7de8
ffffc90006fb7c60: ffffea04a2197008 ffffea0006ed4a48
ffffc90006fb7c70: 0000000000000000 0000000000000000
ffffc90006fb7c80: 0000000000000000 0000000000000000
ffffc90006fb7c90: 0000000000000000 0000000000000000
ffffc90006fb7ca0: 0000000000000000 0000000003e3e937
ffffc90006fb7cb0: 0000000000000000 0000000000000000
ffffc90006fb7cc0: 8d7c0b56b7874b00 ffff88812cbd3000
About the Fixes tag: why did it take eight years to be discovered?
The problem requires the following conditions to occur:
1. The device memory should be large enough.
2. Pages in the LRU (active_anon) list are mostly from the ZONE_NORMAL area.
3. The memory in ZONE_DMA32 needs to reach the watermark.
If the memory is not large enough, or if the usage of the ZONE_DMA32 area is reasonably designed, this problem is hard to detect.
Note: the problem is most likely to occur with ZONE_DMA32 and ZONE_NORMAL, but other suitable scenarios may also trigger it.
Link: https://lkml.kernel.org/r/[email protected] Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis") Signed-off-by: liuye <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# bae8a4ef | 13-Jan-2025 | Kairui Song <[email protected]>

mm, swap: use a global swap cluster for non-rotation devices
Non-rotational devices (SSD / ZRAM) can tolerate fragmentation, so the goal of the SWAP allocator is to avoid contention for clusters. It uses a per-CPU cluster design, and each CPU will use a different cluster as much as possible.
However, HDDs are very sensitive to fragmentation; contention is trivial in comparison. Therefore, we use one global cluster instead. This ensures that each order will be written to the same cluster as much as possible, which helps make the I/O more continuous.
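A fragment sketch of the device-type split described above (field names like global_cluster are illustrative):

  /* SSD/ZRAM: per-CPU clusters; rotational media: one shared cluster,
   * serialized by its own lock so writes stay continuous per order. */
  if (si->flags & SWP_SOLIDSTATE) {
          cluster = this_cpu_ptr(si->percpu_cluster);
  } else {
          spin_lock(&si->global_cluster_lock);
          cluster = si->global_cluster;
  }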
This ensures that the performance of the cluster allocator is as good as that of the old allocator. Tests after this commit compared to those before this series, using make -j32 with tinyconfig, a 1G memcg limit, and HDD swap:
Before this series:
114.44user 29.11system 39:42.90elapsed 6%CPU (0avgtext+0avgdata 157284maxresident)k
2901232inputs+0outputs (238877major+4227640minor)pagefaults

After this commit:
113.90user 23.81system 38:11.77elapsed 6%CPU (0avgtext+0avgdata 157260maxresident)k
2548728inputs+0outputs (235471major+4238110minor)pagefaults
[[email protected]: check kmalloc() return in setup_clusters] Link: https://lkml.kernel.org/r/CAMgjq7Au+o04ckHyT=iU-wVx9az=t0B-ZiC5E0bDqNrAtNOP-g@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Suggested-by: Chris Li <[email protected]> Cc: Baoquan He <[email protected]> Cc: Barry Song <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickens <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# e3ae2dec | 13-Jan-2025 | Kairui Song <[email protected]>

mm, swap: simplify percpu cluster updating
Instead of using a returning argument, we can simply store the next cluster offset in the fixed percpu location, which reduces stack usage and simplifies the function:
Object size:
./scripts/bloat-o-meter mm/swapfile.o mm/swapfile.o.new
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-271 (-271)
Function                  old     new   delta
get_swap_pages           2847    2733    -114
alloc_swap_scan_cluster   894     737    -157
Total: Before=30833, After=30562, chg -0.88%
Stack usage:
Before: swapfile.c:1190:5: get_swap_pages  240  static
After:  swapfile.c:1185:5: get_swap_pages  216  static
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Cc: Baoquan He <[email protected]> Cc: Barry Song <[email protected]> Cc: Chis Li <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickens <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# 3b644773 | 13-Jan-2025 | Kairui Song <[email protected]>

mm, swap: reduce contention on device lock
Currently, swap locking is mainly composed of two locks: the cluster lock (ci->lock) and the device lock (si->lock).
The cluster lock is much more fine-grained, so it is best to use ci->lock instead of si->lock as much as possible.
We have cleaned up other hard dependencies on si->lock. Following the new cluster allocator design, most operations don't need to touch si->lock at all. In practice, we only need to take si->lock when moving clusters between lists.
To achieve this, this commit reworks the locking pattern of all si->lock and ci->lock users, eliminates all usage of ci->lock inside si->lock, and introduces a new design to avoid touching si->lock unless needed.
For minimal contention and easier understanding of the system, two ideas are introduced with the corresponding helpers: isolation and relocation.
- Clusters will be `isolated` from the list when iterating the list to search for an allocatable cluster.
This ensures other CPUs won't easily walk into the same cluster, and it releases si->lock after acquiring ci->lock, providing the only place that handles the inversion of the two locks, avoiding contention.
Iterating the cluster list almost always moves the cluster (free -> nonfull, nonfull -> frag, frag -> frag tail), but it doesn't know where the cluster should be moved to until scanning is done. So keeping the cluster off-list is a good option with low overhead.
The off-list time window of a cluster is also minimal. In the worst case, one CPU will return the cluster after scanning the 512 entries on it, a case we previously handled by busy waiting on a spin lock.
This is done with the new helper `isolate_lock_cluster`.
- Clusters will be `relocated` after allocation or freeing, according to their usage count and status.
Allocations no longer hold si->lock now, and may drop ci->lock for reclaim, so the cluster could be moved to any location while no lock is held. Besides, isolation clears all flags when it takes the cluster off the list (the flags must be in sync with the list status, so cluster users don't need to touch si->lock for checking its list status). So the cluster has to be relocated to the right list according to its usage after allocation or freeing.
Relocation is optional; if the cluster flags indicate it's already on the right list, it will skip touching the list or si->lock.
This is done with `relocate_cluster` after allocation or with `[partial_]free_cluster` after freeing.
This handles the usage of all kinds of clusters in a clean way.
Scanning and allocation by iterating the cluster list is handled by "isolate - <scan / allocate> - relocate".
Scanning and allocation of per-CPU clusters will only involve "<scan / allocate> - relocate", as it knows which cluster to lock and use.
Freeing will only involve "relocate".
Each CPU will keep using its per-CPU cluster until the 512 entries are all consumed. Freeing also has to free 512 entries to trigger cluster movement in the best case, so si->lock is rarely touched.
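A simplified sketch of the isolate_lock_cluster() idea mentioned above (types and states reduced to the essentials; the real helper handles more cases):

  static struct swap_cluster_info *isolate_lock_cluster(
                  struct swap_info_struct *si, struct list_head *list)
  {
          struct swap_cluster_info *ci;

          spin_lock(&si->lock);
          ci = list_first_entry_or_null(list, struct swap_cluster_info, list);
          if (ci) {
                  list_del(&ci->list);
                  ci->flags = CLUSTER_FLAG_NONE;  /* off-list while scanning */
          }
          spin_unlock(&si->lock);

          /* ci->lock is taken only after si->lock is dropped, so this is
           * the single place handling the ordering of the two locks. */
          if (ci)
                  spin_lock(&ci->lock);
          return ci;
  }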
Testing with building the Linux kernel with defconfig showed a huge improvement:

time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C:
Before: Sys time: 73578.30, Real time: 864.05
After:  Sys time: 36227.49, Real time: 476.66 (-50.7% sys time, -44.8% real time)

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C (avg of 4 test runs):
Before: Sys time: 74044.85, Real time: 846.51
        hugepages-64kB/stats/swpout: 1735216
        hugepages-64kB/stats/swpout_fallback: 430333
After:  Sys time: 44160.56, Real time: 532.07 (-40.4% sys time, -37.1% real time)
        hugepages-64kB/stats/swpout: 1786288
        hugepages-64kB/stats/swpout_fallback: 243384

time make -j32 / 512M memcg, 4K pages, 5G ZRAM, on AMD 7K62:
Before: Sys time: 8098.21, Real time: 401.3
After:  Sys time: 6265.02, Real time: 349.83 (-22.6% sys time, -12.8% real time)

The allocation success rate also slightly improved, as we sanitized the usage of clusters with the newly defined helpers; previously, dropping si->lock or ci->lock during a scan could shuffle the cluster order.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Suggested-by: Chris Li <[email protected]> Cc: Baoquan He <[email protected]> Cc: Barry Song <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickens <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# 3494d184 | 13-Jan-2025 | Kairui Song <[email protected]>

mm, swap: use an enum to define all cluster flags and wrap flags changes
Currently, we are only using flags to indicate which list the cluster is on. Using one bit for each list type can be wasteful: as the number of list types grows, we will consume too many bits. Additionally, the current mixed usage of '&' and '==' is a bit confusing.
Make it clean by using an enum to define all possible cluster statuses. Only an off-list cluster will have the NONE (0) flag. And use a wrapper to annotate and sanitize all flag settings and list movements.
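A sketch of the enum and the wrapper described above (the exact set of states is illustrative):

  enum swap_cluster_flags {
          CLUSTER_FLAG_NONE = 0,  /* only for temporarily off-list clusters */
          CLUSTER_FLAG_FREE,
          CLUSTER_FLAG_NONFULL,
          CLUSTER_FLAG_FRAG,
          CLUSTER_FLAG_FULL,
          CLUSTER_FLAG_MAX,
  };

  /* All list movements and flag changes go through one wrapper. */
  static void cluster_move(struct swap_info_struct *si,
                           struct swap_cluster_info *ci,
                           struct list_head *list,
                           enum swap_cluster_flags new_flags)
  {
          VM_WARN_ON(ci->flags == new_flags);     /* sanitize transitions */
          if (ci->flags == CLUSTER_FLAG_NONE)     /* was off-list */
                  list_add_tail(&ci->list, list);
          else
                  list_move_tail(&ci->list, list);
          ci->flags = new_flags;
  }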
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Suggested-by: Chris Li <[email protected]> Cc: Baoquan He <[email protected]> Cc: Barry Song <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickens <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# 9a0ddeb7 | 13-Jan-2025 | Kairui Song <[email protected]>

mm, swap: hold a reference during scan and cleanup flag usage
The flag SWP_SCANNING was used as an indicator of whether a device is being scanned for allocation, and to prevent swapoff. Combined with SWP_WRITEOK, they work as a set of barriers for a clean swapoff:
1. Swapoff clears SWP_WRITEOK; allocation requests will see ~SWP_WRITEOK and abort, as this is serialized by si->lock.
2. Swapoff unuses all allocated entries.
3. Swapoff waits for the SWP_SCANNING flag to be cleared, so ongoing allocations will stop, preventing UAF.
4. Now swapoff can free everything safely.
This makes the allocation path have a hard dependency on si->lock: an allocation always has to acquire si->lock first, to set SWP_SCANNING and check SWP_WRITEOK.
This commit removes this flag and just uses the existing per-CPU refcount instead to prevent UAF in step 3. It serves this usage well without any dependency on si->lock and scales very well too: just hold a reference during the whole scan and allocation process. Swapoff will kill and wait for the counter.
And to prevent any allocation from happening after step 1, so that the unuse in step 2 can ensure all slots are free, swapoff will acquire the ci->lock of each cluster one by one, ensuring all allocations see ~SWP_WRITEOK and abort.
This way, these dependencies on si->lock are gone. And it is worth noting that we can't kill the refcount as the first step of swapoff, as the unuse process has to acquire the refcount.
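A sketch of the scheme, assuming the per-CPU refcount is the existing si->users percpu_ref and using illustrative helper names:

  static bool get_swap_device_info(struct swap_info_struct *si)
  {
          /* Fails once swapoff has killed the ref, so no new scans start. */
          return percpu_ref_tryget_live(&si->users);
  }

  static unsigned long allocate_with_ref(struct swap_info_struct *si, int order)
  {
          unsigned long offset = 0;

          if (!get_swap_device_info(si))
                  return 0;       /* device is being swapped off */
          offset = cluster_alloc_swap_entry(si, order);   /* whole scan is safe */
          percpu_ref_put(&si->users);     /* swapoff may now make progress */
          return offset;
  }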
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Cc: Baoquan He <[email protected]> Cc: Barry Song <[email protected]> Cc: Chis Li <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickens <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# b228386c | 13-Jan-2025 | Kairui Song <[email protected]>

mm, swap: clean up plist removal and adding
When the swap device is full (inuse_pages == pages), it should be removed from the allocation available plist. If any slot is freed, the swap device should be added back to the plist. Additionally, during swapon or swapoff, the swap device is forcefully added or removed.
Currently, the condition (inuse_pages == pages) is checked after every counter update, and the device is then removed from or added to the plist accordingly. This is serialized by si->lock.
This commit decouples it from the protection of si->lock and reworks plist removal and adding, making it possible to get rid of the hard dependency on si->lock in the allocation path in later commits.
To achieve this, simply using another lock is not an optimal approach, as the overhead is observable for a hot counter and may cause complex locking issues. Thus, this commit makes it a lock-free atomic operation by embedding the plist state into the second highest bit of the atomic counter.
Simply making the counter atomic will not work: if the counter update and the plist status check are not performed atomically, we may miss an addition or removal. With the embedded info, we can update the counter and check the plist status with a single atomic operation and avoid any extra overhead:
If the counter is full (inuse_pages == pages) and the off-list bit is unset, we attempt to remove the device from the plist. If the counter is not full (inuse_pages != pages) and the off-list bit is set, we attempt to add it to the plist. Removal, addition, and bit updates are serialized with a lock, which is a cold path. Ordinary counter updates remain lock-free.
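A sketch of the embedded-bit idea with illustrative names: the counter's second highest bit doubles as the off-list flag, so a single atomic op yields both the count and the list status (this assumes inuse_pages has become an atomic_long_t):

  #define SWAP_USAGE_OFFLIST_BIT  (1UL << (BITS_PER_LONG - 2))
  #define SWAP_USAGE_COUNT_MASK   (SWAP_USAGE_OFFLIST_BIT - 1)

  /* Returns true if the caller should take the (cold path) list lock
   * and try to remove the device from the plist. */
  static bool swap_usage_add(struct swap_info_struct *si, long nr)
  {
          long val = atomic_long_add_return(nr, &si->inuse_pages);

          /* Just became full while still on the plist? */
          return unlikely((val & SWAP_USAGE_COUNT_MASK) == si->pages &&
                          !(val & SWAP_USAGE_OFFLIST_BIT));
  }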
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Cc: Baoquan He <[email protected]> Cc: Barry Song <[email protected]> Cc: Chis Li <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickens <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# 27701521 | 13-Jan-2025 | Kairui Song <[email protected]>

mm, swap: clean up device availability check
Remove highest_bit and lowest_bit. After the HDD allocation path has been removed, the only purpose of these two fields is to determine whether the device is full, which can instead be determined by checking inuse_pages.
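A tiny sketch of that check (helper name illustrative; assumes the counter may be read without si->lock):

  static inline bool swap_device_full(struct swap_info_struct *si)
  {
          return READ_ONCE(si->inuse_pages) == si->pages;
  }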
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Reviewed-by: Baoquan He <[email protected]> Cc: Barry Song <[email protected]> Cc: Chis Li <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickens <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# 72774330 | 13-Jan-2025 | Kairui Song <[email protected]>

mm, swap: remove old allocation path for HDD
We are currently using different swap allocation algorithms for HDD and non-HDD devices. This leads to the existence of a different set of locks, and the code path is heavily bloated, causing difficulties for further optimization and maintenance.
This commit removes all HDD swap allocation and related dead code, and uses the cluster allocation algorithm instead.
The performance may drop temporarily, but this should be negligible: the main advantage of the legacy HDD allocation algorithm is that it tends to use continuous slots, but the swap device gets fragmented quickly anyway, and the attempt to use continuous slots will fail easily.
This commit also enables mTHP swap on HDD, which is expected to be beneficial, and following commits will adapt and optimize the cluster allocator for HDD.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Suggested-by: Chris Li <[email protected]> Suggested-by: "Huang, Ying" <[email protected]> Reviewed-by: Baoquan He <[email protected]> Cc: Barry Song <[email protected]> Cc: Hugh Dickens <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5

# 5168a68e | 22-Oct-2024 | Kairui Song <[email protected]>

mm, swap: avoid over reclaim of full clusters
When running low on usable slots, the cluster allocator will aggressively try to reclaim full clusters to recover HAS_CACHE slots. This guarantees that as long as there are any usable slots, HAS_CACHE or not, the swap device will be usable and the workload won't go OOM early.
Before the cluster allocator, the swap allocator failed easily when the device was filled up with reclaimable HAS_CACHE slots, which can be reproduced with the following simple program:
  #include <stdio.h>
  #include <string.h>
  #include <linux/mman.h>
  #include <sys/mman.h>

  #define SIZE (8192UL * 1024UL * 1024UL)

  int main(int argc, char **argv)
  {
          long tmp = 0;
          char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          memset(p, 0, SIZE);
          madvise(p, SIZE, MADV_PAGEOUT);
          for (unsigned long i = 0; i < SIZE; ++i)
                  tmp += p[i];    /* fault everything back in */
          getchar();              /* pause */
          return 0;
  }
Set up an 8G non-ramdisk swap: the first run of the program will swap out 8G of RAM successfully. But when the same program is run again while the first run is paused, the second run can't swap out all 8G of memory, as half of the swap device is now pinned by HAS_CACHE. There was a random scan in the old allocator that might reclaim part of the HAS_CACHE slots by luck, but it's unreliable.
The new allocator added reclaim of full clusters when the device is low on usable slots. But when multiple CPUs see that the device is low on usable slots at the same time, they run into a thundering herd problem.
This is an observable problem on large machines with massively parallel workloads, as full cluster reclaim is slower on a large swap device, and a higher number of CPUs makes things worse.
Testing used a 128G ZRAM on a 48c96t system. When the swap device is very close to full (e.g. 124G / 128G), building the Linux kernel with make -j96 in a 1G memory cgroup will hang (not a softlockup, though), spinning in full cluster reclaim for about ~5 min before going OOM.
To solve this, split the full reclaim into two parts:
- Instead of doing a synchronous, aggressive reclaim when the device is low, do only one aggressive reclaim when the device is strictly full, from a kworker. This still ensures that in the worst case the device won't be unusable because of HAS_CACHE slots.
- To avoid allocations (especially higher order ones) suffering from HAS_CACHE filling up clusters while the kworker is not responsive enough, do one synchronous scan every time the free list is drained, and only scan one cluster. This is similar to the random reclaim done before; it keeps the full clusters rotated and has minimal latency. This should provide a fair reclaim strategy suitable for most workloads. A fragment sketch of the two parts follows.
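Names here (reclaim_work, swap_reclaim_full_clusters) are illustrative, not quoted from the patch:

  /* Part 1: the aggressive reclaim runs at most once, asynchronously. */
  static void swap_reclaim_work(struct work_struct *work)
  {
          struct swap_info_struct *si =
                  container_of(work, struct swap_info_struct, reclaim_work);

          swap_reclaim_full_clusters(si, true);   /* force: scan everything */
  }

  /* Allocation path: kick the worker only when strictly full ... */
  if (si->inuse_pages == si->pages)
          schedule_work(&si->reclaim_work);

  /* Part 2: ... and scan just one full cluster when the free list drains. */
  if (list_empty(&si->free_clusters))
          swap_reclaim_full_clusters(si, false);  /* one cluster, low latency */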
Link: https://lkml.kernel.org/r/[email protected] Fixes: 2cacbdfdee65 ("mm: swap: add a adaptive full cluster cache reclaim") Signed-off-by: Kairui Song <[email protected]> Cc: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

Revision tags: v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5

# 0ca0c24e | 23-Aug-2024 | Usama Arif <[email protected]>

mm: store zero pages to be swapped out in a bitmap
Patch series "mm: store zero pages to be swapped out in a bitmap", v8.
As shown in the patch series that introduced the zswap same-filled optimization [1], 10-20% of the pages stored in zswap are same-filled. This is also observed across Meta's server fleet. By using VM counters in swap_writepage (not included in this patch series), it was found that less than 1% of the same-filled pages to be swapped out are non-zero pages.
For conventional swap setup (without zswap), rather than reading/writing these pages to flash resulting in increased I/O and flash wear, a bitmap can be used to mark these pages as zero at write time, and the pages can be filled at read time if the bit corresponding to the page is set.
When using zswap with swap, this also means that a zswap_entry does not need to be allocated for zero filled pages resulting in memory savings which would offset the memory used for the bitmap.
A similar attempt was made earlier in [2], where zswap would only track zero-filled pages instead of same-filled ones. This patch series adds the zero-filled page optimization to swap (hence it can be used even if zswap is disabled) and removes the same-filled code from zswap (as only 1% of the same-filled pages are non-zero), simplifying the code.
[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/ [2] https://lore.kernel.org/lkml/[email protected]/
This patch (of 2):
Approximately 10-20% of pages to be swapped out are zero pages [1]. Rather than reading/writing these pages to flash, resulting in increased I/O and flash wear, a bitmap can be used to mark these pages as zero at write time, and the pages can be filled at read time if the bit corresponding to the page is set. With this patch, NVMe writes in Meta's server fleet decreased by almost 10% with a conventional swap setup (zswap disabled).
[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
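A simplified sketch of the write and read paths (single-page case only; names such as sis->zeromap follow the description above and are illustrative):

  static bool folio_is_zero_filled(struct folio *folio)
  {
          unsigned long *data = kmap_local_folio(folio, 0);
          bool zero = !memchr_inv(data, 0, PAGE_SIZE);

          kunmap_local(data);
          return zero;
  }

  /* swap_writepage(): mark the slot instead of doing the I/O */
  if (folio_is_zero_filled(folio))
          set_bit(swp_offset(entry), sis->zeromap);
  else
          clear_bit(swp_offset(entry), sis->zeromap);

  /* swap_read_folio(): fill from the bitmap and skip the device */
  if (test_bit(swp_offset(entry), sis->zeromap)) {
          folio_zero_range(folio, 0, folio_size(folio));
          folio_mark_uptodate(folio);
  }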
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Usama Arif <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Usama Arif <[email protected]> Cc: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

Revision tags: v6.11-rc4

# 65018076 | 12-Aug-2024 | Baolin Wang <[email protected]>

mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting
Patch series "support large folio swap-out and swap-in for shmem", v5.
Shmem will support large folio allocation [1] [2] to get better performance; however, memory reclaim still splits the precious large folios when trying to swap out shmem, which may lead to memory fragmentation and fails to take advantage of large folios for shmem.
Moreover, the swap code already supports swapping out large folios without splitting them, and the large folio swap-in [3] series is queued in the mm-unstable branch. Hence this patch set also supports large folio swap-out and swap-in for shmem.
This patch (of 9):
To support shmem large folio swap operations, add a new parameter to swap_shmem_alloc() that allows batch SWAP_MAP_SHMEM flag setting for shmem swap entries.
While we are at it, use folio_nr_pages() to get the number of pages of the folio, as a preparation.
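A sketch of the extended interface (simplified; the batching is what the new nr parameter enables):

  /* Set SWAP_MAP_SHMEM on nr contiguous swap entries in one go. */
  void swap_shmem_alloc(swp_entry_t entry, int nr);

  /* Caller, preparing a (possibly large) shmem folio for swap-out: */
  swap_shmem_alloc(folio->swap, folio_nr_pages(folio));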
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/99f64115d04b285e009580eb177352c57119ffd0.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Reviewed-by: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: Daniel Gomez <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Lance Yang <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Pankaj Raghav <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

Revision tags: v6.11-rc3, v6.11-rc2

# 2cacbdfd | 31-Jul-2024 | Kairui Song <[email protected]>

mm: swap: add a adaptive full cluster cache reclaim
Link all full clusters on one full list, and reclaim from it when the allocation has run out of all usable clusters.
There are many reasons a folio can end up in the swap cache while having no swap count reference. So the best way to search for such slots is still to iterate the swap clusters.
With the list as an LRU, iterating from the oldest cluster and keeping the clusters rotating is a very doable and clean way to free up potentially unused clusters.
On any allocation failure, try to reclaim and rotate only one cluster. This is adaptive for high order allocations, which can tolerate fallback. It avoids latency and gives the full cluster list a fair chance to get reclaimed. It relieves the usage stress for fallback order 0 allocations and subsequent high order allocations.
If the swap device is getting very full, reclaim more aggressively to ensure no OOM will happen. This ensures order 0 heavy workloads won't go OOM, as order 0 allocations won't fail if any cluster still has any space.
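A fragment sketch of the full-list handling (field and helper names illustrative):

  /* When a cluster fills up, append it: the list stays LRU-ordered. */
  list_add_tail(&ci->list, &si->full_clusters);

  /* On allocation failure: reclaim from the oldest cluster, then rotate it. */
  ci = list_first_entry_or_null(&si->full_clusters,
                                struct swap_cluster_info, list);
  if (ci) {
          list_move_tail(&ci->list, &si->full_clusters);
          try_to_reclaim_cluster(si, ci);         /* illustrative helper */
  }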
[[email protected]: fix discard of full cluster] Link: https://lkml.kernel.org/r/CAMgjq7CWwK75_2Zi5P40K08pk9iqOcuWKL6khu=x4Yg_nXaQag@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Reported-by: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Kairui Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# 661383c6 | 31-Jul-2024 | Kairui Song <[email protected]>

mm: swap: reclaim the cached parts that got scanned
This commit implements reclaim during scan for the cluster allocator.
Cluster scanning was unable to reuse SWAP_HAS_CACHE slots, which could result in a low allocation success rate or early OOM.
So, to ensure the maximum allocation success rate, integrate reclaiming with scanning: if a range of suitable swap slots is found but is fragmented due to HAS_CACHE, just try to reclaim the slots.
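A fragment sketch of the scan-side reclaim (helper and flag names follow mm/swapfile.c conventions but are illustrative here):

  /* While scanning a cluster: a HAS_CACHE-only slot blocks the range,
   * so drop ci->lock and try to reclaim it instead of giving up. */
  if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
          spin_unlock(&ci->lock);
          nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
          spin_lock(&ci->lock);
  }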
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Reported-by: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Ryan Roberts <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# 477cb7ba | 31-Jul-2024 | Kairui Song <[email protected]>

mm: swap: add a fragment cluster list
The swap cluster allocator now arranges the clusters in LRU style, so the "cold" clusters staying at the head of the nonfull lists are the ones that were used for allocation a long time ago and are still partially occupied. So if the allocator can't find enough contiguous slots to satisfy a high order allocation, it's unlikely that slots will be freed on them to satisfy the allocation, at least within a short period.
As a result, nonfull cluster scanning will waste time repeatedly scanning the unusable head of the list.
Also, multiple CPUs could contend on the same head cluster of the nonfull list. Unlike free clusters, which are removed from the list when any CPU starts using them, a nonfull cluster stays at the head.
So introduce a new list, the frag list: all scanned nonfull clusters will be moved to this list, both to avoid repeated scanning and to avoid contention.
The frag list is still used as a fallback for allocations, so if one CPU fails to allocate one order of slots, it can still steal other CPUs' clusters. And order 0 will favor the fragmented clusters, to better protect nonfull clusters.
If any slots on a fragment cluster are freed, move the cluster back to the nonfull list, indicating it is worth another scan. Compared to scanning upon freeing a slot, this keeps the scanning lazy and saves some CPU if there are still other clusters to use.
It may seem unnecessary to keep the fragmented clusters on a list at all if they can't be used for a specific order of allocation. But this will start to make sense once reclaim during scanning is ready. A fragment sketch of the list movements follows.
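Field names here (frag_clusters, nonfull_clusters) are illustrative:

  /* A scanned nonfull cluster moves to the per-order frag list. */
  list_move_tail(&ci->list, &si->frag_clusters[ci->order]);

  /* Freeing a slot on a fragment cluster makes it worth rescanning. */
  if (ci->flags & CLUSTER_FLAG_FRAG)
          list_move_tail(&ci->list, &si->nonfull_clusters[ci->order]);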
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Reported-by: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Ryan Roberts <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# d07a46a4 | 31-Jul-2024 | Chris Li <[email protected]>

mm: swap: mTHP allocate swap entries from nonfull list
Track the nonfull clusters as well as the empty clusters on lists. Each order has one nonfull cluster list.
The cluster will remember which order it was used for during new cluster allocation.
When the cluster has a free entry, add it to the nonfull[order] list. When the free cluster list is empty, also allocate from the nonfull list of that order.
This improves the mTHP swap allocation success rate.
There are limitations if the distribution of the numbers of different mTHP orders changes a lot, e.g. a lot of nonfull clusters are assigned to order A, while later there are a lot of order B allocations and very few order A allocations. Currently, the clusters used by order A will not be reused by order B unless the cluster is 100% empty.
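A fragment sketch of the per-order bookkeeping (illustrative):

  /* In swap_info_struct: one nonfull list per supported order. */
  struct list_head nonfull_clusters[SWAP_NR_ORDERS];

  /* Allocating a given order: fall back from free to nonfull[order]. */
  if (list_empty(&si->free_clusters) &&
      !list_empty(&si->nonfull_clusters[order]))
          ci = list_first_entry(&si->nonfull_clusters[order],
                                struct swap_cluster_info, list);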
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chris Li <[email protected]> Reported-by: Barry Song <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kairui Song <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Ryan Roberts <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

# 73ed0baa | 31-Jul-2024 | Chris Li <[email protected]>

mm: swap: swap cluster switch to double link list
Patch series "mm: swap: mTHP swap allocator base on swap cluster order", v5.
This is the short term solution "swap cluster order" listed on slide 8 of my "Swap Abstraction" discussion at the recent LSF/MM conference.
When commit 845982eb264bc "mm: swap: allow storage of all mTHP orders" was introduced, it only allocated the mTHP swap entries from the new empty cluster list. This has a fragmentation issue reported by Barry.
https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
The reason is that all the empty clusters have been exhausted while there are plenty of free swap entries in clusters that are not 100% free.
Remember the swap allocation order in the cluster. Keep track of the per-order nonfull cluster list for later allocation.
This series gives the swap SSD allocation a new code path, separate from the HDD allocation. The new allocator uses the cluster lists only and no longer globally scans swap_map[] without a lock.
This streamlines the swap allocation for SSD. The code matches the execution flow much better.
User impact: for users that allocate and free mixed order mTHP swap entries, it greatly improves the success rate of mTHP swap allocation after the initial phase.
It also performs faster when the swapfile is close to full, because the allocator can get a nonfull cluster from a list rather than scanning a lot of swap_map entries.
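A sketch of the structural change (field layout illustrative):

  struct swap_cluster_info {
          unsigned int count:12;  /* allocated entries in this cluster */
          unsigned int order:4;   /* mTHP order this cluster serves */
          unsigned int flags:4;   /* free / nonfull / ... list state */
          struct list_head list;  /* replaces the old packed-index linkage */
  };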
With Barry's mthp test program V2:
Without:
$ ./thp_swap_allocator_test -a
Iteration 1: swpout inc: 32, swpout fallback inc: 192, Fallback percentage: 85.71%
Iteration 2: swpout inc: 0, swpout fallback inc: 231, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
...
Iteration 98: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 215, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%

$ ./thp_swap_allocator_test -a -s
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%

$ ./thp_swap_allocator_test -s
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%

$ ./thp_swap_allocator_test
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
With:
# with all 0.00% filtered out
$ ./thp_swap_allocator_test -a | grep -v "0.00%"
$ # all results are 0.00%

$ ./thp_swap_allocator_test -a -s | grep -v "0.00%"
Iteration 14: swpout inc: 223, swpout fallback inc: 3, Fallback percentage: 1.33%
Iteration 19: swpout inc: 219, swpout fallback inc: 7, Fallback percentage: 3.10%
Iteration 28: swpout inc: 225, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 29: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 34: swpout inc: 220, swpout fallback inc: 8, Fallback percentage: 3.51%
Iteration 35: swpout inc: 222, swpout fallback inc: 11, Fallback percentage: 4.72%
Iteration 38: swpout inc: 217, swpout fallback inc: 4, Fallback percentage: 1.81%
Iteration 40: swpout inc: 222, swpout fallback inc: 6, Fallback percentage: 2.63%
Iteration 42: swpout inc: 221, swpout fallback inc: 2, Fallback percentage: 0.90%
Iteration 43: swpout inc: 215, swpout fallback inc: 7, Fallback percentage: 3.15%
Iteration 47: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 49: swpout inc: 217, swpout fallback inc: 1, Fallback percentage: 0.46%
Iteration 52: swpout inc: 221, swpout fallback inc: 8, Fallback percentage: 3.49%
Iteration 56: swpout inc: 224, swpout fallback inc: 4, Fallback percentage: 1.75%
Iteration 58: swpout inc: 214, swpout fallback inc: 5, Fallback percentage: 2.28%
Iteration 62: swpout inc: 220, swpout fallback inc: 3, Fallback percentage: 1.35%
Iteration 64: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 67: swpout inc: 221, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 75: swpout inc: 220, swpout fallback inc: 9, Fallback percentage: 3.93%
Iteration 82: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 86: swpout inc: 211, swpout fallback inc: 12, Fallback percentage: 5.38%
Iteration 89: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 93: swpout inc: 220, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 94: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 96: swpout inc: 221, swpout fallback inc: 6, Fallback percentage: 2.64%
Iteration 98: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 99: swpout inc: 227, swpout fallback inc: 3, Fallback percentage: 1.30%

$ ./thp_swap_allocator_test
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%

$ ./thp_swap_allocator_test -s
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 97, swpout fallback inc: 135, Fallback percentage: 58.19%
Iteration 3: swpout inc: 42, swpout fallback inc: 192, Fallback percentage: 82.05%
Iteration 4: swpout inc: 19, swpout fallback inc: 214, Fallback percentage: 91.85%
Iteration 5: swpout inc: 12, swpout fallback inc: 213, Fallback percentage: 94.67%
Iteration 6: swpout inc: 11, swpout fallback inc: 217, Fallback percentage: 95.18%
Iteration 7: swpout inc: 9, swpout fallback inc: 214, Fallback percentage: 95.96%
Iteration 8: swpout inc: 8, swpout fallback inc: 213, Fallback percentage: 96.38%
Iteration 9: swpout inc: 2, swpout fallback inc: 223, Fallback percentage: 99.11%
Iteration 10: swpout inc: 2, swpout fallback inc: 228, Fallback percentage: 99.13%
Iteration 11: swpout inc: 4, swpout fallback inc: 214, Fallback percentage: 98.17%
Iteration 12: swpout inc: 5, swpout fallback inc: 226, Fallback percentage: 97.84%
Iteration 13: swpout inc: 3, swpout fallback inc: 212, Fallback percentage: 98.60%
Iteration 14: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
Iteration 15: swpout inc: 3, swpout fallback inc: 222, Fallback percentage: 98.67%
Iteration 16: swpout inc: 4, swpout fallback inc: 223, Fallback percentage: 98.24%
=========
Kernel compile under tmpfs with cgroup memory.max = 470M.
12 core 24 hyperthreading, 32 jobs. 10 runs per group.
SSD swap 10 runs average, 20G swap partition:
With:
user 2929.064
system 1479.381: 1376.89 1398.22 1444.64 1477.39 1479.04 1497.27 1504.47 1531.4 1532.92 1551.57
real 1441.324
Without:
user 2910.872
system 1482.732: 1440.01 1451.4 1462.01 1467.47 1467.51 1469.3 1470.19 1496.32 1544.1 1559.01
real 1580.822
Two zram swap devices: zram0 3.0G, zram1 20G.
The idea is to force zram0 almost full, then overflow to zram1:
With:
user 4320.301
system 4272.403: 4236.24 4262.81 4264.75 4269.13 4269.44 4273.06 4279.85 4285.98 4289.64 4293.13
real 431.759
Without:
user 4301.393
system 4387.672: 4374.47 4378.3 4380.95 4382.84 4383.06 4388.05 4389.76 4397.16 4398.23 4403.9
real 433.979
------ more test results from Kairui ----------
Test: build the Linux kernel with a 4G ZRAM device and a 1G memory.max limit, on top of shmem:
System info: 32 Core AMD Zen2, 64G total memory.
Test 3 times using only 4K pages:
=================================
With:
-----
1838.74user 2411.21system 2:37.86elapsed 2692%CPU (0avgtext+0avgdata 847060maxresident)k
1839.86user 2465.77system 2:39.35elapsed 2701%CPU (0avgtext+0avgdata 847060maxresident)k
1840.26user 2454.68system 2:39.43elapsed 2693%CPU (0avgtext+0avgdata 847060maxresident)k
Summary (~4.6% improvement of system time):
User: 1839.62 System: 2443.89: 2465.77 2454.68 2411.21 Real: 158.88
Without:
--------
1837.99user 2575.95system 2:43.09elapsed 2706%CPU (0avgtext+0avgdata 846520maxresident)k
1838.32user 2555.15system 2:42.52elapsed 2709%CPU (0avgtext+0avgdata 846520maxresident)k
1843.02user 2561.55system 2:43.35elapsed 2702%CPU (0avgtext+0avgdata 846520maxresident)k
Summary:
User: 1839.78 System: 2564.22: 2575.95 2555.15 2561.55 Real: 162.99
Test 5 times with all mTHP sizes enabled:
=========================================
With:
-----
1796.44user 2937.33system 2:59.09elapsed 2643%CPU (0avgtext+0avgdata 846936maxresident)k
1802.55user 3002.32system 2:54.68elapsed 2750%CPU (0avgtext+0avgdata 847072maxresident)k
1806.59user 2986.53system 2:55.17elapsed 2736%CPU (0avgtext+0avgdata 847092maxresident)k
1803.27user 2982.40system 2:54.49elapsed 2742%CPU (0avgtext+0avgdata 846796maxresident)k
1807.43user 3036.08system 2:56.06elapsed 2751%CPU (0avgtext+0avgdata 846488maxresident)k
Summary (~8.4% improvement of system time):
User: 1803.25 System: 2988.93: 2937.33 3002.32 2986.53 2982.40 3036.08 Real: 175.90
mTHP swapout status:
/sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout:347721
/sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout_fallback:3110
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout:3365
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout_fallback:8269
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout:24
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout_fallback:3341
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout:145
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout_fallback:5038
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout:322737
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback:36808
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout:380455
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout_fallback:1010
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout:24973
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout_fallback:13223
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout:197348
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout_fallback:80541
Without:
--------
1794.41user 3151.29system 3:05.97elapsed 2659%CPU (0avgtext+0avgdata 846704maxresident)k
1810.27user 3304.48system 3:05.38elapsed 2759%CPU (0avgtext+0avgdata 846636maxresident)k
1809.84user 3254.85system 3:03.83elapsed 2755%CPU (0avgtext+0avgdata 846952maxresident)k
1813.54user 3259.56system 3:04.28elapsed 2752%CPU (0avgtext+0avgdata 846848maxresident)k
1829.97user 3338.40system 3:07.32elapsed 2759%CPU (0avgtext+0avgdata 847024maxresident)k
Summary:
User: 1811.61 System: 3261.72: 3151.29 3304.48 3254.85 3259.56 3338.40 Real: 185.356
mTHP swapout status:
hugepages-32kB/stats/swpout:35630
hugepages-32kB/stats/swpout_fallback:1809908
hugepages-512kB/stats/swpout:523
hugepages-512kB/stats/swpout_fallback:55235
hugepages-2048kB/stats/swpout:53
hugepages-2048kB/stats/swpout_fallback:17264
hugepages-1024kB/stats/swpout:85
hugepages-1024kB/stats/swpout_fallback:24979
hugepages-64kB/stats/swpout:30117
hugepages-64kB/stats/swpout_fallback:1825399
hugepages-16kB/stats/swpout:42775
hugepages-16kB/stats/swpout_fallback:1951123
hugepages-256kB/stats/swpout:2326
hugepages-256kB/stats/swpout_fallback:170165
hugepages-128kB/stats/swpout:17925
hugepages-128kB/stats/swpout_fallback:1309757
This patch (of 9):
Previously, the swap cluster code used a cluster index as a pointer to construct a custom singly linked list type, "swap_cluster_list". The next-cluster pointer shared storage with cluster->count, which made it impossible to put a non-free cluster on any list.
Change the cluster to use a standard doubly linked list instead. This allows tracking nonfull clusters in a follow-up patch, making it faster to reach a nonfull cluster of the required order; see the sketch below.
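To make the conversion concrete, here is a minimal userspace sketch, not the kernel code itself: it hand-rolls a list_head so it compiles standalone, gives swap_cluster_info a real list node next to its count, and keeps one nonfull list per order as described for the follow-up patch. The names count, order, nonfull_clusters and cluster_entry_freed are illustrative assumptions, not the exact kernel definitions.

#include <stdio.h>
#include <stdint.h>

/* Hand-rolled equivalent of the kernel's struct list_head. */
struct list_head { struct list_head *prev, *next; };

static void INIT_LIST_HEAD(struct list_head *h) { h->prev = h->next = h; }

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

static int list_empty(const struct list_head *h) { return h->next == h; }

/*
 * Old scheme: the next-cluster index and the count shared one 24-bit
 * field, so a cluster could only sit on a list while it held no entries.
 * New scheme: count and the list linkage are separate fields, so a
 * cluster can be put on a list at any fill level.
 */
struct swap_cluster_info {
	uint16_t count;		/* allocated entries in this cluster */
	uint8_t flags;
	uint8_t order;		/* allocation order this cluster serves */
	struct list_head list;	/* free / nonfull list membership */
};

#define SWAP_NR_ORDERS 8
static struct list_head nonfull_clusters[SWAP_NR_ORDERS];

/* When an entry is freed, a partially used cluster becomes trackable. */
static void cluster_entry_freed(struct swap_cluster_info *ci)
{
	ci->count--;
	if (ci->count && list_empty(&ci->list))
		list_add_tail(&ci->list, &nonfull_clusters[ci->order]);
}

int main(void)
{
	struct swap_cluster_info ci = { .count = 512, .order = 0 };

	for (int i = 0; i < SWAP_NR_ORDERS; i++)
		INIT_LIST_HEAD(&nonfull_clusters[i]);
	INIT_LIST_HEAD(&ci.list);

	cluster_entry_freed(&ci);	/* 511 entries left: now nonfull */
	printf("order-0 nonfull list populated: %d\n",
	       !list_empty(&nonfull_clusters[0]));
	return 0;
}

The key point the sketch shows: once count and the linkage no longer share a field, a partially full cluster is representable on a list, which the old encoding could not express.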
Remove the cluster getter/setter helpers for accessing cluster struct members; access the members directly.
The list operations are protected by the swap_info_struct->lock.
Change the cluster code to reference clusters with "struct swap_cluster_info *" rather than by index. This is more consistent with the list manipulation, avoids repeatedly adding an index to cluster_info, and makes the code easier to understand; a sketch of the change follows.
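A hedged sketch of this calling-convention change, with simplified stand-in types and helpers (cluster_is_free, cluster_index, CLUSTER_FLAG_FREE are illustrative, not the actual kernel definitions):

#include <stdbool.h>
#include <stdint.h>

struct swap_cluster_info { uint16_t count; uint8_t flags; };

struct swap_info {
	struct swap_cluster_info *cluster_info;	/* array, one per cluster */
};

#define CLUSTER_FLAG_FREE 1

/* Old style: take an index and re-derive the struct at every access. */
static bool cluster_is_free_idx(struct swap_info *si, unsigned long idx)
{
	return si->cluster_info[idx].flags & CLUSTER_FLAG_FREE;
}

/* New style: take the pointer directly; fits the list manipulation. */
static bool cluster_is_free(struct swap_cluster_info *ci)
{
	return ci->flags & CLUSTER_FLAG_FREE;
}

/* Pointer subtraction keeps the index recoverable where still needed. */
static unsigned long cluster_index(struct swap_info *si,
				   struct swap_cluster_info *ci)
{
	return ci - si->cluster_info;
}

int main(void)
{
	struct swap_cluster_info clusters[4] = {
		[2] = { .flags = CLUSTER_FLAG_FREE },
	};
	struct swap_info si = { .cluster_info = clusters };
	struct swap_cluster_info *ci = &si.cluster_info[2];

	/* Both styles agree; exit 0 on success. */
	return !(cluster_is_free_idx(&si, 2) == cluster_is_free(ci) &&
		 cluster_index(&si, ci) == 2);
}

Nothing is lost by passing pointers: the few places that still need the index can recover it with one subtraction instead of carrying it through every call.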
Remove the flag that marked the cluster's next pointer as NULL; the doubly linked list handles an empty list just fine.
The "swap_cluster_info" struct is two pointer bigger, because 512 swap entries share one swap_cluster_info struct, it has very little impact on the average memory usage per swap entry. For 1TB swapfile, the swap cluster data structure increases from 8MB to 24MB.
Other than the list conversion, there is no real functional change in this patch.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chris Li <[email protected]> Reported-by: Barry Song <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kairui Song <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Ryan Roberts <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|