History log of /linux-6.15/include/linux/swap.h (Results 1 – 25 of 417)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7
# b487a2da 13-Mar-2025 Kairui Song <[email protected]>

mm, swap: simplify folio swap allocation

With slot cache gone, clean up the allocation helpers even more.
folio_alloc_swap will be the only entry for allocation and adding the
folio to swap cache (

mm, swap: simplify folio swap allocation

With slot cache gone, clean up the allocation helpers even more.
folio_alloc_swap will be the only entry for allocation and adding the
folio to swap cache (except suspend), making it opposite of
folio_free_swap.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 0ff67f99 13-Mar-2025 Kairui Song <[email protected]>

mm, swap: remove swap slot cache

Slot cache is no longer needed now, removing it and all related code.

- vm-scalability with: `usemem --init-time -O -y -x -R -31 1G`,
12G memory cgroup using simula

mm, swap: remove swap slot cache

Slot cache is no longer needed now, removing it and all related code.

- vm-scalability with: `usemem --init-time -O -y -x -R -31 1G`,
12G memory cgroup using simulated pmem as SWAP (32G pmem, 32 CPUs),
16 test runs for each case, measuring the total throughput:

Before (KB/s) (stdev) After (KB/s) (stdev)
Random (4K): 424907.60 (24410.78) 414745.92 (34554.78)
Random (64K): 163308.82 (11635.72) 167314.50 (18434.99)
Sequential (4K, !-R): 6150056.79 (103205.90) 6321469.06 (115878.16)

The performance changes are below noise level.

- Build linux kernel with make -j96, using 4K folio with 1.5G memory
cgroup limit and 64K folio with 2G memory cgroup limit, on top of tmpfs,
12 test runs, measuring the system time:

Before (s) (stdev) After (s) (stdev)
make -j96 (4K): 6445.69 (61.95) 6408.80 (69.46)
make -j96 (64K): 6841.71 (409.04) 6437.99 (435.55)

Similar to above, 64k mTHP case showed a slight improvement.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 1b7e9002 13-Mar-2025 Kairui Song <[email protected]>

mm, swap: use percpu cluster as allocation fast path

Current allocation workflow first traverses the plist with a global lock
held, after choosing a device, it uses the percpu cluster on that swap
d

mm, swap: use percpu cluster as allocation fast path

Current allocation workflow first traverses the plist with a global lock
held, after choosing a device, it uses the percpu cluster on that swap
device. This commit moves the percpu cluster variable out of being tied
to individual swap devices, making it a global percpu variable, and will
be used directly for allocation as a fast path.

The global percpu cluster variable will never point to a HDD device, and
allocations on a HDD device are still globally serialized.

This improves the allocator performance and prepares for removal of the
slot cache in later commits. There shouldn't be much observable behavior
change, except one thing: this changes how swap device allocation rotation
works.

Currently, each allocation will rotate the plist, and because of the
existence of slot cache (one order 0 allocation usually returns 64
entries), swap devices of the same priority are rotated for every 64 order
0 entries consumed. High order allocations are different, they will
bypass the slot cache, and so swap device is rotated for every 16K, 32K,
or up to 2M allocation.

The rotation rule was never clearly defined or documented, it was changed
several times without mentioning.

After this commit, and once slot cache is gone in later commits, swap
device rotation will happen for every consumed cluster. Ideally non-HDD
devices will be rotated if 2M space has been consumed for each order.
Fragmented clusters will rotate the device faster, which seems OK. HDD
devices is rotated for every allocation regardless of the allocation
order, which should be OK too and trivial.

This commit also slightly changes allocation behaviour for slot cache.
The new added cluster allocation fast path may allocate entries from
different device to the slot cache, this is not observable from user
space, only impact performance very slightly, and slot cache will be just
gone in next commit, so this can be ignored.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2
# c523aa89 05-Feb-2025 Baoquan He <[email protected]>

mm/swap: rename swap_swapcount() to swap_entry_swapped()

The new function name can reflect the real behaviour of the function more
clearly and more accurately. And the renaming avoids the confusion

mm/swap: rename swap_swapcount() to swap_entry_swapped()

The new function name can reflect the real behaviour of the function more
clearly and more accurately. And the renaming avoids the confusion
between swap_swapcount() and swp_swapcount().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Kairui Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 0eb7d2c3 05-Feb-2025 Baoquan He <[email protected]>

mm/swap: remove SWAP_FLAG_PRIO_SHIFT

It doesn't make sense to have a zero value of shift. Remove it to avoid
confusion.

Link: https://lkml.kernel.org/r/[email protected]
Signed-

mm/swap: remove SWAP_FLAG_PRIO_SHIFT

It doesn't make sense to have a zero value of shift. Remove it to avoid
confusion.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Kairui Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# c25465eb 10-Feb-2025 David Hildenbrand <[email protected]>

mm: use single SWP_DEVICE_EXCLUSIVE entry type

There is no need for the distinction anymore; let's merge the readable and
writable device-exclusive entries into a single device-exclusive entry
type.

mm: use single SWP_DEVICE_EXCLUSIVE entry type

There is no need for the distinction anymore; let's merge the readable and
writable device-exclusive entries into a single device-exclusive entry
type.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Simona Vetter <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Tested-by: Alistair Popple <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Danilo Krummrich <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Karol Herbst <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Lyude <[email protected]>
Cc: "Masami Hiramatsu (Google)" <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yanteng Si <[email protected]>
Cc: Barry Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.14-rc1
# 89ce924f 24-Jan-2025 Johannes Weiner <[email protected]>

mm: memcontrol: move memsw charge callbacks to v1

The interweaving of two entirely different swap accounting strategies has
been one of the more confusing parts of the memcg code. Split out the v1

mm: memcontrol: move memsw charge callbacks to v1

The interweaving of two entirely different swap accounting strategies has
been one of the more confusing parts of the memcg code. Split out the v1
code to clarify the implementation and a handful of callsites, and to
avoid building the v1 bits when !CONFIG_MEMCG_V1.

text data bss dec hex filename
39253 6446 4160 49859 c2c3 mm/memcontrol.o.old
38877 6382 4160 49419 c10b mm/memcontrol.o

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.13, v6.13-rc7
# 538d5baa 11-Jan-2025 Kaixiong Yu <[email protected]>

mm: vmscan: move vmscan sysctls to mm/vmscan.c

This moves vm_swappiness and zone_reclaim_mode to mm/vmscan.c,
as part of the kernel/sysctl.c cleaning, also moves some external
variable declarations

mm: vmscan: move vmscan sysctls to mm/vmscan.c

This moves vm_swappiness and zone_reclaim_mode to mm/vmscan.c,
as part of the kernel/sysctl.c cleaning, also moves some external
variable declarations and function declarations from include/linux/swap.h
into mm/internal.h.

Signed-off-by: Kaixiong Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Joel Granados <[email protected]>

show more ...


Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1
# 1c7b17cf 19-Nov-2024 liuye <[email protected]>

mm/vmscan: fix hard LOCKUP in function isolate_lru_folios

This fixes the following hard lockup in isolate_lru_folios() during memory
reclaim. If the LRU mostly contains ineligible folios this may t

mm/vmscan: fix hard LOCKUP in function isolate_lru_folios

This fixes the following hard lockup in isolate_lru_folios() during memory
reclaim. If the LRU mostly contains ineligible folios this may trigger
watchdog.

watchdog: Watchdog detected hard LOCKUP on cpu 173
RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0
Call Trace:
_raw_spin_lock_irqsave+0x31/0x40
folio_lruvec_lock_irqsave+0x5f/0x90
folio_batch_move_lru+0x91/0x150
lru_add_drain_per_cpu+0x1c/0x40
process_one_work+0x17d/0x350
worker_thread+0x27b/0x3a0
kthread+0xe8/0x120
ret_from_fork+0x34/0x50
ret_from_fork_asm+0x1b/0x30

lruvec->lru_lock owner:

PID: 2865 TASK: ffff888139214d40 CPU: 40 COMMAND: "kswapd0"
#0 [fffffe0000945e60] crash_nmi_callback at ffffffffa567a555
#1 [fffffe0000945e68] nmi_handle at ffffffffa563b171
#2 [fffffe0000945eb0] default_do_nmi at ffffffffa6575920
#3 [fffffe0000945ed0] exc_nmi at ffffffffa6575af4
#4 [fffffe0000945ef0] end_repeat_nmi at ffffffffa6601dde
[exception RIP: isolate_lru_folios+403]
RIP: ffffffffa597df53 RSP: ffffc90006fb7c28 RFLAGS: 00000002
RAX: 0000000000000001 RBX: ffffc90006fb7c60 RCX: ffffea04a2196f88
RDX: ffffc90006fb7c60 RSI: ffffc90006fb7c60 RDI: ffffea04a2197048
RBP: ffff88812cbd3010 R8: ffffea04a2197008 R9: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffffea04a2197008
R13: ffffea04a2197048 R14: ffffc90006fb7de8 R15: 0000000003e3e937
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
<NMI exception stack>
#5 [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
#6 [ffffc90006fb7cf8] shrink_active_list at ffffffffa597f788
#7 [ffffc90006fb7da8] balance_pgdat at ffffffffa5986db0
#8 [ffffc90006fb7ec0] kswapd at ffffffffa5987354
#9 [ffffc90006fb7ef8] kthread at ffffffffa5748238
crash>

Scenario:
User processe are requesting a large amount of memory and keep page active.
Then a module continuously requests memory from ZONE_DMA32 area.
Memory reclaim will be triggered due to ZONE_DMA32 watermark alarm reached.
However pages in the LRU(active_anon) list are mostly from
the ZONE_NORMAL area.

Reproduce:
Terminal 1: Construct to continuously increase pages active(anon).
mkdir /tmp/memory
mount -t tmpfs -o size=1024000M tmpfs /tmp/memory
dd if=/dev/zero of=/tmp/memory/block bs=4M
tail /tmp/memory/block

Terminal 2:
vmstat -a 1
active will increase.
procs ---memory--- ---swap-- ---io---- -system-- ---cpu--- ...
r b swpd free inact active si so bi bo
1 0 0 1445623076 45898836 83646008 0 0 0
1 0 0 1445623076 43450228 86094616 0 0 0
1 0 0 1445623076 41003480 88541364 0 0 0
1 0 0 1445623076 38557088 90987756 0 0 0
1 0 0 1445623076 36109688 93435156 0 0 0
1 0 0 1445619552 33663256 95881632 0 0 0
1 0 0 1445619804 31217140 98327792 0 0 0
1 0 0 1445619804 28769988 100774944 0 0 0
1 0 0 1445619804 26322348 103222584 0 0 0
1 0 0 1445619804 23875592 105669340 0 0 0

cat /proc/meminfo | head
Active(anon) increase.
MemTotal: 1579941036 kB
MemFree: 1445618500 kB
MemAvailable: 1453013224 kB
Buffers: 6516 kB
Cached: 128653956 kB
SwapCached: 0 kB
Active: 118110812 kB
Inactive: 11436620 kB
Active(anon): 115345744 kB
Inactive(anon): 945292 kB

When the Active(anon) is 115345744 kB, insmod module triggers
the ZONE_DMA32 watermark.

perf record -e vmscan:mm_vmscan_lru_isolate -aR
perf script
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=2
nr_skipped=2 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=0
nr_skipped=0 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=28835844
nr_skipped=28835844 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=28835844
nr_skipped=28835844 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=29
nr_skipped=29 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=0
nr_skipped=0 nr_taken=0 lru=active_anon

See nr_scanned=28835844.
28835844 * 4k = 115343376KB approximately equal to 115345744 kB.

If increase Active(anon) to 1000G then insmod module triggers
the ZONE_DMA32 watermark. hard lockup will occur.

In my device nr_scanned = 0000000003e3e937 when hard lockup.
Convert to memory size 0x0000000003e3e937 * 4KB = 261072092 KB.

[ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
ffffc90006fb7c30: 0000000000000020 0000000000000000
ffffc90006fb7c40: ffffc90006fb7d40 ffff88812cbd3000
ffffc90006fb7c50: ffffc90006fb7d30 0000000106fb7de8
ffffc90006fb7c60: ffffea04a2197008 ffffea0006ed4a48
ffffc90006fb7c70: 0000000000000000 0000000000000000
ffffc90006fb7c80: 0000000000000000 0000000000000000
ffffc90006fb7c90: 0000000000000000 0000000000000000
ffffc90006fb7ca0: 0000000000000000 0000000003e3e937
ffffc90006fb7cb0: 0000000000000000 0000000000000000
ffffc90006fb7cc0: 8d7c0b56b7874b00 ffff88812cbd3000

About the Fixes:
Why did it take eight years to be discovered?

The problem requires the following conditions to occur:
1. The device memory should be large enough.
2. Pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area.
3. The memory in ZONE_DMA32 needs to reach the watermark.

If the memory is not large enough, or if the usage design of ZONE_DMA32
area memory is reasonable, this problem is difficult to detect.

notes:
The problem is most likely to occur in ZONE_DMA32 and ZONE_NORMAL,
but other suitable scenarios may also trigger the problem.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis")
Signed-off-by: liuye <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# bae8a4ef 13-Jan-2025 Kairui Song <[email protected]>

mm, swap: use a global swap cluster for non-rotation devices

Non-rotational devices (SSD / ZRAM) can tolerate fragmentation, so the
goal of the SWAP allocator is to avoid contention for clusters. I

mm, swap: use a global swap cluster for non-rotation devices

Non-rotational devices (SSD / ZRAM) can tolerate fragmentation, so the
goal of the SWAP allocator is to avoid contention for clusters. It uses a
per-CPU cluster design, and each CPU will use a different cluster as much
as possible.

However, HDDs are very sensitive to fragmentation, contention is trivial
in comparison. Therefore, we use one global cluster instead. This
ensures that each order will be written to the same cluster as much as
possible, which helps make the I/O more continuous.

This ensures that the performance of the cluster allocator is as good as
that of the old allocator. Tests after this commit compared to those
before this series:

Tested using 'make -j32' with tinyconfig, a 1G memcg limit, and HDD swap:

make -j32 with tinyconfig, using 1G memcg limit and HDD swap:

Before this series:
114.44user 29.11system 39:42.90elapsed 6%CPU (0avgtext+0avgdata 157284maxresident)k
2901232inputs+0outputs (238877major+4227640minor)pagefaults

After this commit:
113.90user 23.81system 38:11.77elapsed 6%CPU (0avgtext+0avgdata 157260maxresident)k
2548728inputs+0outputs (235471major+4238110minor)pagefaults

[[email protected]: check kmalloc() return in setup_clusters]
Link: https://lkml.kernel.org/r/CAMgjq7Au+o04ckHyT=iU-wVx9az=t0B-ZiC5E0bDqNrAtNOP-g@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Suggested-by: Chris Li <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Barry Song <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickens <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# e3ae2dec 13-Jan-2025 Kairui Song <[email protected]>

mm, swap: simplify percpu cluster updating

Instead of using a returning argument, we can simply store the next
cluster offset to the fixed percpu location, which reduce the stack usage
and simplify

mm, swap: simplify percpu cluster updating

Instead of using a returning argument, we can simply store the next
cluster offset to the fixed percpu location, which reduce the stack usage
and simplify the function:

Object size:
./scripts/bloat-o-meter mm/swapfile.o mm/swapfile.o.new
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-271 (-271)
Function old new delta
get_swap_pages 2847 2733 -114
alloc_swap_scan_cluster 894 737 -157
Total: Before=30833, After=30562, chg -0.88%

Stack usage:
Before:
swapfile.c:1190:5:get_swap_pages 240 static

After:
swapfile.c:1185:5:get_swap_pages 216 static

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chis Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickens <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 3b644773 13-Jan-2025 Kairui Song <[email protected]>

mm, swap: reduce contention on device lock

Currently, swap locking is mainly composed of two locks: the cluster lock
(ci->lock) and the device lock (si->lock).

The cluster lock is much more fine-gr

mm, swap: reduce contention on device lock

Currently, swap locking is mainly composed of two locks: the cluster lock
(ci->lock) and the device lock (si->lock).

The cluster lock is much more fine-grained, so it is best to use ci->lock
instead of si->lock as much as possible.

We have cleaned up other hard dependencies on si->lock. Following the new
cluster allocator design, most operations don't need to touch si->lock at
all. In practice, we only need to take si->lock when moving clusters
between lists.

To achieve this, this commit reworks the locking pattern of all si->lock
and ci->lock users, eliminates all usage of ci->lock inside si->lock, and
introduces a new design to avoid touching si->lock unless needed.

For minimal contention and easier understanding of the system, two ideas
are introduced with the corresponding helpers: isolation and relocation.

- Clusters will be `isolated` from the list when iterating the list
to search for an allocatable cluster.

This ensures other CPUs won't walk into the same cluster easily,
and it releases si->lock after acquiring ci->lock, providing the
only place that handles the inversion of two locks, and avoids
contention.

Iterating the cluster list almost always moves the cluster
(free -> nonfull, nonfull -> frag, frag -> frag tail), but it
doesn't know where the cluster should be moved to until scanning
is done. So keeping the cluster off-list is a good option with
low overhead.

The off-list time window of a cluster is also minimal. In the worst
case, one CPU will return the cluster after scanning the 512 entries
on it, which we used to busy wait with a spin lock.

This is done with the new helper `isolate_lock_cluster`.

- Clusters will be `relocated` after allocation or freeing, according
to their usage count and status.

Allocations no longer hold si->lock now, and may drop ci->lock for
reclaim, so the cluster could be moved to any location while no lock
is held. Besides, isolation clears all flags when it takes the
cluster off the list (the flags must be in sync with the list status,
so cluster users don't need to touch si->lock for checking its list
status). So the cluster has to be relocated to the right list
according to its usage after allocation or freeing.

Relocation is optional, if the cluster flags indicate it's already
on the right list, it will skip touching the list or si->lock.

This is done with `relocate_cluster` after allocation or with
`[partial_]free_cluster` after freeing.

This handled usage of all kinds of clusters in a clean way.

Scanning and allocation by iterating the cluster list is handled by
"isolate - <scan / allocate> - relocate".

Scanning and allocation of per-CPU clusters will only involve
"<scan / allocate> - relocate", as it knows which cluster to lock
and use.

Freeing will only involve "relocate".

Each CPU will keep using its per-CPU cluster until the 512 entries
are all consumed. Freeing also has to free 512 entries to trigger
cluster movement in the best case, so si->lock is rarely touched.

Testing with building the Linux kernel with defconfig showed huge
improvement:

tiem make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C:
Before:
Sys time: 73578.30, Real time: 864.05
After: (-50.7% sys time, -44.8% real time)
Sys time: 36227.49, Real time: 476.66

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C:
(avg of 4 test run)
Before:
Sys time: 74044.85, Real time: 846.51
hugepages-64kB/stats/swpout: 1735216
hugepages-64kB/stats/swpout_fallback: 430333

After: (-40.4% sys time, -37.1% real time)
Sys time: 44160.56, Real time: 532.07
hugepages-64kB/stats/swpout: 1786288
hugepages-64kB/stats/swpout_fallback: 243384

time make -j32 / 512M memcg, 4K pages, 5G ZRAM, on AMD 7K62:
Before:
Sys time: 8098.21, Real time: 401.3
After: (-22.6% sys time, -12.8% real time )
Sys time: 6265.02, Real time: 349.83

The allocation success rate also slightly improved as we sanitized the
usage of clusters with new defined helpers, previously dropping
si->lock or ci->lock during scan will cause cluster order shuffle.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Suggested-by: Chris Li <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Barry Song <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickens <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 3494d184 13-Jan-2025 Kairui Song <[email protected]>

mm, swap: use an enum to define all cluster flags and wrap flags changes

Currently, we are only using flags to indicate which list the cluster is
on. Using one bit for each list type might be a was

mm, swap: use an enum to define all cluster flags and wrap flags changes

Currently, we are only using flags to indicate which list the cluster is
on. Using one bit for each list type might be a waste, as the list type
grows, we will consume too many bits. Additionally, the current mixed
usage of '&' and '==' is a bit confusing.

Make it clean by using an enum to define all possible cluster statuses.
Only an off-list cluster will have the NONE (0) flag. And use a wrapper
to annotate and sanitize all flag settings and list movements.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Suggested-by: Chris Li <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Barry Song <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickens <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 9a0ddeb7 13-Jan-2025 Kairui Song <[email protected]>

mm, swap: hold a reference during scan and cleanup flag usage

The flag SWP_SCANNING was used as an indicator of whether a device is
being scanned for allocation, and prevents swapoff. Combined with

mm, swap: hold a reference during scan and cleanup flag usage

The flag SWP_SCANNING was used as an indicator of whether a device is
being scanned for allocation, and prevents swapoff. Combined with
SWP_WRITEOK, they work as a set of barriers for a clean swapoff:

1. Swapoff clears SWP_WRITEOK, allocation requests will see
~SWP_WRITEOK and abort as it's serialized by si->lock.
2. Swapoff unuses all allocated entries.
3. Swapoff waits for SWP_SCANNING flag to be cleared, so ongoing
allocations will stop, preventing UAF.
4. Now swapoff can free everything safely.

This will make the allocation path have a hard dependency on si->lock.
Allocation always have to acquire si->lock first for setting SWP_SCANNING
and checking SWP_WRITEOK.

This commit removes this flag, and just uses the existing per-CPU refcount
instead to prevent UAF in step 3, which serves well for such usage without
dependency on si->lock, and scales very well too. Just hold a reference
during the whole scan and allocation process. Swapoff will kill and wait
for the counter.

And for preventing any allocation from happening after step 1 so the unuse
in step 2 can ensure all slots are free, swapoff will acquire the ci->lock
of each cluster one by one to ensure all allocations see ~SWP_WRITEOK and
abort.

This way these dependences on si->lock are gone. And worth noting we
can't kill the refcount as the first step for swapoff as the unuse process
have to acquire the refcount.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chis Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickens <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# b228386c 13-Jan-2025 Kairui Song <[email protected]>

mm, swap: clean up plist removal and adding

When the swap device is full (inuse_pages == pages), it should be removed
from the allocation available plist. If any slot is freed, the swap
device shou

mm, swap: clean up plist removal and adding

When the swap device is full (inuse_pages == pages), it should be removed
from the allocation available plist. If any slot is freed, the swap
device should be added back to the plist. Additionally, during swapon or
swapoff, the swap device is forcefully added or removed.

Currently, the condition (inuse_pages == pages) is checked after every
counter update, then remove or add the device accordingly. This is
serialized by si->lock.

This commit decouples it from the protection of si->lock and reworked
plist removal and adding, making it possible to get rid of the hard
dependency on si->lock in allocation path in later commits.

To achieve this, simply using another lock is not an optimal approach, as
the overhead is observable for a hot counter, and may cause complex
locking issues. Thus, this commit manages to make it a lock-free atomic
operation, by embedding the plist state into the second highest bit of the
atomic counter.

Simply making the counter an atomic will not work, if the update and plist
status check are not performed atomically, we may miss an addition or
removal. With the embedded info we can update the counter and check the
plist status with single atomic operations, and avoid any extra overheads:

If the counter is full (inuse_pages == pages) and the off-list bit is
unset, we attempt to remove it from the plist. If the counter is not full
(inuse_pages != pages) and the off-list bit is set, we attempt to add it
to the plist. Removing, adding and bit update is serialized with a lock,
which is a cold path. Ordinary counter updates will be lock-free.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chis Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickens <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 27701521 13-Jan-2025 Kairui Song <[email protected]>

mm, swap: clean up device availability check

Remove highest_bit and lowest_bit. After the HDD allocation path has been
removed, the only purpose of these two fields is to determine whether the
devi

mm, swap: clean up device availability check

Remove highest_bit and lowest_bit. After the HDD allocation path has been
removed, the only purpose of these two fields is to determine whether the
device is full or not, which can instead be determined by checking the
inuse_pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chis Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickens <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 72774330 13-Jan-2025 Kairui Song <[email protected]>

mm, swap: remove old allocation path for HDD

We are currently using different swap allocation algorithm for HDD and
non-HDD. This leads to the existence of a different set of locks, and the
code pa

mm, swap: remove old allocation path for HDD

We are currently using different swap allocation algorithm for HDD and
non-HDD. This leads to the existence of a different set of locks, and the
code path is heavily bloated, causing difficulties for further
optimization and maintenance.

This commit removes all HDD swap allocation and related dead code, and
uses the cluster allocation algorithm instead.

The performance may drop temporarily, but this should be negligible: The
main advantage of the legacy HDD allocation algorithm is that it tends to
use continuous slots, but swap device gets fragmented quickly anyway, and
the attempt to use continuous slots will fail easily.

This commit also enables mTHP swap on HDD, which is expected to be
beneficial, and following commits will adapt and optimize the cluster
allocator for HDD.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Suggested-by: Chris Li <[email protected]>
Suggested-by: "Huang, Ying" <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Hugh Dickens <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5
# 5168a68e 22-Oct-2024 Kairui Song <[email protected]>

mm, swap: avoid over reclaim of full clusters

When running low on usable slots, cluster allocator will try to reclaim
the full clusters aggressively to reclaim HAS_CACHE slots. This
guarantees that

mm, swap: avoid over reclaim of full clusters

When running low on usable slots, cluster allocator will try to reclaim
the full clusters aggressively to reclaim HAS_CACHE slots. This
guarantees that as long as there are any usable slots, HAS_CACHE or not,
the swap device will be usable and workload won't go OOM early.

Before the cluster allocator, swap allocator fails easily if device is
filled up with reclaimable HAS_CACHE slots. Which can be easily
reproduced with following simple program:

#include <stdio.h>
#include <string.h>
#include <linux/mman.h>
#include <sys/mman.h>
#define SIZE 8192UL * 1024UL * 1024UL
int main(int argc, char **argv) {
long tmp;
char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
memset(p, 0, SIZE);
madvise(p, SIZE, MADV_PAGEOUT);
for (unsigned long i = 0; i < SIZE; ++i)
tmp += p[i];
getchar(); /* Pause */
return 0;
}

Setup an 8G non ramdisk swap, the first run of the program will swapout 8G
ram successfully. But run same program again after the first run paused,
the second run can't swapout all 8G memory as now half of the swap device
is pinned by HAS_CACHE. There was a random scan in the old allocator that
may reclaim part of the HAS_CACHE by luck, but it's unreliable.

The new allocator's added reclaim of full clusters when device is low on
usable slots. But when multiple CPUs are seeing the device is low on
usable slots at the same time, they ran into a thundering herd problem.

This is an observable problem on large machine with mass parallel
workload, as full cluster reclaim is slower on large swap device and
higher number of CPUs will also make things worse.

Testing using a 128G ZRAM on a 48c96t system. When the swap device is
very close to full (eg. 124G / 128G), running build linux kernel with
make -j96 in a 1G memory cgroup will hung (not a softlockup though)
spinning in full cluster reclaim for about ~5min before go OOM.

To solve this, split the full reclaim into two parts:

- Instead of do a synchronous aggressively reclaim when device is low,
do only one aggressively reclaim when device is strictly full with a
kworker. This still ensures in worst case the device won't be unusable
because of HAS_CACHE slots.

- To avoid allocation (especially higher order) suffer from HAS_CACHE
filling up clusters and kworker not responsive enough, do one synchronous
scan every time the free list is drained, and only scan one cluster. This
is kind of similar to the random reclaim before, keeps the full clusters
rotated and has a minimal latency. This should provide a fair reclaim
strategy suitable for most workloads.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 2cacbdfdee65 ("mm: swap: add a adaptive full cluster cache reclaim")
Signed-off-by: Kairui Song <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5
# 0ca0c24e 23-Aug-2024 Usama Arif <[email protected]>

mm: store zero pages to be swapped out in a bitmap

Patch series "mm: store zero pages to be swapped out in a bitmap", v8.

As shown in the patch series that introduced the zswap same-filled
optimiza

mm: store zero pages to be swapped out in a bitmap

Patch series "mm: store zero pages to be swapped out in a bitmap", v8.

As shown in the patch series that introduced the zswap same-filled
optimization [1], 10-20% of the pages stored in zswap are same-filled.
This is also observed across Meta's server fleet. By using VM counters in
swap_writepage (not included in this patchseries) it was found that less
than 1% of the same-filled pages to be swapped out are non-zero pages.

For conventional swap setup (without zswap), rather than reading/writing
these pages to flash resulting in increased I/O and flash wear, a bitmap
can be used to mark these pages as zero at write time, and the pages can
be filled at read time if the bit corresponding to the page is set.

When using zswap with swap, this also means that a zswap_entry does not
need to be allocated for zero filled pages resulting in memory savings
which would offset the memory used for the bitmap.

A similar attempt was made earlier in [2] where zswap would only track
zero-filled pages instead of same-filled. This patchseries adds
zero-filled pages optimization to swap (hence it can be used even if zswap
is disabled) and removes the same-filled code from zswap (as only 1% of
the same-filled pages are non-zero), simplifying code.

[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
[2] https://lore.kernel.org/lkml/[email protected]/


This patch (of 2):

Approximately 10-20% of pages to be swapped out are zero pages [1].
Rather than reading/writing these pages to flash resulting
in increased I/O and flash wear, a bitmap can be used to mark these
pages as zero at write time, and the pages can be filled at
read time if the bit corresponding to the page is set.
With this patch, NVMe writes in Meta server fleet decreased
by almost 10% with conventional swap setup (zswap disabled).

[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Usama Arif <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.11-rc4
# 65018076 12-Aug-2024 Baolin Wang <[email protected]>

mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting

Patch series "support large folio swap-out and swap-in for shmem", v5.

Shmem will support large folio allocation [1]

mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting

Patch series "support large folio swap-out and swap-in for shmem", v5.

Shmem will support large folio allocation [1] [2] to get a better
performance, however, the memory reclaim still splits the precious large
folios when trying to swap-out shmem, which may lead to the memory
fragmentation issue and can not take advantage of the large folio for
shmeme.

Moreover, the swap code already supports for swapping out large folio
without split, and large folio swap-in[3] series is queued into
mm-unstable branch. Hence this patch set also supports the large folio
swap-out and swap-in for shmem.


This patch (of 9):

To support shmem large folio swap operations, add a new parameter to
swap_shmem_alloc() that allows batch SWAP_MAP_SHMEM flag setting for shmem
swap entries.

While we are at it, using folio_nr_pages() to get the number of pages of
the folio as a preparation.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/99f64115d04b285e009580eb177352c57119ffd0.1723434324.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Daniel Gomez <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.11-rc3, v6.11-rc2
# 2cacbdfd 31-Jul-2024 Kairui Song <[email protected]>

mm: swap: add a adaptive full cluster cache reclaim

Link all full cluster with one full list, and reclaim from it when the
allocation have ran out of all usable clusters.

There are many reason a fo

mm: swap: add a adaptive full cluster cache reclaim

Link all full cluster with one full list, and reclaim from it when the
allocation have ran out of all usable clusters.

There are many reason a folio can end up being in the swap cache while
having no swap count reference. So the best way to search for such slots
is still by iterating the swap clusters.

With the list as an LRU, iterating from the oldest cluster and keep them
rotating is a very doable and clean way to free up potentially not inuse
clusters.

When any allocation failure, try reclaim and rotate only one cluster.
This is adaptive for high order allocations they can tolerate fallback.
So this avoids latency, and give the full cluster list an fair chance to
get reclaimed. It release the usage stress for the fallback order 0
allocation or following up high order allocation.

If the swap device is getting very full, reclaim more aggresively to
ensure no OOM will happen. This ensures order 0 heavy workload won't go
OOM as order 0 won't fail if any cluster still have any space.

[[email protected]: fix discard of full cluster]
Link: https://lkml.kernel.org/r/CAMgjq7CWwK75_2Zi5P40K08pk9iqOcuWKL6khu=x4Yg_nXaQag@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kairui Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 661383c6 31-Jul-2024 Kairui Song <[email protected]>

mm: swap: relaim the cached parts that got scanned

This commit implements reclaim during scan for cluster allocator.

Cluster scanning were unable to reuse SWAP_HAS_CACHE slots, which could
result i

mm: swap: relaim the cached parts that got scanned

This commit implements reclaim during scan for cluster allocator.

Cluster scanning were unable to reuse SWAP_HAS_CACHE slots, which could
result in low allocation success rate or early OOM.

So to ensure maximum allocation success rate, integrate reclaiming with
scanning. If found a range of suitable swap slots but fragmented due to
HAS_CACHE, just try to reclaim the slots.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 477cb7ba 31-Jul-2024 Kairui Song <[email protected]>

mm: swap: add a fragment cluster list

Now swap cluster allocator arranges the clusters in LRU style, so the
"cold" cluster stay at the head of nonfull lists are the ones that were
used for allocatio

mm: swap: add a fragment cluster list

Now swap cluster allocator arranges the clusters in LRU style, so the
"cold" cluster stay at the head of nonfull lists are the ones that were
used for allocation long time ago and still partially occupied. So if
allocator can't find enough contiguous slots to satisfy an high order
allocation, it's unlikely there will be slot being free on them to satisfy
the allocation, at least in a short period.

As a result, nonfull cluster scanning will waste time repeatly scanning
the unusable head of the list.

Also, multiple CPUs could content on the same head cluster of nonfull
list. Unlike free clusters which are removed from the list when any CPU
starts using it, nonfull cluster stays on the head.

So introduce a new list frag list, all scanned nonfull clusters will be
moved to this list. Both for avoiding repeated scanning and contention.

Frag list is still used as fallback for allocations, so if one CPU failed
to allocate one order of slots, it can still steal other CPU's clusters.
And order 0 will favor the fragmented clusters to better protect nonfull
clusters

If any slots on a fragment list are being freed, move the fragment list
back to nonfull list indicating it worth another scan on the cluster.
Compared to scan upon freeing a slot, this keep the scanning lazy and save
some CPU if there are still other clusters to use.

It may seems unneccessay to keep the fragmented cluster on list at all if
they can't be used for specific order allocation. But this will start to
make sense once reclaim dring scanning is ready.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# d07a46a4 31-Jul-2024 Chris Li <[email protected]>

mm: swap: mTHP allocate swap entries from nonfull list

Track the nonfull cluster as well as the empty cluster on lists. Each
order has one nonfull cluster list.

The cluster will remember which ord

mm: swap: mTHP allocate swap entries from nonfull list

Track the nonfull cluster as well as the empty cluster on lists. Each
order has one nonfull cluster list.

The cluster will remember which order it was used during new cluster
allocation.

When the cluster has free entry, add to the nonfull[order] list.  When
the free cluster list is empty, also allocate from the nonempty list of
that order.

This improves the mTHP swap allocation success rate.

There are limitations if the distribution of numbers of different orders
of mTHP changes a lot. e.g. there are a lot of nonfull cluster assign to
order A while later time there are a lot of order B allocation while very
little allocation in order A. Currently the cluster used by order A will
not reused by order B unless the cluster is 100% empty.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Li <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 73ed0baa 31-Jul-2024 Chris Li <[email protected]>

mm: swap: swap cluster switch to double link list

Patch series "mm: swap: mTHP swap allocator base on swap cluster order",
v5.

This is the short term solutions "swap cluster order" listed in my "Sw

mm: swap: swap cluster switch to double link list

Patch series "mm: swap: mTHP swap allocator base on swap cluster order",
v5.

This is the short term solutions "swap cluster order" listed in my "Swap
Abstraction" discussion slice 8 in the recent LSF/MM conference.

When commit 845982eb264bc "mm: swap: allow storage of all mTHP orders" is
introduced, it only allocates the mTHP swap entries from the new empty
cluster list.  It has a fragmentation issue reported by Barry.

https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/

The reason is that all the empty clusters have been exhausted while there
are plenty of free swap entries in the cluster that are not 100% free.

Remember the swap allocation order in the cluster. Keep track of the per
order non full cluster list for later allocation.

This series gives the swap SSD allocation a new separate code path from
the HDD allocation. The new allocator use cluster list only and do not
global scan swap_map[] without lock any more.

This streamline the swap allocation for SSD. The code matches the
execution flow much better.

User impact: For users that allocate and free mix order mTHP swapping, It
greatly improves the success rate of the mTHP swap allocation after the
initial phase.

It also performs faster when the swapfile is close to full, because the
allocator can get the non full cluster from a list rather than scanning a
lot of swap_map entries. 

With Barry's mthp test program V2:

Without:
$ ./thp_swap_allocator_test -a
Iteration 1: swpout inc: 32, swpout fallback inc: 192, Fallback percentage: 85.71%
Iteration 2: swpout inc: 0, swpout fallback inc: 231, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
...
Iteration 98: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 215, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%

$ ./thp_swap_allocator_test -a -s
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%

$ ./thp_swap_allocator_test -s
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%

$ ./thp_swap_allocator_test
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%

With: # with all 0.00% filter out
$ ./thp_swap_allocator_test -a | grep -v "0.00%"
$ # all result are 0.00%

$ ./thp_swap_allocator_test -a -s | grep -v "0.00%"
./thp_swap_allocator_test -a -s | grep -v "0.00%"
Iteration 14: swpout inc: 223, swpout fallback inc: 3, Fallback percentage: 1.33%
Iteration 19: swpout inc: 219, swpout fallback inc: 7, Fallback percentage: 3.10%
Iteration 28: swpout inc: 225, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 29: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 34: swpout inc: 220, swpout fallback inc: 8, Fallback percentage: 3.51%
Iteration 35: swpout inc: 222, swpout fallback inc: 11, Fallback percentage: 4.72%
Iteration 38: swpout inc: 217, swpout fallback inc: 4, Fallback percentage: 1.81%
Iteration 40: swpout inc: 222, swpout fallback inc: 6, Fallback percentage: 2.63%
Iteration 42: swpout inc: 221, swpout fallback inc: 2, Fallback percentage: 0.90%
Iteration 43: swpout inc: 215, swpout fallback inc: 7, Fallback percentage: 3.15%
Iteration 47: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 49: swpout inc: 217, swpout fallback inc: 1, Fallback percentage: 0.46%
Iteration 52: swpout inc: 221, swpout fallback inc: 8, Fallback percentage: 3.49%
Iteration 56: swpout inc: 224, swpout fallback inc: 4, Fallback percentage: 1.75%
Iteration 58: swpout inc: 214, swpout fallback inc: 5, Fallback percentage: 2.28%
Iteration 62: swpout inc: 220, swpout fallback inc: 3, Fallback percentage: 1.35%
Iteration 64: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 67: swpout inc: 221, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 75: swpout inc: 220, swpout fallback inc: 9, Fallback percentage: 3.93%
Iteration 82: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 86: swpout inc: 211, swpout fallback inc: 12, Fallback percentage: 5.38%
Iteration 89: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 93: swpout inc: 220, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 94: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 96: swpout inc: 221, swpout fallback inc: 6, Fallback percentage: 2.64%
Iteration 98: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 99: swpout inc: 227, swpout fallback inc: 3, Fallback percentage: 1.30%

$ ./thp_swap_allocator_test
./thp_swap_allocator_test
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%

$ ./thp_swap_allocator_test -s
./thp_swap_allocator_test -s
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 97, swpout fallback inc: 135, Fallback percentage: 58.19%
Iteration 3: swpout inc: 42, swpout fallback inc: 192, Fallback percentage: 82.05%
Iteration 4: swpout inc: 19, swpout fallback inc: 214, Fallback percentage: 91.85%
Iteration 5: swpout inc: 12, swpout fallback inc: 213, Fallback percentage: 94.67%
Iteration 6: swpout inc: 11, swpout fallback inc: 217, Fallback percentage: 95.18%
Iteration 7: swpout inc: 9, swpout fallback inc: 214, Fallback percentage: 95.96%
Iteration 8: swpout inc: 8, swpout fallback inc: 213, Fallback percentage: 96.38%
Iteration 9: swpout inc: 2, swpout fallback inc: 223, Fallback percentage: 99.11%
Iteration 10: swpout inc: 2, swpout fallback inc: 228, Fallback percentage: 99.13%
Iteration 11: swpout inc: 4, swpout fallback inc: 214, Fallback percentage: 98.17%
Iteration 12: swpout inc: 5, swpout fallback inc: 226, Fallback percentage: 97.84%
Iteration 13: swpout inc: 3, swpout fallback inc: 212, Fallback percentage: 98.60%
Iteration 14: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
Iteration 15: swpout inc: 3, swpout fallback inc: 222, Fallback percentage: 98.67%
Iteration 16: swpout inc: 4, swpout fallback inc: 223, Fallback percentage: 98.24%

=========
Kernel compile under tmpfs with cgroup memory.max = 470M.
12 core 24 hyperthreading, 32 jobs. 10 Run each group

SSD swap 10 runs average, 20G swap partition:
With:
user 2929.064
system 1479.381 : 1376.89 1398.22 1444.64 1477.39 1479.04 1497.27
1504.47 1531.4 1532.92 1551.57
real 1441.324

Without:
user 2910.872
system 1482.732 : 1440.01 1451.4 1462.01 1467.47 1467.51 1469.3
1470.19 1496.32 1544.1 1559.01
real 1580.822

Two zram swap: zram0 3.0G zram1 20G.

The idea is forcing the zram0 almost full then overflow to zram1:

With:
user 4320.301
system 4272.403 : 4236.24 4262.81 4264.75 4269.13 4269.44 4273.06
4279.85 4285.98 4289.64 4293.13
real 431.759

Without
user 4301.393
system 4387.672 : 4374.47 4378.3 4380.95 4382.84 4383.06 4388.05
4389.76 4397.16 4398.23 4403.9
real 433.979

------ more test result from Kaiui ----------

Test with build linux kernel using a 4G ZRAM, 1G memory.max limit on top of shmem:

System info: 32 Core AMD Zen2, 64G total memory.

Test 3 times using only 4K pages:
=================================

With:
-----
1838.74user 2411.21system 2:37.86elapsed 2692%CPU (0avgtext+0avgdata 847060maxresident)k
1839.86user 2465.77system 2:39.35elapsed 2701%CPU (0avgtext+0avgdata 847060maxresident)k
1840.26user 2454.68system 2:39.43elapsed 2693%CPU (0avgtext+0avgdata 847060maxresident)k

Summary (~4.6% improment of system time):
User: 1839.62
System: 2443.89: 2465.77 2454.68 2411.21
Real: 158.88

Without:
--------
1837.99user 2575.95system 2:43.09elapsed 2706%CPU (0avgtext+0avgdata 846520maxresident)k
1838.32user 2555.15system 2:42.52elapsed 2709%CPU (0avgtext+0avgdata 846520maxresident)k
1843.02user 2561.55system 2:43.35elapsed 2702%CPU (0avgtext+0avgdata 846520maxresident)k

Summary:
User: 1839.78
System: 2564.22: 2575.95 2555.15 2561.55
Real: 162.99

Test 5 times using enabled all mTHP pages:
==========================================

With:
-----
1796.44user 2937.33system 2:59.09elapsed 2643%CPU (0avgtext+0avgdata 846936maxresident)k
1802.55user 3002.32system 2:54.68elapsed 2750%CPU (0avgtext+0avgdata 847072maxresident)k
1806.59user 2986.53system 2:55.17elapsed 2736%CPU (0avgtext+0avgdata 847092maxresident)k
1803.27user 2982.40system 2:54.49elapsed 2742%CPU (0avgtext+0avgdata 846796maxresident)k
1807.43user 3036.08system 2:56.06elapsed 2751%CPU (0avgtext+0avgdata 846488maxresident)k

Summary (~8.4% improvement of system time):
User: 1803.25
System: 2988.93: 2937.33 3002.32 2986.53 2982.40 3036.08
Real: 175.90

mTHP swapout status:
/sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout:347721
/sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout_fallback:3110
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout:3365
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout_fallback:8269
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout:24
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout_fallback:3341
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout:145
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout_fallback:5038
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout:322737
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback:36808
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout:380455
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout_fallback:1010
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout:24973
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout_fallback:13223
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout:197348
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout_fallback:80541

Without:
--------
1794.41user 3151.29system 3:05.97elapsed 2659%CPU (0avgtext+0avgdata 846704maxresident)k
1810.27user 3304.48system 3:05.38elapsed 2759%CPU (0avgtext+0avgdata 846636maxresident)k
1809.84user 3254.85system 3:03.83elapsed 2755%CPU (0avgtext+0avgdata 846952maxresident)k
1813.54user 3259.56system 3:04.28elapsed 2752%CPU (0avgtext+0avgdata 846848maxresident)k
1829.97user 3338.40system 3:07.32elapsed 2759%CPU (0avgtext+0avgdata 847024maxresident)k

Summary:
User: 1811.61
System: 3261.72 : 3151.29 3304.48 3254.85 3259.56 3338.40
Real: 185.356

mTHP swapout status:
hugepages-32kB/stats/swpout:35630
hugepages-32kB/stats/swpout_fallback:1809908
hugepages-512kB/stats/swpout:523
hugepages-512kB/stats/swpout_fallback:55235
hugepages-2048kB/stats/swpout:53
hugepages-2048kB/stats/swpout_fallback:17264
hugepages-1024kB/stats/swpout:85
hugepages-1024kB/stats/swpout_fallback:24979
hugepages-64kB/stats/swpout:30117
hugepages-64kB/stats/swpout_fallback:1825399
hugepages-16kB/stats/swpout:42775
hugepages-16kB/stats/swpout_fallback:1951123
hugepages-256kB/stats/swpout:2326
hugepages-256kB/stats/swpout_fallback:170165
hugepages-128kB/stats/swpout:17925
hugepages-128kB/stats/swpout_fallback:1309757


This patch (of 9):

Previously, the swap cluster used a cluster index as a pointer to
construct a custom single link list type "swap_cluster_list". The next
cluster pointer is shared with the cluster->count. It prevents puting the
non free cluster into a list.

Change the cluster to use the standard double link list instead. This
allows tracing the nonfull cluster in the follow up patch. That way, it
is faster to get to the nonfull cluster of that order.

Remove the cluster getter/setter for accessing the cluster struct member.

The list operation is protected by the swap_info_struct->lock.

Change cluster code to use "struct swap_cluster_info *" to reference the
cluster rather than by using index. That is more consistent with the list
manipulation. It avoids the repeat adding index to the cluser_info. The
code is easier to understand.

Remove the cluster next pointer is NULL flag, the double link list can
handle the empty list pretty well.

The "swap_cluster_info" struct is two pointer bigger, because 512 swap
entries share one swap_cluster_info struct, it has very little impact on
the average memory usage per swap entry. For 1TB swapfile, the swap
cluster data structure increases from 8MB to 24MB.

Other than the list conversion, there is no real function change in this
patch.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Li <[email protected]>
Reported-by: Barry Song <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


12345678910>>...17