Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6 |
ff22f929 | 06-Mar-2025 | Nhat Pham <[email protected]>
page_io: zswap: do not crash the kernel on decompression failure
Currently, we crash the kernel when a decompression failure occurs in zswap (either because of memory corruption, or a bug in the compression algorithm). This is overkill. We should only SIGBUS the unfortunate process asking for the zswap entry on zswap load, and skip the corrupted entry in zswap writeback.
See [1] for a recent upstream discussion about this.
The zswap writeback case is relatively straightforward to fix. For the zswap_load() case, we change the return behavior:
* Return 0 on success.
* Return -ENOENT (with the folio locked) if zswap does not own the swapped out content.
* Return -EIO if zswap owns the swapped out content, but encounters a decompression failure for some reason. The folio will be unlocked, but not marked up-to-date, which will eventually cause the process requesting the page to SIGBUS (see the handling of not-up-to-date folios in do_swap_page() in mm/memory.c), without crashing the kernel.
* Return -EINVAL if we encounter a large folio, as large folios should not be swapped in while zswap is being used. Similar to the -EIO case, we also unlock the folio but do not mark it up-to-date, to SIGBUS the faulting process.
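A small, self-contained sketch of how a caller might branch on this return contract (the types, the stub, and the helper names are illustrative assumptions, not the actual mm/page_io.c or zswap code):

#include <errno.h>
#include <stdio.h>

struct folio { int uptodate; };

/* stand-in for zswap_load(): returns 0, -ENOENT, -EIO or -EINVAL per the list above */
static int zswap_load_stub(struct folio *folio, int outcome)
{
    if (outcome == 0)
        folio->uptodate = 1;    /* decompressed successfully */
    return outcome;
}

static const char *swap_read_folio_sketch(struct folio *folio, int outcome)
{
    int ret = zswap_load_stub(folio, outcome);

    if (ret == 0)
        return "served from zswap";
    if (ret == -ENOENT)
        return "not in zswap: read from the backing swap device";
    /*
     * -EIO / -EINVAL: zswap owns the data but cannot produce it.
     * The folio is left !uptodate, so the faulting task gets SIGBUS
     * in do_swap_page() instead of the kernel crashing.
     */
    return "left !uptodate: faulting task will get SIGBUS";
}

int main(void)
{
    struct folio f = { 0 };

    puts(swap_read_folio_sketch(&f, 0));
    puts(swap_read_folio_sketch(&f, -ENOENT));
    puts(swap_read_folio_sketch(&f, -EIO));
    return 0;
}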
As a side effect, we require one extra zswap tree traversal in the load and writeback paths. Quick benchmarking on a kernel build test shows no performance difference:
With the new scheme:
real: mean: 125.1s, stdev: 0.12s
user: mean: 3265.23s, stdev: 9.62s
sys: mean: 2156.41s, stdev: 13.98s

The old scheme:
real: mean: 125.78s, stdev: 0.45s
user: mean: 3287.18s, stdev: 5.95s
sys: mean: 2177.08s, stdev: 26.52s
[[email protected]: fix documentation of zswap_load()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/[email protected]/ [1]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nhat Pham <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Suggested-by: Yosry Ahmed <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2 |
6e1fa555 | 01-Oct-2024 | Kanchana P Sridhar <[email protected]>
mm: zswap: modify zswap_stored_pages to be atomic_long_t
For zswap_store() to support large folios, we need to be able to do a batch update of zswap_stored_pages upon successful store of all pages in the folio. For this, we need to add folio_nr_pages(), which returns a long, to zswap_stored_pages.
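A toy model of the counter change (user-space C with C11 atomics standing in for the kernel's atomic_long_t; names are illustrative, not the kernel code): the point is that a long-typed atomic lets one batched add cover a whole large folio.

#include <stdatomic.h>
#include <stdio.h>

static atomic_long zswap_stored_pages = 0;   /* was an int-sized counter */

static void zswap_store_folio_sketch(long nr_pages)
{
    /* one batched update instead of nr_pages single-page updates */
    atomic_fetch_add(&zswap_stored_pages, nr_pages);
}

int main(void)
{
    zswap_store_folio_sketch(16);    /* e.g. a 64K folio of 4K pages */
    printf("%ld pages stored\n", atomic_load(&zswap_stored_pages));
    return 0;
}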
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kanchana P Sridhar <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Wajdi Feghali <[email protected]>
Cc: "Zou, Nanhai" <[email protected]>
Cc: Barry Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3 |
e31c38e0 | 05-Aug-2024 | Nhat Pham <[email protected]>
zswap: implement a second chance algorithm for dynamic zswap shrinker
Patch series "improving dynamic zswap shrinker protection scheme", v3.
When experimenting with the memory-pressure based (i.e. "dynamic") zswap shrinker in production, we observed a sharp increase in the number of swapins, which led to a performance regression. We were able to trace this regression to the following problems with the shrinker's warm pages protection scheme:

1. The protection decays way too rapidly, and the decaying is coupled with zswap stores, leading to anomalous patterns, in which a small batch of zswap stores effectively erases all the protection in place for the warmer pages in the zswap LRU.
This observation has also been corroborated upstream by Takero Funaki (in [1]).
2. We inaccurately track the number of swapped in pages, missing the non-pivot pages that are part of the readahead window, while counting the pages that are found in the zswap pool.
To alleviate these two issues, this patch series improves the dynamic zswap shrinker in the following manner:
1. Replace the protection size tracking scheme with a second chance algorithm. This new scheme removes the need for haphazard stats decaying, and automatically adjusts the pace of pages aging with memory pressure, and writeback rate with pool activities: slowing down when the pool is dominated with zswpouts, and speeding up when the pool is dominated with stale entries.
2. Fix the tracking of the number of swapins to take into account non-pivot pages in the readahead window.
With these two changes in place, in a kernel-building benchmark without any cold data added, the number of swapins is reduced by 64.12%. This translates to a 10.32% reduction in build time. We also observe a 3% reduction in kernel CPU time.

In another benchmark, with cold data added (to gauge the new algorithm's ability to offload cold data), the new second chance scheme outperforms the old protection scheme by around 0.7%, and actually writes back around 21% more pages to the backing swap device. So the new scheme is just as good as, if not better than, the old scheme on this front as well.
[1]: https://lore.kernel.org/linux-mm/CAPpodddcGsK=0Xczfuk8usgZ47xeyf4ZjiofdT+ujiyz6V2pFQ@mail.gmail.com/
This patch (of 2):
The current zswap shrinker's heuristics to prevent overshrinking are brittle and inaccurate, specifically in the way we decay the protection size (i.e. making pages in the zswap LRU eligible for reclaim).
We currently decay protection aggressively in zswap_lru_add() calls. This leads to the following unfortunate effect: when a new batch of pages enter zswap, the protection size rapidly decays to below 25% of the zswap LRU size, which is way too low.
We have observed this effect in production, when experimenting with the zswap shrinker: the rate of shrinking shoots up massively right after a new batch of zswap stores. This is somewhat the opposite of what we want originally - when new pages enter zswap, we want to protect both these new pages AND the pages that are already protected in the zswap LRU.
Replace the existing heuristics with a second chance algorithm:

1. When a new zswap entry is stored in the zswap pool, its referenced bit is set.
2. When the zswap shrinker encounters a zswap entry with the referenced bit set, give it a second chance - only flip the referenced bit and rotate it on the LRU.
3. If the shrinker encounters the entry again, this time with its referenced bit unset, then it can reclaim the entry.

In this manner, the aging of the pages in the zswap LRUs is decoupled from zswap stores, and picks up the pace with increasing memory pressure (which is what we want).

The second chance scheme allows us to modulate the writeback rate based on recent pool activities. Entries that recently entered the pool will be protected, so if the pool is dominated by such entries the writeback rate will reduce proportionally, protecting the workload's workingset. On the other hand, stale entries will be written back quickly, which increases the effective writeback rate.
The referenced bit is added at the hole after the `length` field of struct zswap_entry, so there is no extra space overhead for this algorithm.
We will still maintain the count of swapins, which is consumed and subtracted from the LRU size in zswap_shrinker_count(), to further penalize past overshrinking that led to disk swapins. The idea is that had we considered this many more pages in the LRU active/protected, they would not have been written back and we would not have had to swap them in.
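A compact user-space model of the second chance decision described above (the struct and function names are made up for illustration, not the kernel's zswap_entry or shrinker callbacks):

#include <stdbool.h>
#include <stdio.h>

struct entry {
    bool referenced;   /* set when the entry is stored */
};

enum action { ROTATE, RECLAIM };

/* what the shrinker does when it meets an entry at the tail of the LRU */
static enum action shrink_one_sketch(struct entry *e)
{
    if (e->referenced) {
        e->referenced = false;   /* second chance: age it instead of reclaiming */
        return ROTATE;           /* move it back to the head of the LRU */
    }
    return RECLAIM;              /* seen twice with no new store: write it back */
}

int main(void)
{
    struct entry e = { .referenced = true };   /* freshly stored entry */

    printf("first pass:  %s\n", shrink_one_sketch(&e) == ROTATE ? "rotate" : "reclaim");
    printf("second pass: %s\n", shrink_one_sketch(&e) == ROTATE ? "rotate" : "reclaim");
    return 0;
}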
To test this new heuristic, I built the kernel under a cgroup with memory.max set to 2G, on a host with 36 cores:
With the old shrinker:
real: 263.89s
user: 4318.11s
sys: 673.29s
swapins: 227300.5
With the second chance algorithm:
real: 244.85s
user: 4327.22s
sys: 664.39s
swapins: 94663
(average over 5 runs)
We observe a 1.3% reduction in kernel CPU usage, and around a 7.2% reduction in real time. Note that the number of swapped in pages dropped by 58%.
[[email protected]: fix a small mistake in the referenced bit documentation]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nhat Pham <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Takero Funaki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6 |
259043e3 | 29-Jun-2024 | Barry Song <[email protected]>
mm: zswap: fix zswap_never_enabled() for CONFIG_ZSWAP==N
If CONFIG_ZSWAP is set to N, it means zswap cannot be enabled. zswap_never_enabled() should return true.
The only effect of this issue is that with Barry's latest large folio swapin patches for zram ("mm: support mTHP swap-in for zRAM-like swapfile"), we will always fallback to order-0 swapin, even mistakenly when !CONFIG_ZSWAP.
Basically this bug makes Barry's in progress patches not work at all.
The API was created to inform the mm core that zswap has never been enabled, allowing the mm core to perform mTHP swap-in. This is a transitional solution until zswap supports mTHP. If zswap has been enabled, performing mTHP swap-in will result in corrupted data. You may find the answer in the mTHP swap-in series:
https://lore.kernel.org/linux-mm/CAJD7tkZ4FQr6HZpduOdvmqgg_-whuZYE-Bz5O2t6yzw6Yg+v1A@mail.gmail.com/
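A sketch of the intended semantics (the guard macro and stub below illustrate the pattern, they are not a copy of the kernel header): when zswap is compiled out, the check must be unconditionally true so the mTHP swap-in path can be taken.

#include <stdbool.h>
#include <stdio.h>

/* #define CONFIG_ZSWAP 1 */   /* toggle to model a zswap-enabled build */

#ifdef CONFIG_ZSWAP
static bool zswap_ever_enabled;           /* flipped when zswap is first enabled */

static bool zswap_never_enabled(void)
{
    return !zswap_ever_enabled;
}
#else
static bool zswap_never_enabled(void)
{
    return true;   /* the fix: with CONFIG_ZSWAP=n, zswap can never be enabled */
}
#endif

int main(void)
{
    /* mTHP swap-in is only safe when zswap has never been enabled */
    printf("large folio swap-in allowed: %d\n", zswap_never_enabled());
    return 0;
}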
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0300e17d67c3 ("mm: zswap: add zswap_never_enabled()")
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Acked-by: Chris Li <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.10-rc5, v6.10-rc4 |
2d4d2b1c | 11-Jun-2024 | Yosry Ahmed <[email protected]>
mm: zswap: add zswap_never_enabled()
Add zswap_never_enabled() to skip the xarray lookup in zswap_load() if zswap was never enabled on the system. It is implemented using static branches for efficiency, as enabling zswap should be a rare event. This could shave some cycles off zswap_load() when CONFIG_ZSWAP is used but zswap is never enabled.
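A user-space model of the fast-path check (a plain flag stands in for the static branch that makes the never-enabled case effectively free in the kernel; all names are illustrative):

#include <stdbool.h>
#include <stdio.h>

/* stands in for a static key that is flipped once, when zswap is first enabled */
static bool zswap_ever_enabled;

static bool zswap_never_enabled(void)
{
    return !zswap_ever_enabled;
}

static const char *zswap_load_sketch(void)
{
    if (zswap_never_enabled())
        return "skip the xarray lookup entirely";
    return "look the entry up in the zswap tree";
}

int main(void)
{
    printf("%s\n", zswap_load_sketch());
    zswap_ever_enabled = true;          /* e.g. zswap.enabled was turned on */
    printf("%s\n", zswap_load_sketch());
    return 0;
}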
However, the real motivation behind this patch is two-fold:

- Incoming large folio swapin work will need to fall back to order-0 folios if zswap was ever enabled, because any part of the folio could be in zswap, until proper handling of large folios with zswap is added.

- A warning and recovery attempt will be added in a following change in case the above was not done correctly. Zswap will fail the read if the folio is large and it was ever enabled.
Expose zswap_never_enabled() in the header for the swapin work to use it later.
[[email protected]: expose zswap_never_enabled() in the header]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Chris Li <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2b33a97c | 11-Jun-2024 | Yosry Ahmed <[email protected]>
mm: zswap: rename is_zswap_enabled() to zswap_is_enabled()
In preparation for introducing a similar function, rename is_zswap_enabled() to use zswap_* prefix like other zswap functions.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Chris Li <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1 |
91cdcd8d | 12-Mar-2024 | Johannes Weiner <[email protected]>
mm: zswap: optimize zswap pool size tracking
Profiling the munmap() of a zswapped memory region shows 60% of the total cycles currently going into updating the zswap_pool_total_size.
There are three consumers of this counter:
- store, to enforce the globally configured pool limit
- meminfo & debugfs, to report the size to the user
- shrink, to determine the batch size for each cycle
Instead of aggregating every time an entry enters or exits the zswap pool, aggregate the value from the zpools on-demand:
- Stores aggregate the counter anyway upon success. Aggregating to check the limit instead is the same amount of work.
- Meminfo & debugfs might benefit somewhat from a pre-aggregated counter, but aren't exactly hotpaths.
- Shrinking can aggregate once for every cycle instead of doing it for every freed entry. As the shrinker might work on tens or hundreds of objects per scan cycle, this is a large reduction in aggregations.
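A toy model of the on-demand aggregation (user-space C with made-up names, not the kernel code): the global counter disappears, and the total is summed from the per-zpool sizes only when someone actually asks for it.

#include <stddef.h>
#include <stdio.h>

#define NR_ZPOOLS 4

static unsigned long zpool_size[NR_ZPOOLS];   /* bytes held by each pool */

/* called by store (limit check), meminfo/debugfs, and the shrinker */
static unsigned long zswap_total_size_sketch(void)
{
    unsigned long total = 0;

    for (size_t i = 0; i < NR_ZPOOLS; i++)
        total += zpool_size[i];               /* aggregate on demand */
    return total;
}

int main(void)
{
    zpool_size[0] = 4096;    /* a store or free only touches its own pool */
    zpool_size[2] = 8192;
    printf("%lu bytes in zswap\n", zswap_total_size_sketch());
    return 0;
}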
The paths that benefit dramatically are swapin, swapoff, and unmaps. There could be millions of pages being processed until somebody asks for the pool size again. This eliminates the pool size updates from those paths entirely.
Top profile entries for a 24G range munmap(), before:
38.54% zswap-unmap [kernel.kallsyms] [k] zs_zpool_total_size
12.51% zswap-unmap [kernel.kallsyms] [k] zpool_get_total_size
 9.10% zswap-unmap [kernel.kallsyms] [k] zswap_update_total_size
 2.95% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap
 2.88% zswap-unmap [kernel.kallsyms] [k] __slab_free
 2.86% zswap-unmap [kernel.kallsyms] [k] xas_store
and after:
7.70% zswap-unmap [kernel.kallsyms] [k] __slab_free
7.16% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap
6.74% zswap-unmap [kernel.kallsyms] [k] xas_store
It was also briefly considered to move to a single atomic in zswap that is updated by the backends, since zswap only cares about the sum of all pools anyway. However, zram directly needs per-pool information out of zsmalloc. To keep the backend from having to update two atomics every time, I opted for the lazy aggregation instead for now.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3 |
0827a1fb | 04-Feb-2024 | Chengming Zhou <[email protected]>
mm/zswap: invalidate zswap entry when swap entry free
During testing I found that zswap_writeback_entry() sometimes returns -ENOMEM, which is not what we expected:

bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
@[-12]: 1563
@[0]: 277221

The reason is that __read_swap_cache_async() returns NULL because swapcache_prepare() failed. That, in turn, happens because we don't invalidate the zswap entry when the swap entry is freed to the per-cpu pool, so these zswap entries are still on the zswap tree and LRU list.

This patch moves the invalidation ahead to the point when the swap entry is freed to the per-cpu pool, since there is no benefit to leaving trashy zswap entries on the tree and LRU list.
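A toy model of the ordering change (illustrative user-space C, not the kernel code): the zswap copy is dropped as soon as the swap slot is freed, instead of lingering on the tree and LRU until writeback trips over it later.

#include <stdio.h>
#include <stdlib.h>

#define NR_SLOTS 128

static void *zswap_entry[NR_SLOTS];   /* stand-in for the zswap tree */

static void zswap_invalidate_sketch(unsigned int offset)
{
    free(zswap_entry[offset]);        /* drop compressed copy + tree/LRU links */
    zswap_entry[offset] = NULL;
}

/* called when the swap entry is freed to the per-cpu slot cache */
static void swap_slot_free_sketch(unsigned int offset)
{
    zswap_invalidate_sketch(offset);  /* new: invalidate right here */
    /* ... then return the slot to the per-cpu pool ... */
}

int main(void)
{
    zswap_entry[42] = malloc(64);     /* pretend a page was zswapped at offset 42 */
    swap_slot_free_sketch(42);
    printf("slot 42 entry after free: %p\n", zswap_entry[42]);   /* (nil) */
    return 0;
}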
With this patch:
bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
@[0]: 259744
Note: large folio can't have zswap entry for now, so don't bother to add zswap entry invalidation in the large folio swap free path.
Link: https://lkml.kernel.org/r/20240201-b4-zswap-invalidate-entry-v2-2-99d4084260a0@bytedance.com
Signed-off-by: Chengming Zhou <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.8-rc2, v6.8-rc1 |
44c7c734 | 19-Jan-2024 | Chengming Zhou <[email protected]>
mm/zswap: split zswap rb-tree
Each swapfile has one rb-tree to search the mapping of swp_entry_t to zswap_entry, protected by a single spinlock, which can cause heavy lock contention if multiple tasks zswap_store/load concurrently.

Optimize the scalability problem by splitting the zswap rb-tree into multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M), just like we did in the swap cache address_space splitting.
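A self-contained sketch of the tree-selection idea (the constant value and names are assumptions for illustration; the real kernel derives them from SWAP_ADDRESS_SPACE_PAGES):

#include <stdio.h>

#define SWAP_ADDRESS_SPACE_SHIFT 14   /* 2^14 4K pages = 64M per tree (assumed) */
#define NR_TREES 8                    /* sized from the swapfile at swapon time */

struct toy_tree { int lock; /* rb_root + spinlock in the real thing */ };

static struct toy_tree trees[NR_TREES];

/* each 64M chunk of swap offsets gets its own tree, and its own lock */
static struct toy_tree *swap_zswap_tree_sketch(unsigned long offset)
{
    return &trees[offset >> SWAP_ADDRESS_SPACE_SHIFT];
}

int main(void)
{
    unsigned long a = 100, b = 100 + (1UL << SWAP_ADDRESS_SPACE_SHIFT);

    /* offsets 64M apart land in different trees -> less lock contention */
    printf("tree(a)=%td tree(b)=%td\n",
           swap_zswap_tree_sketch(a) - trees,
           swap_zswap_tree_sketch(b) - trees);
    return 0;
}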
Although this method can't solve the spinlock contention completely, it can mitigate much of that contention. Below are the results of a kernel build in tmpfs with the zswap shrinker enabled:

        linux-next   zswap-lock-optimize
real    1m9.181s     1m3.820s
user    17m44.036s   17m40.100s
sys     7m37.297s    4m54.622s
So there are clearly improvements.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Nhat Pham <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Cc: Chris Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
bb29fd77 | 19-Jan-2024 | Chengming Zhou <[email protected]>
mm/zswap: make sure each swapfile always have zswap rb-tree
Patch series "mm/zswap: optimize the scalability of zswap rb-tree", v2.
When testing the zswap performance by using kernel build -j32 in a tmpfs directory, I found that the scalability of the zswap rb-tree is not good, since it is protected by a single spinlock. That would cause heavy lock contention if multiple tasks zswap_store/load concurrently.

So a simple solution is to split the single zswap rb-tree into multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea is from the commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
Although this method can't solve the spinlock contention completely, it can mitigate much of that contention. Below are the results of a kernel build in tmpfs with the zswap shrinker enabled:

        linux-next   zswap-lock-optimize
real    1m9.181s     1m3.820s
user    17m44.036s   17m40.100s
sys     7m37.297s    4m54.622s
So there are clearly improvements. And it's complementary with the ongoing zswap xarray conversion by Chris. Anyway, I think we can also merge this first, it's complementary IMHO. So I just refresh and resend this for further discussion.
This patch (of 2):
Not all zswap interfaces can handle the absence of the zswap rb-tree; actually, only zswap_store() handles it for now.

To make things simple, we make sure each swapfile always has the zswap rb-tree prepared before being enabled and used. The preparation is unlikely to fail in practice; this patch just makes it explicit.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Nhat Pham <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Cc: Chris Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5 |
501a06fe | 07-Dec-2023 | Nhat Pham <[email protected]>
zswap: memcontrol: implement zswap writeback disabling
During our experiment with zswap, we sometimes observe swap IOs due to occasional zswap store failures and writebacks-to-swap. These swapping IOs prevent many users who cannot tolerate swapping from adopting zswap to save memory and improve performance where possible.
This patch adds the option to disable this behavior entirely: do not write back to the backing swap device when a zswap store attempt fails, and do not write pages in the zswap pool back to the backing swap device (both when the pool is full, and when the new zswap shrinker is called).

This new behavior can be opted in/out on a per-cgroup basis via a new cgroup file. By default, writeback to the swap device is enabled, which is the previous behavior. Initially, writeback is enabled for the root cgroup, and a newly created cgroup will inherit the current setting of its parent.
Note that this is subtly different from setting memory.swap.max to 0, as it still allows for pages to be stored in the zswap pool (which itself consumes swap space in its current form).
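A sketch of the intended per-cgroup semantics (made-up struct and function names, not the kernel's memcg code): the root defaults to writeback enabled, children inherit their parent's setting, and the writeback paths consult the owning cgroup first.

#include <stdbool.h>
#include <stdio.h>

struct memcg_sketch {
    bool zswap_writeback;
    struct memcg_sketch *parent;
};

/* root defaults to enabled; a newly created cgroup inherits its parent's setting */
static void memcg_init_sketch(struct memcg_sketch *memcg, struct memcg_sketch *parent)
{
    memcg->parent = parent;
    memcg->zswap_writeback = parent ? parent->zswap_writeback : true;
}

/* consulted on store failure, pool-full eviction, and shrinker writeback */
static bool zswap_writeback_allowed(const struct memcg_sketch *memcg)
{
    return !memcg || memcg->zswap_writeback;
}

int main(void)
{
    struct memcg_sketch root, latency_sensitive;

    memcg_init_sketch(&root, NULL);
    memcg_init_sketch(&latency_sensitive, &root);
    latency_sensitive.zswap_writeback = false;   /* opt this workload out */

    printf("root: %d, latency-sensitive: %d\n",
           zswap_writeback_allowed(&root),
           zswap_writeback_allowed(&latency_sensitive));
    return 0;
}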
This patch should be applied on top of the zswap shrinker series:
https://lore.kernel.org/linux-mm/[email protected]/
as it also disables the zswap shrinker, a major source of zswap writebacks.
For the most part, this feature is motivated by internal parties who have already established their opinions regarding swapping - the workloads that are highly sensitive to IO, and especially those who are using servers with really slow disk performance (for instance, massive but slow HDDs). For these folks, it's impossible to convince them to even entertain zswap if swapping also comes as a packaged deal. Writeback disabling is quite a useful feature in these situations - on a mixed workloads deployment, they can disable writeback for the more IO-sensitive workloads, and enable writeback for other background workloads.
For instance, on a server with HDD, I allocate memories and populate them with random values (so that zswap store will always fail), and specify memory.high low enough to trigger reclaim. The time it takes to allocate the memories and just read through it a couple of times (doing silly things like computing the values' average etc.):
zswap.writeback disabled:
real 0m30.537s
user 0m23.687s
sys 0m6.637s
0 pages swapped in
0 pages swapped out

zswap.writeback enabled:
real 0m45.061s
user 0m24.310s
sys 0m8.892s
712686 pages swapped in
461093 pages swapped out
(the last two lines are from vmstat -s).
[[email protected]: add a comment about recurring zswap store failures leading to reclaim inefficiency]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nhat Pham <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Acked-by: Chris Li <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: David Heidelberg <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
96c7b0b4 | 13-Dec-2023 | Matthew Wilcox (Oracle) <[email protected]>
mm: return the folio from __read_swap_cache_async()
Patch series "More swap folio conversions".
These all seem like fairly straightforward conversions to me. A lot of compound_head() calls get removed. And page_swap_info(), which is nice.
This patch (of 13):
Move the folio->page conversion into the callers that actually want that. Most of the callers are happier with the folio anyway. If the page_allocated boolean is set, the folio allocated is of order-0, so it is safe to pass the page directly to swap_readpage().
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.7-rc4 |
b5ba474f | 30-Nov-2023 | Nhat Pham <[email protected]>
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is hit. This means that if we set the limit too high, cold data that are unlikely to be used again will reside in the pool, wasting precious memory. It is hard to predict how much zswap space will be needed ahead of time, as this depends on the workload (specifically, on factors such as memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is initiated when there is memory pressure. The shrinker does not have any parameter that must be tuned by the user, and can be opted in or out on a per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent overshrinking (i.e evicting warm pages that might be refaulted into memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory saving factor. The better zswap compresses the data, the fewer pages we will evict to swap (as we will otherwise incur IO for relatively small memory saving).
* During reclaim, if the shrinker encounters a page that is also being brought into memory, the shrinker will cautiously terminate its shrinking action, as this is a sign that it is touching the warmer region of the zswap LRU.
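A toy model of the "protect the warm region and scale freeable objects by the saving factor" idea above (the names and the exact arithmetic are assumptions for illustration, not the kernel's zswap_shrinker_count()):

#include <stdio.h>

/*
 * freeable ~= (entries on the LRU - estimated warm/protected entries),
 * scaled down by how well zswap is compressing: the better the ratio,
 * the less attractive writeback IO becomes.
 */
static unsigned long shrinker_count_sketch(unsigned long lru_size,
                                           unsigned long protected_estimate,
                                           unsigned long stored_pages,
                                           unsigned long compressed_pages)
{
    unsigned long reclaimable;

    if (lru_size <= protected_estimate || stored_pages == 0)
        return 0;

    reclaimable = lru_size - protected_estimate;
    /* e.g. 3:1 compression -> only ~1/3 of the entries are "worth" the IO */
    return reclaimable * compressed_pages / stored_pages;
}

int main(void)
{
    /* 1000 entries on the LRU, ~400 look warm, roughly 3:1 compression */
    printf("%lu freeable\n", shrinker_count_sketch(1000, 400, 900, 300));
    return 0;
}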
As a proof of concept, we ran the following synthetic benchmark: build the linux kernel in a memory-limited cgroup, and allocate some cold data in tmpfs to see if the shrinker could write them out and improved the overall performance. Depending on the amount of cold data generated, we observe from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[[email protected]: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nhat Pham <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Tested-by: Bagas Sanjaya <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Chengming Zhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
a65b0e76 | 30-Nov-2023 | Domenico Cerasuolo <[email protected]>
zswap: make shrinking memcg-aware
Currently, we only have a single global LRU for zswap. This makes it impossible to perform workload-specific shrinking - a memcg cannot determine which pages in the pool it owns, and often ends up writing pages from other memcgs. This issue has been previously observed in practice and mitigated by simply disabling memcg-initiated shrinking:
https://lore.kernel.org/all/[email protected]/T/#u
This patch fully resolves the issue by replacing the global zswap LRU with memcg- and NUMA-specific LRUs, and modifying the reclaim logic:

a) When a store attempt hits a memcg limit, it now triggers a synchronous reclaim attempt that, if successful, allows the new hotter page to be accepted by zswap.

b) If the store attempt instead hits the global zswap limit, it will trigger an asynchronous reclaim attempt, in which a memcg is selected for reclaim in a round-robin-like fashion.
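A toy model of the global-limit reclaim policy in (b) (illustrative only; the kernel walks memcgs with its own iterator/cursor machinery rather than a plain index):

#include <stdio.h>

#define NR_MEMCGS 4

static int next_victim;   /* protected by a lock in a real implementation */

/* pick the memcg whose zswap LRU the asynchronous reclaim will write back from */
static int pick_global_victim_sketch(void)
{
    int memcg = next_victim;

    next_victim = (next_victim + 1) % NR_MEMCGS;   /* round-robin-like */
    return memcg;
}

int main(void)
{
    for (int i = 0; i < 6; i++)
        printf("writeback from memcg %d\n", pick_global_victim_sketch());
    return 0;
}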
[[email protected]: use correct function for the onlineness check, use mem_cgroup_iter_break()]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: drop the pool's reference at the end of the writeback step]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Domenico Cerasuolo <[email protected]>
Co-developed-by: Nhat Pham <[email protected]>
Signed-off-by: Nhat Pham <[email protected]>
Tested-by: Bagas Sanjaya <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2 |
ca54f6d8 | 15-Jul-2023 | Matthew Wilcox (Oracle) <[email protected]>
zswap: make zswap_load() take a folio
Only convert a few easy parts of this function to use the folio passed in; convert back to struct page for the majority of it. Removes three hidden calls to compound_head().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
34f4c198 | 15-Jul-2023 | Matthew Wilcox (Oracle) <[email protected]>
zswap: make zswap_store() take a folio
Patch series "Followup folio conversions for zswap".
With frontswap killed, it's worth converting the zswap_load() and zswap_store() functions to take a folio instead of a page pointer. They aren't converted to support large folios, but there are a lot of unnecessary calls to compound_head() that are removed by these patches.
This patch (of 4):
Only convert a few easy parts of this function to use the folio passed in; convert back to struct page for the majority of it. This does remove a few hidden calls to compound_head().
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
42c06a0e | 17-Jul-2023 | Johannes Weiner <[email protected]>
mm: kill frontswap
The only user of frontswap is zswap, and has been for a long time. Have swap call into zswap directly and remove the indirection.
[[email protected]: remove obsolete comment, per Yosry]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: don't warn if none swapcache folio is passed to zswap_load]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Yin Fengwei <[email protected]>
Acked-by: Konrad Rzeszutek Wilk <[email protected]>
Acked-by: Nhat Pham <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>