History log of /linux-6.15/Documentation/admin-guide/sysctl/vm.rst (Results 1 – 25 of 28)
Revision    Date    Author    Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7
# e3aa7df3 13-Mar-2025 Johannes Weiner <[email protected]>

mm: page_alloc: defrag_mode

The page allocator groups requests by migratetype to stave off
fragmentation. However, in practice this is routinely defeated by the
fact that it gives up *before* invoking reclaim and compaction - which may
well produce suitable pages. As a result, fragmentation of physical
memory is a common ongoing process in many load scenarios.

Fragmentation deteriorates compaction's ability to produce huge pages.
Depending on the lifetime of the fragmenting allocations, those effects
can be long-lasting or even permanent, requiring drastic measures like
forcible idle states or even reboots as the only reliable ways to recover
the address space for THP production.

In a kernel build test with supplemental THP pressure, the THP allocation
rate steadily declines over 15 runs:

thp_fault_alloc
61988
56474
57258
50187
52388
55409
52925
47648
43669
40621
36077
41721
36685
34641
33215

This is a hurdle in adopting THP in any environment where hosts are shared
between multiple overlapping workloads (cloud environments), and rarely
experience true idle periods. To make THP a reliable and predictable
optimization, there needs to be a stronger guarantee to avoid such
fragmentation.

Introduce defrag_mode. When enabled, reclaim/compaction is invoked to its
full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT is
enforced on the allocator fastpath and the reclaiming slowpath.

For now, fallbacks are permitted to avert OOMs. There is a plan to add
defrag_mode=2 to prefer OOMs over fragmentation, but this requires
additional prep work in compaction and the reserve management to make it
ready for all possible allocation contexts.
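
As a quick illustration, a minimal sketch of flipping the mode from a
shell (assuming the knob is exposed as vm.defrag_mode, per this series;
verify against your kernel's vm.rst):

# favor full reclaim/compaction over migratetype fallbacks
sysctl vm.defrag_mode=1
# default behavior: fall back before reclaiming
sysctl vm.defrag_mode=0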

The following test results are from a kernel build with periodic bursts of
THP allocations, over 15 runs:

vanilla defrag_mode=1
@claimer[unmovable]: 189 103
@claimer[movable]: 92 103
@claimer[reclaimable]: 207 61
@pollute[unmovable from movable]: 25 0
@pollute[unmovable from reclaimable]: 28 0
@pollute[movable from unmovable]: 38835 0
@pollute[movable from reclaimable]: 147136 0
@pollute[reclaimable from unmovable]: 178 0
@pollute[reclaimable from movable]: 33 0
@steal[unmovable from movable]: 11 0
@steal[unmovable from reclaimable]: 5 0
@steal[reclaimable from unmovable]: 107 0
@steal[reclaimable from movable]: 90 0
@steal[movable from reclaimable]: 354 0
@steal[movable from unmovable]: 130 0

Both types of polluting fallbacks are eliminated in this workload.

Interestingly, whole block conversions are reduced as well. This is
because once a block is claimed for a type, its empty space remains
available for future allocations, instead of being padded with fallbacks;
this allows the native type to group up instead of spreading out to new
blocks. The assumption in the allocator has been that pollution from
movable allocations is less harmful than from other types, since they can
be reclaimed or migrated out should the space be needed. However, since
fallbacks occur *before* reclaim/compaction is invoked, movable pollution
will still cause non-movable allocations to spread out and claim more
blocks.

Without fragmentation, THP rates hold steady with defrag_mode=1:

thp_fault_alloc
32478
20725
45045
32130
14018
21711
40791
29134
34458
45381
28305
17265
22584
28454
30850

While the downward trend is eliminated, the keen reader will of course
notice that the baseline rate is much smaller than the vanilla kernel's to
begin with. This is due to deficiencies in how reclaim and compaction are
currently driven: ALLOC_NOFRAGMENT increases the extent to which smaller
allocations are competing with THPs for pageblocks, while making no effort
themselves to reclaim or compact beyond their own request size. This
effect already exists with the current usage of ALLOC_NOFRAGMENT, but is
amplified by defrag_mode insisting on whole block stealing much more
strongly.

Subsequent patches will address defrag_mode reclaim strategy to raise the
THP success baseline above the vanilla kernel.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6
# 44195d1e 26-Jun-2024 Jiaqi Yan <[email protected]>

docs: mm: add enable_soft_offline sysctl

Add the documentation for soft offline behaviors / costs, and what the new
enable_soft_offline sysctl is for.
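
For reference, a minimal sketch of how an administrator might toggle the
new knob (assuming it is exposed as vm.enable_soft_offline, with 1
enabling and 0 disabling soft offline):

# keep pages with corrected memory errors mapped rather than soft-offlining them
sysctl vm.enable_soft_offline=0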

[[email protected]: fix kerneldoc warnings]
Link: https://lkml.kernel.org/r/CACw3F52=GxTCDw-PqFh3-GDM-fo3GbhGdu0hedxYXOTT4TQSTg@mail.gmail.com
[[email protected]: there are more blank lines needed]
Link: https://lkml.kernel.org/r/CACw3F52_obAB742XeDRNun4BHBYtrxtbvp5NkUincXdaob0j1g@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Acked-by: Oscar Salvador <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Frank van der Linden <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1
# 22d407b1 21-Mar-2024 Suren Baghdasaryan <[email protected]>

lib: add allocation tagging support for memory allocation profiling

Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily
instrument memory allocators. It registers an "alloc_tags" codetag type
with /proc/allocinfo interface to output allocation tag information when
the feature is enabled.

CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
allocation profiling instrumentation.

Memory allocation profiling can be enabled or disabled at runtime using
/proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n.
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation
profiling by default.
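
A minimal sketch of toggling the feature at runtime, using only the
interfaces named above (/proc/sys/vm/mem_profiling and /proc/allocinfo):

# enable memory allocation profiling at runtime
echo 1 > /proc/sys/vm/mem_profiling
# inspect per-callsite allocation information
head /proc/allocinfo
# disable it again
echo 0 > /proc/sys/vm/mem_profiling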

[[email protected]: Documentation/filesystems/proc.rst: fix allocinfo title]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: do limited memory accounting for modules with ARCH_NEEDS_WEAK_PER_CPU]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: explicitly include irqflags.h in alloc_tag.h]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix alloc_tag_init() to prevent passing NULL to PTR_ERR()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Klara Modin <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4
# d17ff438 29-Aug-2022 Vratislav Bendel <[email protected]>

docs: mm: fix vm overcommit documentation for OVERCOMMIT_GUESS

Commit 8c7829b04c52 "mm: fix false-positive OVERCOMMIT_GUESS failures"
changed the behavior of the default OVERCOMMIT_GUESS setting.
Reflect the change also in the Documentation, namely files:
Documentation/admin-guide/sysctl/vm.rst
Documentation/mm/overcommit-accounting.rst
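
For context, OVERCOMMIT_GUESS is the default policy, selected by value 0
of the overcommit sysctl:

# heuristic overcommit (OVERCOMMIT_GUESS), the kernel default
sysctl vm.overcommit_memory=0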

Reported-by: Jozef Bacik <[email protected]>
Signed-off-by: Vratislav Bendel <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Signed-off-by: Jonathan Corbet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]



# dbeb56fe 29-Jan-2023 Randy Dunlap <[email protected]>

Documentation: admin-guide: correct spelling

Correct spelling problems for Documentation/admin-guide/ as reported
by codespell.

Signed-off-by: Randy Dunlap <[email protected]>
Reviewed-by: Mukesh Ojha <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: [email protected]
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: [email protected]
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>



Revision tags: v6.0-rc3, v6.0-rc2, v6.0-rc1
# 816284a3 08-Aug-2022 Axel Rasmussen <[email protected]>

userfaultfd: update documentation to describe /dev/userfaultfd

Explain the different ways to create a new userfaultfd, and how access
control works for each way.
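
As a sketch of the device-node path: access to /dev/userfaultfd is
governed by ordinary file permissions, so an administrator might grant
it like this (the 'uffd' group is hypothetical):

# allow members of a hypothetical 'uffd' group to create userfaultfds
chgrp uffd /dev/userfaultfd
chmod g+rw /dev/userfaultfd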

[[email protected]: improve wording in documentation, per Mike]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Axel Rasmussen <[email protected]>
Acked-by: Peter Xu <[email protected]>
Reviewed-by: Shuah Khan <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry V. Levin <[email protected]>
Cc: Gleb Fotengauer-Malinovskiy <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zhang Yi <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5
# dff03381 28-Jun-2022 Muchun Song <[email protected]>

mm: hugetlb_vmemmap: introduce the name HVO

It is inconvenient to mention the feature of optimizing vmemmap pages
associated with HugeTLB pages when communicating with others, since no
specific or abbreviated name was given to it when it was first
introduced. Let us give it the name HVO (HugeTLB Vmemmap Optimization)
from now on.

This commit also updates the documentation about "hugetlb_free_vmemmap"
in the way discussed in thread [1].

Link: https://lore.kernel.org/all/[email protected]/ [1]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v5.19-rc4, v5.19-rc3
# 66361095 17-Jun-2022 Muchun Song <[email protected]>

mm: memory_hotplug: make hugetlb_optimize_vmemmap compatible with memmap_on_memory

For now, the feature of hugetlb_free_vmemmap is not compatible with the
feature of memory_hotplug.memmap_on_memory, and hugetlb_free_vmemmap
takes precedence over memory_hotplug.memmap_on_memory. However, some
users want memory_hotplug.memmap_on_memory to take precedence over
hugetlb_free_vmemmap, since memmap_on_memory makes memory hotplug more
likely to succeed in close-to-OOM situations. So the decision to make
hugetlb_free_vmemmap take precedence is neither wise nor elegant.

The proper approach is to have hugetlb_vmemmap.c check whether the
section to which the HugeTLB pages belong can be optimized. If the
section's vmemmap pages are allocated from the added memory block
itself, hugetlb_free_vmemmap should refuse to optimize the vmemmap;
otherwise, it should do the optimization. Then both kernel parameters
are compatible. So this patch introduces VmemmapSelfHosted to mask any
non-optimizable vmemmap pages. The hugetlb_vmemmap code can use this
flag to detect whether a vmemmap page can be optimized.
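
With the compatibility in place, both boot parameters named above can
simply be requested together; a sketch of a grub fragment:

# enable HugeTLB vmemmap optimization and self-hosted hotplug memmaps together
GRUB_CMDLINE_LINUX="hugetlb_free_vmemmap=on memory_hotplug.memmap_on_memory=on"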

[[email protected]: walk vmemmap page tables to avoid false-positive]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Co-developed-by: Oscar Salvador <[email protected]>
Signed-off-by: Oscar Salvador <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# ee65728e 27-Jun-2022 Mike Rapoport <[email protected]>

docs: rename Documentation/vm to Documentation/mm

so it will be consistent with code mm directory and with
Documentation/admin-guide/mm and won't be confused with virtual machines.

Signed-off-by: Mike Rapoport <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Tested-by: Ira Weiny <[email protected]>
Acked-by: Jonathan Corbet <[email protected]>
Acked-by: Wu XiangCheng <[email protected]>



Revision tags: v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7
# 78f39084 13-May-2022 Muchun Song <[email protected]>

mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl

We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and
reboot the server to enable or disable the feature of optimizing vmemmap
pages associated with HugeTLB pages. However, rebooting usually takes a
long time. So add a sysctl to enable or disable the feature at runtime
without rebooting. Why do we need this? There are three use cases.

1) The feature of minimizing overhead of struct page associated with
each HugeTLB is disabled by default without passing
"hugetlb_free_vmemmap=on" to the boot cmdline. When we (ByteDance)
deliver the servers to the users who want to enable this feature, they
have to configure the grub (change boot cmdline) and reboot the
servers, whereas rebooting usually takes a long time (we have thousands
of servers). It's a very bad experience for the users. So we need an
approach to enable this feature without rebooting. This is a use case
from our production environment.

2) In some use cases, HugeTLB pages are allocated 'on the fly' instead
of being pulled from the HugeTLB pool; those workloads would be
affected with this feature enabled. Such workloads can be identified
by the characteristic that they never explicitly allocate huge pages
with 'nr_hugepages' but only set 'nr_overcommit_hugepages' and then
let the pages be allocated from the buddy allocator at fault time. We
can confirm this is a real use case from commit 099730d67417. For
those workloads, the page fault time could be ~2x slower than before.
We suspect those users would want to disable this feature if the
system has enabled it before and they don't think the memory savings
benefit is enough to make up for the performance drop.

3) The workload which wants vmemmap pages to be optimized and the
workload which sets 'nr_overcommit_hugepages' and does not want the
extra overhead at fault time when the overcommitted pages are
allocated from the buddy allocator may be deployed on the same server.
The user could enable this feature, set 'nr_hugepages' and
'nr_overcommit_hugepages', and then disable the feature. In this case,
the overcommitted HugeTLB pages will not encounter the extra overhead
at fault time (see the sketch below).
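
A sketch of the use case 3 sequence with the sysctl this patch adds
(/proc/sys/vm/hugetlb_optimize_vmemmap; the pool sizes are made-up
examples):

# optimize the vmemmap of pages allocated into the pool
echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
echo 1024 > /proc/sys/vm/nr_hugepages
echo 256 > /proc/sys/vm/nr_overcommit_hugepages
# disable it so later overcommitted pages avoid the fault-time overhead
echo 0 > /proc/sys/vm/hugetlb_optimize_vmemmap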

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v5.18-rc6, v5.18-rc5
# 8d98e42f 29-Apr-2022 Joel Savitz <[email protected]>

Documentation/sysctl: document page_lock_unfairness

commit 5ef64cc8987a ("mm: allow a controlled amount of unfairness in the
page lock") introduced a new systctl but no accompanying documentation.

Add a simple entry to the documentation.
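
For reference, a one-line sketch of tuning the knob (the value is the
number of times the page lock may be stolen from under a waiter; 5 is
the default):

sysctl vm.page_lock_unfairness=5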

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joel Savitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: "zhangyi (F)" <[email protected]>
Cc: Charan Teja Reddy <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1
# 39c65a94 14-Jan-2022 Suren Baghdasaryan <[email protected]>

mm/pagealloc: sysctl: change watermark_scale_factor max limit to 30%

For embedded systems with low total memory that have to run
applications with relatively large memory requirements, the 10% max
limit on watermark_scale_factor poses the issue of triggering direct
reclaim every time such an application is started. This results in
slow application startup times and a bad end-user experience.

By increasing watermark_scale_factor max limit we allow vendors more
flexibility to choose the right level of kswapd aggressiveness for their
device and workload requirements.
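
A sketch of using the raised limit (the value is in units of 0.01% of
memory, so the new maximum of 3000 corresponds to 30%):

# keep kswapd far more aggressive than the old 10% ceiling allowed
sysctl vm.watermark_scale_factor=2000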

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Lukas Middendorf <[email protected]>
Cc: Antti Palosaari <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Zhang Yi <[email protected]>
Cc: Fengfei Xi <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>



Revision tags: v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1
# 65d759c8 02-Sep-2021 Charan Teja Reddy <[email protected]>

mm: compaction: support triggering of proactive compaction by user

Proactive compaction[1] is triggered every 500msec and runs compaction
on the node for COMPACTION_HPAGE_ORDER (usually order-9) pages based on
the value set in sysctl.compaction_proactiveness. Triggering compaction
every 500msec in search of COMPACTION_HPAGE_ORDER pages is not needed
for all applications, especially for embedded-system use cases which
may have only a few MB of RAM. Enabling proactive compaction in that
state will end up with it running almost always on such systems.

On the other hand, proactive compaction can still be very useful for
getting a set of higher-order pages in a controllable manner
(controlled via sysctl.compaction_proactiveness). So, on systems where
always-on proactive compaction may not be required, users can trigger
it from user space by writing to its sysctl interface. As an example,
say an app launcher decides to launch a memory-heavy application which
can be launched faster if it gets more higher-order pages; the launcher
can prepare the system in advance by triggering proactive compaction
from userspace.

This triggering of proactive compaction is done on a write to
sysctl.compaction_proactiveness by the user.
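
A sketch of the trigger: any write to the sysctl kicks off proactive
compaction, even when rewriting the current value:

# prepare the system right before launching a memory-heavy application
echo 20 > /proc/sys/vm/compaction_proactiveness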

[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=facdaa917c4d5a376d09d25865f5a863f906234a

[[email protected]: tweak vm.rst, per Mike]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Charan Teja Reddy <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Rafael Aquini <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Vinayak Menon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>



Revision tags: v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1
# 48d9f335 29-Jun-2021 Mike Rapoport <[email protected]>

docs: remove description of DISCONTIGMEM

Remove description of DISCONTIGMEM from the "Memory Models" document and
update the VM sysctl description so that it won't mention DISCONTIGMEM.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Vineet Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>



# 74f44822 29-Jun-2021 Mel Gorman <[email protected]>

mm/page_alloc: introduce vm.percpu_pagelist_high_fraction

This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
similar to the old vm.percpu_pagelist_fraction. The old sysctl increased
both pcp->batch and pcp->high with the higher pcp->high potentially
reducing zone->lock contention. However, the higher pcp->batch value also
potentially increased allocation latency while the PCP was refilled. This
sysctl only adjusts pcp->high so that zone->lock contention is potentially
reduced but allocation latency during a PCP refill remains the same.

# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 649
batch: 63

# sysctl vm.percpu_pagelist_high_fraction=8
# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 35071
batch: 63

# sysctl vm.percpu_pagelist_high_fraction=64
# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 4383
batch: 63

# sysctl vm.percpu_pagelist_high_fraction=0
# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 649
batch: 63

[[email protected]: fix documentation]
Link: https://lkml.kernel.org/r/[email protected]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>



# bbbecb35 29-Jun-2021 Mel Gorman <[email protected]>

mm/page_alloc: delete vm.percpu_pagelist_fraction

Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2.

The per-cpu page allocator (PCP) is meant to reduce contention on the zone
lock, but the sizing of batch and high is archaic: it takes into account
neither the zone size nor the number of CPUs local to a zone. With larger
zones and more CPUs per node, the contention is getting worse.
Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both batch
and high values means that the sysctl can reduce zone lock contention but
also increase allocation latencies.

This series disassociates pcp->high from pcp->batch and then scales
pcp->high based on the size of the local zone with limited impact to
reclaim and accounting for active CPUs but leaves pcp->batch static. It
also adapts the number of pages that can be on the pcp list based on
recent freeing patterns.

The motivation is partially to adjust to larger memory sizes but is also
driven by the fact that large batches of page freeing via release_pages()
often shows zone contention as a major part of the problem. Another is a
bug report based on an older kernel where a multi-terabyte process can
take several minutes to exit. A workaround was to use
vm.percpu_pagelist_fraction to increase the pcp->high value but testing
indicated that a production workload could not use the same values because
of an increase in allocation latencies. Unfortunately, I cannot reproduce
this test case myself as the multi-terabyte machines are in active use but
it should alleviate the problem.

The series aims to address both and partially acts as a pre-requisite.
pcp only works with order-0 which is useless for SLUB (when using high
orders) and THP (unconditionally). To store high-order pages on PCP, the
pcp->high values need to be increased first.

This patch (of 6):

The vm.percpu_pagelist_fraction is used to increase the batch and high
limits for the per-cpu page allocator (PCP). The intent behind the sysctl
is to reduce zone lock acquisition when allocating/freeing pages but it
has a problem. While it can decrease contention, it can also increase
latency on the allocation side due to unreasonably large batch sizes.
This leads to games where an administrator adjusts
percpu_pagelist_fraction on the fly to work around contention and
allocation latency problems.

This series aims to alleviate the problems with zone lock contention while
avoiding the allocation-side latency problems. For the purposes of
review, it's easier to remove this sysctl now and reintroduce a similar
sysctl later in the series that deals only with pcp->high.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>



Revision tags: v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3
# 51fd43e2 13-Mar-2021 zhangyi (F) <[email protected]>

block_dump: remove comments in docs

Now that the block_dump feature is gone, remove all mentions of it in the docs.

Signed-off-by: zhangyi (F) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse
# 51998364 24-Feb-2021 Dave Hansen <[email protected]>

mm/vmscan: restore zone_reclaim_mode ABI

I went to go add a new RECLAIM_* mode for the zone_reclaim_mode sysctl.
Like a good kernel developer, I also went to go update the
documentation. I noticed that the bits in the documentation didn't
match the bits in the #defines.

The VM never explicitly checks the RECLAIM_ZONE bit. The bit is,
however, implicitly checked when testing 'node_reclaim_mode==0'. The
RECLAIM_ZONE #define was removed in a cleanup. That, by itself, is fine.

But, when the bit was removed (bit 0) the _other_ bit locations also got
changed. That's not OK because the bit values are documented to mean
one specific thing. Users surely do not expect the meaning to change
from kernel to kernel.

The end result is that if someone had a script that did:

sysctl vm.zone_reclaim_mode=1

it would have gone from enabling node reclaim for clean unmapped pages
to writing out pages during node reclaim after the commit in question.
That's not great.

Put the bits back the way they were and add a comment so something like
this is a bit harder to do again. Update the documentation to make it
clear that the first bit is ignored.
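
For reference, the restored bit values as documented (1 = enable node
reclaim, 2 = write out dirty pages during reclaim, 4 = swap/unmap
pages), so for example:

# node reclaim of clean unmapped pages plus dirty-page writeout
sysctl vm.zone_reclaim_mode=3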

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dave Hansen <[email protected]>
Fixes: 648b5cf368e0 ("mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE")
Reviewed-by: Ben Widawsky <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Daniel Wagner <[email protected]>
Cc: "Tobin C. Harding" <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>



Revision tags: v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10
# c635b0ce 10-Dec-2020 Fengfei Xi <[email protected]>

docs: admin-guide: Fix default value of max_map_count in sysctl/vm.rst

Since the default value of sysctl_max_map_count is defined as
DEFAULT_MAX_MAP_COUNT from mm/util.c

int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;

DEFAULT_MAX_MAP_COUNT is defined as 65530 (65535-5) in include/linux/mm.h

#define MAPCOUNT_ELF_CORE_MARGIN (5)
#define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
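
The corrected default can be confirmed from a shell:

# prints "vm.max_map_count = 65530" on a default configuration
sysctl vm.max_map_count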

Signed-off-by: Fengfei Xi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>



# d0d4730a 15-Dec-2020 Lokesh Gidra <[email protected]>

userfaultfd: add user-mode only option to unprivileged_userfaultfd sysctl knob

With this change, when the knob is set to 0, it allows unprivileged users
to call userfaultfd, like when it is set to 1, but with the restriction
that only page faults from user mode can be handled. In this mode, an
unprivileged user (without the CAP_SYS_PTRACE capability) must pass
UFFD_USER_MODE_ONLY to userfaultfd or the API will fail with EPERM.

This enables administrators to reduce the likelihood that an attacker with
access to userfaultfd can delay faulting kernel code to widen timing
windows for other exploits.

The default value of this knob is changed to 0. This is required for
correct functioning of pipe mutex. However, this will fail postcopy
live migration, which will be unnoticeable to the VM guests. To avoid
this, set 'vm.unprivileged_userfaultfd = 1' in /etc/sysctl.conf.
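
A sketch of that workaround for hosts relying on postcopy live
migration:

# restore handling of kernel-mode faults for unprivileged userfaultfd users
sysctl vm.unprivileged_userfaultfd=1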

The main reason this change is desirable in the short term is that
Android userland will behave as if the sysctl were set to zero. So
without this commit, any Linux binary using userfaultfd to manage its
memory would behave differently when run within the Android userland.
For more details, refer to Andrea's reply [1].

[1] https://lore.kernel.org/lkml/[email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lokesh Gidra <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Stephen Smalley <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Daniel Colascione <[email protected]>
Cc: "Joel Fernandes (Google)" <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Jeff Vander Stoep <[email protected]>
Cc: <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Daniel Colascione <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>



Revision tags: v5.10-rc7
# 751d5b27 04-Dec-2020 Andrew Klychkov <[email protected]>

Documentation: fix multiple typos found in the admin-guide subdirectory

Fix thirty-five typos in dm-integrity.rst, dm-raid.rst, dm-zoned.rst,
verity.rst, writecache.rst, tsx_async_abort.rst, md.rst, bttv.rst,
dvb_references.rst, frontend-cardlist.rst, gspca-cardlist.rst, ipu3.rst,
remote-controller.rst, mm/index.rst, numaperf.rst, userfaultfd.rst,
module-signing.rst, imx-ddr.rst, intel-speed-select.rst,
intel_pstate.rst, ramoops.rst, abi.rst, kernel.rst, vm.rst

Signed-off-by: Andrew Klychkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>



Revision tags: v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1
# 62af6964 22-Oct-2020 Fam Zheng <[email protected]>

docs: Add two missing entries in vm sysctl index

Both seem to have been overlooked when their sections were added to the main content.

Signed-off-by: Fam Zheng <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>



Revision tags: v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1
# facdaa91 12-Aug-2020 Nitin Gupta <[email protected]>

mm: proactive compaction

For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented. The Linux kernel currently does on-demand
compaction as we request more hugepages, but this style of compaction
incurs very high latency. Experiments with one-time full memory
compaction (followed by hugepage allocations) show that the kernel is
able to restore a highly fragmented memory state to a fairly compacted
memory state within <1 sec for a 32G system. Such data suggests that a
more proactive compaction can help us allocate a large fraction of
memory as hugepages while keeping allocation latencies low.

For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.

The tunable takes a value in range [0, 100], with a default of 20.

Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl. Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate. The internal interpretation of this opaque
value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation. A zone's present_pages determines its weight.

To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same. If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value. By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.

This patch is largely based on ideas from Michal Hocko [2]. See also the
LWN article [3].

Performance data
================

System: x86_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap. The workload is mainly anonymous
userspace pages, which are easy to move around. I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.

1. Kernel hugepage allocation latencies

With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:

(all latency values are in microseconds)

- With vanilla 5.6.0-rc3

percentile latency
–––––––––– –––––––
5 7894
10 9496
25 12561
30 15295
40 18244
50 21229
60 27556
75 30147
80 31047
90 32859
95 33799

Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

- With 5.6.0-rc3 + this patch, with proactiveness=20

sysctl -w vm.compaction_proactiveness=20

percentile latency
–––––––––– –––––––
5 2
10 2
25 3
30 3
40 3
50 4
60 4
75 4
80 4
90 5
95 429

Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

2. JAVA heap allocation

In this test, we first fragment memory using the same method as for (1).

Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages. We also set THP to madvise to
allow hugepage backing of this heap.

/usr/bin/time
java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.

- With vanilla 5.6.0-rc3

17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20

8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15, as proactiveness is further increased.

Note that proactive compaction happens throughout the runtime of these
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80). Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. A kcompactd thread consumes 100% of one
of the CPUs while it tries to bring the node's score within thresholds.

Backoff behavior
================

Above workloads produce a memory state which is easy to compact. However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
(=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Oleksandr Natalenko <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Khalid Aziz <[email protected]>
Reviewed-by: Oleksandr Natalenko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Oleksandr Natalenko <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>



Revision tags: v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5, v5.8-rc4, v5.8-rc3
# 800c02f5 23-Jun-2020 Mauro Carvalho Chehab <[email protected]>

docs: move nommu-mmap.txt to admin-guide and rename to ReST

The nommu-mmap.txt file provides a description of user-visible
behaviour, so move it to the admin-guide.

As it is already in ReST format, also rename it accordingly.

Suggested-by: Mike Rapoport <[email protected]>
Suggested-by: Jonathan Corbet <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Link: https://lore.kernel.org/r/3a63d1833b513700755c85bf3bda0a6c4ab56986.1592918949.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <[email protected]>



Revision tags: v5.8-rc2, v5.8-rc1
# c843966c 03-Jun-2020 Johannes Weiner <[email protected]>

mm: allow swappiness that prefers reclaiming anon over the file workingset

With the advent of fast random IO devices (SSDs, PMEM) and in-memory swap
devices such as zswap, it's possible for swap to be much faster than
filesystems, and for swapping to be preferable over thrashing filesystem
caches.

Allow setting swappiness - which defines the rough relative IO cost of
cache misses between page cache and swap-backed pages - to reflect such
situations by making the swap-preferred range configurable.
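
With the swap-preferred range, values above 100 express that swap-in is
cheaper than a page cache miss (a sketch; after this change the maximum
is 200):

# strongly prefer reclaiming anon pages to a fast swap device such as zswap
sysctl vm.swappiness=180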

Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Rik van Riel <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>


