History log of /linux-6.15/include/linux/mmzone.h (Results 1 – 25 of 553)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3
# 38448181 16-Apr-2025 Johannes Weiner <[email protected]>

mm: vmscan: restore high-cpu watermark safety in kswapd

Vlastimil points out that commit a211c6550efc ("mm: page_alloc:
defrag_mode kswapd/kcompactd watermarks") switched kswapd from
zone_watermark_

mm: vmscan: restore high-cpu watermark safety in kswapd

Vlastimil points out that commit a211c6550efc ("mm: page_alloc:
defrag_mode kswapd/kcompactd watermarks") switched kswapd from
zone_watermark_ok_safe() to the standard, percpu-cached version of reading
free pages, thus dropping the watermark safety precautions for systems
with high CPU counts (e.g. >212 cpus on 64G). Restore them.

Since zone_watermark_ok_safe() is no longer the right interface, and this
was the last caller of the function anyway, open-code the
zone_page_state_snapshot() conditional and delete the function.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks")
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Vlastimil Babka <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Brendan Jackman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.15-rc2, v6.15-rc1
# 4067196a 29-Mar-2025 Kirill A. Shutemov <[email protected]>

mm/page_alloc: fix deadlock on cpu_hotplug_lock in __accept_page()

When the last page in the zone is accepted, __accept_page() calls
static_branch_dec(). This function takes cpu_hotplug_lock, which

mm/page_alloc: fix deadlock on cpu_hotplug_lock in __accept_page()

When the last page in the zone is accepted, __accept_page() calls
static_branch_dec(). This function takes cpu_hotplug_lock, which can lead
to a deadlock if the allocation occurs during CPU bringup path as
_cpu_up() also takes the lock.

To prevent this deadlock, defer static_branch_dec() to a workqueue.

Call static_branch_dec() only when the workqueue is not yet initialized.
Workqueues are initialized before CPU bring up, so this will not conflict
with the first scenario.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 55ad43e8ba0f ("mm: add a helper to accept page")
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Srikanth Aithal <[email protected]>
Tested-by: Srikanth Aithal <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ashish Kalra <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Edgecombe, Rick P" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: "Mike Rapoport (IBM)" <[email protected]>
Cc: Thomas Lendacky <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.14
# e452872b 18-Mar-2025 Hao Jia <[email protected]>

mm: vmscan: split proactive reclaim statistics from direct reclaim statistics

Patch series "Adding Proactive Memory Reclaim Statistics".

These two patches are related to proactive memory reclaim.

mm: vmscan: split proactive reclaim statistics from direct reclaim statistics

Patch series "Adding Proactive Memory Reclaim Statistics".

These two patches are related to proactive memory reclaim.

Patch 1 Split proactive reclaim statistics from direct reclaim counters
and introduces new counters: pgsteal_proactive, pgdemote_proactive,
and pgscan_proactive.

Patch 2 Adds pswpin and pswpout items to the cgroup-v2 documentation.


This patch (of 2):

In proactive memory reclaim scenarios, it is necessary to accurately track
proactive reclaim statistics to dynamically adjust the frequency and
amount of memory being reclaimed proactively. Currently, proactive
reclaim is included in direct reclaim statistics, which can make these
direct reclaim statistics misleading.

Therefore, separate proactive reclaim memory from the direct reclaim
counters by introducing new counters: pgsteal_proactive,
pgdemote_proactive, and pgscan_proactive, to avoid confusion with direct
reclaim.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hao Jia <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.14-rc7
# 835de376 14-Mar-2025 Nico Pache <[email protected]>

meminfo: add a per node counter for balloon drivers

Patch series "track memory used by balloon drivers", v2.

This series introduces a way to track memory used by balloon drivers.

Add a NR_BALLOON_

meminfo: add a per node counter for balloon drivers

Patch series "track memory used by balloon drivers", v2.

This series introduces a way to track memory used by balloon drivers.

Add a NR_BALLOON_PAGES counter to track how many pages are reclaimed by
the balloon drivers. First add the accounting, then updates the balloon
drivers (virtio, Hyper-V, VMware, Pseries-cmm, and Xen) to maintain this
counter. The virtio, Vmware, and pseries-cmm balloon drivers utilize the
balloon_compaction interface to allocate and free balloon pages. Other
balloon drivers will have to maintain this counter manually.

This makes the information visible in memory reporting interfaces like
/proc/meminfo, show_mem, and OOM reporting.

This provides admins visibility into their VM balloon sizes without
requiring different virtualization tooling. Furthermore, this information
is helpful when debugging an OOM inside a VM.


This patch (of 4):

Add NR_BALLOON_PAGES counter to track memory used by balloon drivers and
expose it through /proc/meminfo and other memory reporting interfaces.

[[email protected]: document Balloon Meminfo entry]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nico Pache <[email protected]>
Cc: Alexander Atanasov <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Juegren Gross <[email protected]>
Cc: Kanchana P Sridhar <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Oleksandr Tyshchenko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Cc: Wei Liu <[email protected]>
Cc: Michael Kelley <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# a211c655 13-Mar-2025 Johannes Weiner <[email protected]>

mm: page_alloc: defrag_mode kswapd/kcompactd watermarks

The previous patch added pageblock_order reclaim to kswapd/kcompactd,
which helps, but produces only one block at a time. Allocation stalls a

mm: page_alloc: defrag_mode kswapd/kcompactd watermarks

The previous patch added pageblock_order reclaim to kswapd/kcompactd,
which helps, but produces only one block at a time. Allocation stalls and
THP failure rates are still higher than they could be.

To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change the
watermarking for kswapd & kcompactd: instead of targeting the high
watermark in order-0 pages and checking for one suitable block, simply
require that the high watermark is entirely met in pageblocks.

To this end, track the number of free pages within contiguous pageblocks,
then change pgdat_balanced() and compact_finished() to check watermarks
against this new value.

This further reduces THP latencies and allocation stalls, and improves THP
success rates against the previous patch:

DEFRAGMODE-ASYNC DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean 34300.36 ( +0.00%) 28904.00 ( -15.73%)
Hugealloc Time stddev 36390.42 ( +0.00%) 33464.37 ( -8.04%)
Kbuild Real time 196.13 ( +0.00%) 196.59 ( +0.23%)
Kbuild User time 1234.74 ( +0.00%) 1231.67 ( -0.25%)
Kbuild System time 62.62 ( +0.00%) 59.10 ( -5.54%)
THP fault alloc 57054.53 ( +0.00%) 63223.67 ( +10.81%)
THP fault fallback 11581.40 ( +0.00%) 5412.47 ( -53.26%)
Direct compact fail 107.80 ( +0.00%) 59.07 ( -44.79%)
Direct compact success 4.53 ( +0.00%) 2.80 ( -31.33%)
Direct compact success rate % 3.20 ( +0.00%) 3.99 ( +18.66%)
Compact daemon scanned migrate 5461033.93 ( +0.00%) 2267500.33 ( -58.48%)
Compact daemon scanned free 5824897.93 ( +0.00%) 2339773.00 ( -59.83%)
Compact direct scanned migrate 58336.93 ( +0.00%) 47659.93 ( -18.30%)
Compact direct scanned free 32791.87 ( +0.00%) 40729.67 ( +24.21%)
Compact total migrate scanned 5519370.87 ( +0.00%) 2315160.27 ( -58.05%)
Compact total free scanned 5857689.80 ( +0.00%) 2380502.67 ( -59.36%)
Alloc stall 2424.60 ( +0.00%) 638.87 ( -73.62%)
Pages kswapd scanned 2657018.33 ( +0.00%) 4002186.33 ( +50.63%)
Pages kswapd reclaimed 559583.07 ( +0.00%) 718577.80 ( +28.41%)
Pages direct scanned 722094.07 ( +0.00%) 355172.73 ( -50.81%)
Pages direct reclaimed 107257.80 ( +0.00%) 31162.80 ( -70.95%)
Pages total scanned 3379112.40 ( +0.00%) 4357359.07 ( +28.95%)
Pages total reclaimed 666840.87 ( +0.00%) 749740.60 ( +12.43%)
Swap out 77238.20 ( +0.00%) 110084.33 ( +42.53%)
Swap in 11712.80 ( +0.00%) 24457.00 ( +108.80%)
File refaults 143438.80 ( +0.00%) 188226.93 ( +31.22%)

Also of note is that compaction work overall is reduced. The reason for
this is that when free pageblocks are more readily available, allocations
are also much more likely to get physically placed in LRU order, instead
of being forced to scavenge free space here and there. This means that
reclaim by itself has better chances of freeing up whole blocks, and the
system relies less on compaction.

Comparing all changes to the vanilla kernel:

VANILLA DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean 52739.45 ( +0.00%) 28904.00 ( -45.19%)
Hugealloc Time stddev 56541.26 ( +0.00%) 33464.37 ( -40.81%)
Kbuild Real time 197.47 ( +0.00%) 196.59 ( -0.44%)
Kbuild User time 1240.49 ( +0.00%) 1231.67 ( -0.71%)
Kbuild System time 70.08 ( +0.00%) 59.10 ( -15.45%)
THP fault alloc 46727.07 ( +0.00%) 63223.67 ( +35.30%)
THP fault fallback 21910.60 ( +0.00%) 5412.47 ( -75.29%)
Direct compact fail 195.80 ( +0.00%) 59.07 ( -69.48%)
Direct compact success 7.93 ( +0.00%) 2.80 ( -57.46%)
Direct compact success rate % 3.51 ( +0.00%) 3.99 ( +10.49%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 2267500.33 ( -32.71%)
Compact daemon scanned free 5075474.47 ( +0.00%) 2339773.00 ( -53.90%)
Compact direct scanned migrate 161787.27 ( +0.00%) 47659.93 ( -70.54%)
Compact direct scanned free 163467.53 ( +0.00%) 40729.67 ( -75.08%)
Compact total migrate scanned 3531388.53 ( +0.00%) 2315160.27 ( -34.44%)
Compact total free scanned 5238942.00 ( +0.00%) 2380502.67 ( -54.56%)
Alloc stall 2371.07 ( +0.00%) 638.87 ( -73.02%)
Pages kswapd scanned 2160926.73 ( +0.00%) 4002186.33 ( +85.21%)
Pages kswapd reclaimed 533191.07 ( +0.00%) 718577.80 ( +34.77%)
Pages direct scanned 400450.33 ( +0.00%) 355172.73 ( -11.31%)
Pages direct reclaimed 94441.73 ( +0.00%) 31162.80 ( -67.00%)
Pages total scanned 2561377.07 ( +0.00%) 4357359.07 ( +70.12%)
Pages total reclaimed 627632.80 ( +0.00%) 749740.60 ( +19.46%)
Swap out 47959.53 ( +0.00%) 110084.33 ( +129.53%)
Swap in 7276.00 ( +0.00%) 24457.00 ( +236.10%)
File refaults 138043.00 ( +0.00%) 188226.93 ( +36.35%)

THP allocation latencies and %sys time are down dramatically.

THP allocation failures are down from nearly 50% to 8.5%. And to recall
previous data points, the success rates are steady and reliable without
the cumulative deterioration of fragmentation events.

Compaction work is down overall. Direct compaction work especially is
drastically reduced. As an aside, its success rate of 4% indicates there
is room for improvement. For now it's good to rely on it less.

Reclaim work is up overall, however direct reclaim work is down. Part of
the increase can be attributed to a higher use of THPs, which due to
internal fragmentation increase the memory footprint. This is not
necessarily an unexpected side-effect for users of THP.

However, taken both points together, there may well be some opportunities
for fine tuning in the reclaim/compaction coordination.

[[email protected]: fix squawks from rebasing]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 61659efd 11-Mar-2025 Gavin Shan <[email protected]>

drivers/base/memory: improve add_boot_memory_block()

Patch series "drivers/base/memory: Two cleanups", v3.

Two cleanups to drivers/base/memory.


This patch (of 2)L

It's unnecessary to count the p

drivers/base/memory: improve add_boot_memory_block()

Patch series "drivers/base/memory: Two cleanups", v3.

Two cleanups to drivers/base/memory.


This patch (of 2)L

It's unnecessary to count the present sections for the specified block
since the block will be added if any section in the block is present.
Besides, for_each_present_section_nr() can be reused as Andrew Morton
suggested.

Improve by using for_each_present_section_nr() and dropping the
unnecessary @section_count.

No functional changes intended.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gavin Shan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Oscar Salvador <[email protected]>
Cc: Danilo Krummrich <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.14-rc6, v6.14-rc5
# 82ba975e 28-Feb-2025 Alistair Popple <[email protected]>

mm: allow compound zone device pages

Zone device pages are used to represent various type of device memory
managed by device drivers. Currently compound zone device pages are not
supported. This i

mm: allow compound zone device pages

Zone device pages are used to represent various type of device memory
managed by device drivers. Currently compound zone device pages are not
supported. This is because MEMORY_DEVICE_FS_DAX pages are the only user
of higher order zone device pages and have their own page reference
counting.

A future change will unify FS DAX reference counting with normal page
reference counting rules and remove the special FS DAX reference counting.
Supporting that requires compound zone device pages.

Supporting compound zone device pages requires compound_head() to
distinguish between head and tail pages whilst still preserving the
special struct page fields that are specific to zone device pages.

A tail page is distinguished by having bit zero being set in
page->compound_head, with the remaining bits pointing to the head page.
For zone device pages page->compound_head is shared with page->pgmap.

The page->pgmap field must be common to all pages within a folio, even if
the folio spans memory sections. Therefore pgmap is the same for both
head and tail pages and can be moved into the folio and we can use the
standard scheme to find compound_head from a tail page.

Link: https://lkml.kernel.org/r/67055d772e6102accf85161d0b57b0b3944292bf.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Tested-by: Alison Schofield <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Asahi Lina <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Chunyan Zhang <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: linmiaohe <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michael "Camp Drill Sergeant" Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Ted Ts'o <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# d65917c4 28-Feb-2025 Frank van der Linden <[email protected]>

mm/sparse: allow for alternate vmemmap section init at boot

Add functions that are called just before the per-section memmap is
initialized and just before the memmap page structures are initialized

mm/sparse: allow for alternate vmemmap section init at boot

Add functions that are called just before the per-section memmap is
initialized and just before the memmap page structures are initialized.
They are called sparse_vmemmap_init_nid_early and
sparse_vmemmap_init_nid_late, respectively.

This allows for mm subsystems to add calls to initialize memmap and page
structures in a specific way, if using SPARSEMEM_VMEMMAP. Specifically,
hugetlb can pre-HVO bootmem allocated pages that way, so that no time and
resources are wasted on allocating vmemmap pages, only to free them later
(and possibly unnecessarily running the system out of memory in the
process).

Refactor some code and export a few convenience functions for external
use.

In sparse_init_nid, skip any sections that are already initialized, e.g.
they have been initialized by sparse_vmemmap_init_nid_early already.

The hugetlb code to use these functions will be added in a later commit.

Export section_map_size, as any alternate memmap init code will want to
use it.

The internal config option to enable this is SPARSEMEM_VMEMMAP_PREINIT,
which is selected if an architecture-specific option,
ARCH_WANT_HUGETLB_VMEMMAP_PREINIT, is set. In the future, if other
subsystems want to do preinit too, they can do it in a similar fashion.

The internal config option is there because a section flag is used, and
the number of flags available is architecture-dependent (see mmzone.h).
Architecures can decide if there is room for the flag when enabling
options that select SPARSEMEM_VMEMMAP_PREINIT.

Fortunately, as of right now, all sparse vmemmap using architectures do
have room.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Frank van der Linden <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Roman Gushchin (Cruise) <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.14-rc4
# 8c57b687 22-Feb-2025 Alexei Starovoitov <[email protected]>

mm, bpf: Introduce free_pages_nolock()

Introduce free_pages_nolock() that can free pages without taking locks.
It relies on trylock and can be called from any context.
Since spin_trylock() cannot be

mm, bpf: Introduce free_pages_nolock()

Introduce free_pages_nolock() that can free pages without taking locks.
It relies on trylock and can be called from any context.
Since spin_trylock() cannot be used in PREEMPT_RT from hard IRQ or NMI
it uses lockless link list to stash the pages which will be freed
by subsequent free_pages() from good context.

Do not use llist unconditionally. BPF maps continuously
allocate/free, so we cannot unconditionally delay the freeing to
llist. When the memory becomes free make it available to the
kernel and BPF users right away if possible, and fallback to
llist as the last resort.

Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Sebastian Andrzej Siewior <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

show more ...


Revision tags: v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6
# 4d5d14a0 31-Dec-2024 Yu Zhao <[email protected]>

mm/mglru: rework workingset protection

With the aging feedback no longer considering the distribution of folios
in each generation, rework workingset protection to better distribute
folios across MA

mm/mglru: rework workingset protection

With the aging feedback no longer considering the distribution of folios
in each generation, rework workingset protection to better distribute
folios across MAX_NR_GENS. This is achieved by reusing PG_workingset and
PG_referenced/LRU_REFS_FLAGS in a slightly different way.

For folios accessed multiple times through file descriptors, make
lru_gen_inc_refs() set additional bits of LRU_REFS_WIDTH in folio->flags
after PG_referenced, then PG_workingset after LRU_REFS_WIDTH. After all
its bits are set, i.e., LRU_REFS_FLAGS|BIT(PG_workingset), a folio is
lazily promoted into the second oldest generation in the eviction path.
And when folio_inc_gen() does that, it clears LRU_REFS_FLAGS so that
lru_gen_inc_refs() can start over. For this case, LRU_REFS_MASK is only
valid when PG_referenced is set.

For folios accessed multiple times through page tables, folio_update_gen()
from a page table walk or lru_gen_set_refs() from a rmap walk sets
PG_referenced after the accessed bit is cleared for the first time.
Thereafter, those two paths set PG_workingset and promote folios to the
youngest generation. Like folio_inc_gen(), when folio_update_gen() does
that, it also clears PG_referenced. For this case, LRU_REFS_MASK is not
used.

For both of the cases, after PG_workingset is set on a folio, it remains
until this folio is either reclaimed, or "deactivated" by
lru_gen_clear_refs(). It can be set again if lru_gen_test_recent()
returns true upon a refault.

When adding folios to the LRU lists, lru_gen_folio_seq() distributes
them as follows:
+---------------------------------+---------------------------------+
| Accessed thru page tables | Accessed thru file descriptors |
+---------------------------------+---------------------------------+
| PG_active (set while isolated) | |
+----------------+----------------+----------------+----------------+
| PG_workingset | PG_referenced | PG_workingset | LRU_REFS_FLAGS |
+---------------------------------+---------------------------------+
|<--------- MIN_NR_GENS --------->| |
|<-------------------------- MAX_NR_GENS -------------------------->|

After this patch, some typical client and server workloads showed
improvements under heavy memory pressure. For example, Python TPC-C,
which was used to benchmark a different approach [1] to better detect
refault distances, showed a significant decrease in total refaults:

Before After Change
Time (seconds) 10801 10801 0%
Executed (transactions) 41472 43663 +5%
workingset_nodes 109070 120244 +10%
workingset_refault_anon 5019627 7281831 +45%
workingset_refault_file 1294678786 554855564 -57%
workingset_refault_total 1299698413 562137395 -57%

[1] https://lore.kernel.org/[email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Reported-by: Kairui Song <[email protected]>
Closes: https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@mail.gmail.com/
Tested-by: Kalesh Singh <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: David Stevens <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 798c0330 31-Dec-2024 Yu Zhao <[email protected]>

mm/mglru: rework aging feedback

The aging feedback is based on both the number of generations and the
distribution of folios in each generation. The number of generations is
currently the distance

mm/mglru: rework aging feedback

The aging feedback is based on both the number of generations and the
distribution of folios in each generation. The number of generations is
currently the distance between max_seq and anon min_seq. This is because
anon min_seq is not allowed to move past file min_seq. The rationale for
that is that file is always evictable whereas anon is not. However, for
use cases where anon is a lot cheaper than file:

1. Anon in the second oldest generation can be a better choice than
file in the oldest generation.

2. A large amount of file in the oldest generation can skew the
distribution, making should_run_aging() return false negative.

Allow anon and file min_seq to move independently, and use solely the
number of generations as the feedback for aging. Specifically, when both
anon and file are evictable, anon min_seq can now be greater than file
min_seq, and therefore the number of generations becomes the distance
between max_seq and min(min_seq[0],min_seq[1]). And should_run_aging()
returns true if and only if the number of generations is less than
MAX_NR_GENS.

As the first step to the final optimization, this change by itself should
not have userspace-visiable effects beyond performance. The next twos
patch will take advantage of this change; the last patch in this series
will better distribute folios across MAX_NR_GENS.

[[email protected]: restore behaviour for systems with swappiness == 200]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Reported-by: David Stevens <[email protected]>
Tested-by: Kalesh Singh <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Kairui Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6
# 05d4532b 01-Nov-2024 Joshua Hahn <[email protected]>

memcg/hugetlb: add hugeTLB counters to memcg

This patch introduces a new counter to memory.stat that tracks hugeTLB
usage, only if hugeTLB accounting is done to memory.current. This feature
is enab

memcg/hugetlb: add hugeTLB counters to memcg

This patch introduces a new counter to memory.stat that tracks hugeTLB
usage, only if hugeTLB accounting is done to memory.current. This feature
is enabled the same way hugeTLB accounting is enabled, via the
memory_hugetlb_accounting mount flag for cgroupsv2.

1. Why is this patch necessary?
Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
hugeTLB usage to memory.current. However, the metric is not reported in
memory.stat. Given that users often interpret memory.stat as a breakdown
of the value reported in memory.current, the disparity between the two
reports can be confusing. This patch solves this problem by including the
metric in memory.stat as well, but only if it is also reported in
memory.current (it would also be confusing if the value was reported in
memory.stat, but not in memory.current)

Aside from the consistency between the two files, we also see benefits in
observability. Userspace might be interested in the hugeTLB footprint of
cgroups for many reasons. For instance, system admins might want to
verify that hugeTLB usage is distributed as expected across tasks: i.e.
memory-intensive tasks are using more hugeTLB pages than tasks that don't
consume a lot of memory, or are seen to fault frequently. Note that this
is separate from wanting to inspect the distribution for limiting purposes
(in which case, hugeTLB controller makes more sense).

2. We already have a hugeTLB controller. Why not use that?
It is true that hugeTLB tracks the exact value that we want. In fact, by
enabling the hugeTLB controller, we get all of the observability benefits
that I mentioned above, and users can check the total hugeTLB usage,
verify if it is distributed as expected, etc.

With this said, there are 2 problems:
(a) They are still not reported in memory.stat, which means the
disparity between the memcg reports are still there.
(b) We cannot reasonably expect users to enable the hugeTLB controller
just for the sake of hugeTLB usage reporting, especially since
they don't have any use for hugeTLB usage enforcing [2].

3. Implementation Details:
In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio
regardless of whether memcg accounts hugetlb. mem_cgroup_commit_charge
which is called from alloc_hugetlb_folio will set memcg for the folio only
if the CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is used, so
lruvec_stat_mod_folio accounts per-memcg hugetlb counters only if the
feature is enabled. Regardless of whether memcg accounts for hugetlb, the
newly added global counter is updated and shown in /proc/vmstat.

The global counter is added because vmstats is the preferred framework for
cgroup stats. It makes stat items consistent between global and cgroups.
It also provides a per-node breakdown, which is useful. Because it does
not use cgroup-specific hooks, we also keep generic MM code separate from
memcg code.

[1] https://lore.kernel.org/all/[email protected]/
[2] Of course, we can't make a new patch for every feature that can be
duplicated. However, since the existing solution of enabling the
hugeTLB controller is an imperfect solution that still leaves a
discrepancy between memory.stat and memory.curent, I think that it
is reasonable to isolate the feature in this case.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joshua Hahn <[email protected]>
Suggested-by: Nhat Pham <[email protected]>
Suggested-by: Shakeel Butt <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Chris Down <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# c928807f 28-Oct-2024 Yu Zhao <[email protected]>

mm/page_alloc: keep track of free highatomic

OOM kills due to vastly overestimated free highatomic reserves were
observed:

... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0

mm/page_alloc: keep track of free highatomic

OOM kills due to vastly overestimated free highatomic reserves were
observed:

... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB

The second line above shows that the OOM kill was due to the following
condition:

free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)

And the third line shows there were no free pages in any
MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type 'H'.
Therefore __zone_watermark_unusable_free() underestimated the usable free
memory by over 1GB, which resulted in the unnecessary OOM kill above.

The comments in __zone_watermark_unusable_free() warns about the potential
risk, i.e.,

If the caller does not have rights to reserves below the min
watermark then subtract the high-atomic reserves. This will
over-estimate the size of the atomic reserve but it avoids a search.

However, it is possible to keep track of free pages in reserved highatomic
pageblocks with a new per-zone counter nr_free_highatomic protected by the
zone lock, to avoid a search when calculating the usable free memory. And
the cost would be minimal, i.e., simple arithmetics in the highatomic
alloc/free/move paths.

Note that since nr_free_highatomic can be relatively small, using a
per-cpu counter might cause too much drift and defeat its purpose, in
addition to the extra memory overhead.

Dependson e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting") - see [1]

[[email protected]: s/if/else if/, per Johannes, stealth whitespace tweak]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected] [1]
Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
Signed-off-by: Yu Zhao <[email protected]>
Reported-by: Link Lin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.12-rc5, v6.12-rc4
# f1001f3d 17-Oct-2024 Wei Xu <[email protected]>

mm/mglru: reset page lru tier bits when activating

When a folio is activated, lru_gen_add_folio() moves the folio to the
youngest generation. But unlike folio_update_gen()/folio_inc_gen(),
lru_gen_

mm/mglru: reset page lru tier bits when activating

When a folio is activated, lru_gen_add_folio() moves the folio to the
youngest generation. But unlike folio_update_gen()/folio_inc_gen(),
lru_gen_add_folio() doesn't reset the folio lru tier bits (LRU_REFS_MASK |
LRU_REFS_FLAGS). This inconsistency can affect how pages are aged via
folio_mark_accessed() (e.g. fd accesses), though no user visible impact
related to this has been detected yet.

Note that lru_gen_add_folio() cannot clear PG_workingset if the activation
is due to workingset refault, otherwise PSI accounting will be skipped.
So fix lru_gen_add_folio() to clear the lru tier bits other than
PG_workingset when activating a folio, and also clear all the lru tier
bits when a folio is activated via folio_activate() in
lru_gen_look_around().

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap")
Signed-off-by: Wei Xu <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Jan Alexander Steffens <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 1d4832be 19-Oct-2024 Yu Zhao <[email protected]>

mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify()

When the MM_WALK capability is enabled, memory that is mostly accessed by
a VM appears younger than it really is, therefore this memory will b

mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify()

When the MM_WALK capability is enabled, memory that is mostly accessed by
a VM appears younger than it really is, therefore this memory will be less
likely to be evicted. Therefore, the presence of a running VM can
significantly increase swap-outs for non-VM memory, regressing the
performance for the rest of the system.

Fix this regression by always calling {ptep,pmdp}_clear_young_notify()
whenever we clear the young bits on PMDs/PTEs.

[[email protected]: fix link-time error]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
Signed-off-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
Reported-by: David Stevens <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: David Matlack <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: <[email protected]>
Cc: kernel test robot <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# ddd6d8e9 19-Oct-2024 Yu Zhao <[email protected]>

mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats

Patch series "mm: multi-gen LRU: Have secondary MMUs participate in
MM_WALK".

Today, the MM_WALK capability causes MGLRU to clear th

mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats

Patch series "mm: multi-gen LRU: Have secondary MMUs participate in
MM_WALK".

Today, the MM_WALK capability causes MGLRU to clear the young bit from
PMDs and PTEs during the page table walk before eviction, but MGLRU does
not call the clear_young() MMU notifier in this case. By not calling this
notifier, the MM walk takes less time/CPU, but it causes pages that are
accessed mostly through KVM / secondary MMUs to appear younger than they
should be.

We do call the clear_young() notifier today, but only when attempting to
evict the page, so we end up clearing young/accessed information less
frequently for secondary MMUs than for mm PTEs, and therefore they appear
younger and are less likely to be evicted. Therefore, memory that is
*not* being accessed mostly by KVM will be evicted *more* frequently,
worsening performance.

ChromeOS observed a tab-open latency regression when enabling MGLRU with a
setup that involved running a VM:

Tab-open latency histogram (ms)
Version p50 mean p95 p99 max
base 1315 1198 2347 3454 10319
mglru 2559 1311 7399 12060 43758
fix 1119 926 2470 4211 6947

This series replaces the final non-selftest patchs from this series[1],
which introduced a similar change (and a new MMU notifier) with KVM
optimizations. I'll send a separate series (to Sean and Paolo) for the
KVM optimizations.

This series also makes proactive reclaim with MGLRU possible for KVM
memory. I have verified that this functions correctly with the selftest
from [1], but given that that test is a KVM selftest, I'll send it with
the rest of the KVM optimizations later. Andrew, let me know if you'd
like to take the test now anyway.

[1]: https://lore.kernel.org/linux-mm/[email protected]/


This patch (of 2):

The removed stats, MM_LEAF_OLD and MM_NONLEAF_TOTAL, are not very helpful
and become more complicated to properly compute when adding
test/clear_young() notifiers in MGLRU's mm walk.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
Signed-off-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: David Matlack <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: David Stevens <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2
# 620943d7 01-Aug-2024 Andrew Morton <[email protected]>

include/linux/mmzone.h: clean up watermark accessors

- we have a helper wmark_pages(). Teach min_wmark_pages(),
low_wmark_pages(), high_wmark_pages() and promo_wmark_pages() to use
it instead o

include/linux/mmzone.h: clean up watermark accessors

- we have a helper wmark_pages(). Teach min_wmark_pages(),
low_wmark_pages(), high_wmark_pages() and promo_wmark_pages() to use
it instead of open-coding its implementation.

- there's no reason to implement all these things as macros. Redo them
in C.

Acked-by: Johannes Weiner <[email protected]>
Cc: Kaiyang Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 03790c51 01-Aug-2024 Kaiyang Zhao <[email protected]>

mm: create promo_wmark_pages and clean up open-coded sites

Patch series "mm: print the promo watermark in zoneinfo", v2.


This patch (of 2):

Define promo_wmark_pages and convert current call sites

mm: create promo_wmark_pages and clean up open-coded sites

Patch series "mm: print the promo watermark in zoneinfo", v2.


This patch (of 2):

Define promo_wmark_pages and convert current call sites of wmark_pages
with fixed WMARK_PROMO to using it instead.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kaiyang Zhao <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 29943248 29-Jul-2024 Wei Yang <[email protected]>

mm: improve code consistency with zonelist_* helper functions

Replace direct access to zoneref->zone, zoneref->zone_idx, or
zone_to_nid(zoneref->zone) with the corresponding zonelist_* helper
functi

mm: improve code consistency with zonelist_* helper functions

Replace direct access to zoneref->zone, zoneref->zone_idx, or
zone_to_nid(zoneref->zone) with the corresponding zonelist_* helper
functions for consistency.

No functional change.

Link: https://lkml.kernel.org/r/[email protected]
Co-developed-by: Shivank Garg <[email protected]>
Signed-off-by: Shivank Garg <[email protected]>
Signed-off-by: Wei Yang <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 9d857311 08-Aug-2024 Pasha Tatashin <[email protected]>

mm: don't account memmap per-node

Fix invalid access to pgdat during hot-remove operation:
ndctl users reported a GPF when trying to destroy a namespace:
$ ndctl destroy-namespace all -r all -f
Seg

mm: don't account memmap per-node

Fix invalid access to pgdat during hot-remove operation:
ndctl users reported a GPF when trying to destroy a namespace:
$ ndctl destroy-namespace all -r all -f
Segmentation fault
dmesg:
Oops: general protection fault, probably for
non-canonical address 0xdffffc0000005650: 0000 [#1] PREEMPT SMP KASAN
PTI
KASAN: probably user-memory-access in range
[0x000000000002b280-0x000000000002b287]
CPU: 26 UID: 0 PID: 1868 Comm: ndctl Not tainted 6.11.0-rc1 #1
Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS
2.20.1 09/13/2023
RIP: 0010:mod_node_page_state+0x2a/0x110

cxl-test users report a GPF when trying to unload the test module:
$ modrpobe -r cxl-test
dmesg
BUG: unable to handle page fault for address: 0000000000004200
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 UID: 0 PID: 1076 Comm: modprobe Tainted: G O N 6.11.0-rc1 #197
Tainted: [O]=OOT_MODULE, [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/15
RIP: 0010:mod_node_page_state+0x6/0x90

Currently, when memory is hot-plugged or hot-removed the accounting is
done based on the assumption that memmap is allocated from the same node
as the hot-plugged/hot-removed memory, which is not always the case.

In addition, there are challenges with keeping the node id of the memory
that is being remove to the time when memmap accounting is actually
performed: since this is done after remove_pfn_range_from_zone(), and
also after remove_memory_block_devices(). Meaning that we cannot use
pgdat nor walking though memblocks to get the nid.

Given all of that, account the memmap overhead system wide instead.

For this we are going to be using global atomic counters, but given that
memmap size is rarely modified, and normally is only modified either
during early boot when there is only one CPU, or under a hotplug global
mutex lock, therefore there is no need for per-cpu optimizations.

Also, while we are here rename nr_memmap to nr_memmap_pages, and
nr_memmap_boot to nr_memmap_boot_pages to be self explanatory that the
units are in page count.

[[email protected]: address a few nits from David Hildenbrand]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 15995a352474 ("mm: report per-page metadata information")
Signed-off-by: Pasha Tatashin <[email protected]>
Reported-by: Yi Zhang <[email protected]>
Closes: https://lore.kernel.org/linux-cxl/CAHj4cs9Ax1=CoJkgBGP_+sNu6-6=6v=_L-ZBZY0bVLD3wUWZQg@mail.gmail.com
Reported-by: Alison Schofield <[email protected]>
Closes: https://lore.kernel.org/linux-mm/Zq0tPd2h6alFz8XF@aschofie-mobl2/#t
Tested-by: Dan Williams <[email protected]>
Tested-by: Alison Schofield <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: David Rientjes <[email protected]>
Tested-by: Yi Zhang <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Fan Ni <[email protected]>
Cc: Joel Granados <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Li Zhijian <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6
# 82f0b6f0 26-Jun-2024 Waiman Long <[email protected]>

mm: prevent derefencing NULL ptr in pfn_section_valid()

Commit 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing
memory_section->usage") changed pfn_section_valid() to add a READ_ONCE()
call around

mm: prevent derefencing NULL ptr in pfn_section_valid()

Commit 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing
memory_section->usage") changed pfn_section_valid() to add a READ_ONCE()
call around "ms->usage" to fix a race with section_deactivate() where
ms->usage can be cleared. The READ_ONCE() call, by itself, is not enough
to prevent NULL pointer dereference. We need to check its value before
dereferencing it.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage")
Signed-off-by: Waiman Long <[email protected]>
Cc: Charan Teja Kalla <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.10-rc5, v6.10-rc4, v6.10-rc3
# 15995a35 05-Jun-2024 Sourav Panda <[email protected]>

mm: report per-page metadata information

Today, we do not have any observability of per-page metadata and how much
it takes away from the machine capacity. Thus, we want to describe the
amount of m

mm: report per-page metadata information

Today, we do not have any observability of per-page metadata and how much
it takes away from the machine capacity. Thus, we want to describe the
amount of memory that is going towards per-page metadata, which can vary
depending on build configuration, machine architecture, and system use.

This patch adds 2 fields to /proc/vmstat that can used as shown below:

Accounting per-page metadata allocated by boot-allocator:
/proc/vmstat:nr_memmap_boot * PAGE_SIZE

Accounting per-page metadata allocated by buddy-allocator:
/proc/vmstat:nr_memmap * PAGE_SIZE

Accounting total Perpage metadata allocated on the machine:
(/proc/vmstat:nr_memmap_boot +
/proc/vmstat:nr_memmap) * PAGE_SIZE

Utility for userspace:

Observability: Describe the amount of memory overhead that is going to
per-page metadata on the system at any given time since this overhead is
not currently observable.

Debugging: Tracking the changes or absolute value in struct pages can help
detect anomalies as they can be correlated with other metrics in the
machine (e.g., memtotal, number of huge pages, etc).

page_ext overheads: Some kernel features such as page_owner
page_table_check that use page_ext can be optionally enabled via kernel
parameters. Having the total per-page metadata information helps users
precisely measure impact. Furthermore, page-metadata metrics will reflect
the amount of struct pages reliquished (or overhead reduced) when
hugetlbfs pages are reserved which will vary depending on whether hugetlb
vmemmap optimization is enabled or not.

For background and results see:
lore.kernel.org/all/[email protected]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sourav Panda <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Chen Linxuan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Ivan Babrou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tomas Mudrunka <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Yang Yang <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# bf14ed81 20-Jun-2024 yangge <[email protected]>

mm/page_alloc: Separate THP PCP into movable and non-movable categories

Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations") no longer differentiates the migr

mm/page_alloc: Separate THP PCP into movable and non-movable categories

Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations") no longer differentiates the migration type of
pages in THP-sized PCP list, it's possible that non-movable allocation
requests may get a CMA page from the list, in some cases, it's not
acceptable.

If a large number of CMA memory are configured in system (for example, the
CMA memory accounts for 50% of the system memory), starting a virtual
machine with device passthrough will get stuck. During starting the
virtual machine, it will call pin_user_pages_remote(..., FOLL_LONGTERM,
...) to pin memory. Normally if a page is present and in CMA area,
pin_user_pages_remote() will migrate the page from CMA area to non-CMA
area because of FOLL_LONGTERM flag. But if non-movable allocation
requests return CMA memory, migrate_longterm_unpinnable_pages() will
migrate a CMA page to another CMA page, which will fail to pass the check
in check_and_migrate_movable_pages() and cause migration endless.

Call trace:
pin_user_pages_remote
--__gup_longterm_locked // endless loops in this function
----_get_user_pages_locked
----check_and_migrate_movable_pages
------migrate_longterm_unpinnable_pages
--------alloc_migration_target

This problem will also have a negative impact on CMA itself. For example,
when CMA is borrowed by THP, and we need to reclaim it through cma_alloc()
or dma_alloc_coherent(), we must move those pages out to ensure CMA's
users can retrieve that contigous memory. Currently, CMA's memory is
occupied by non-movable pages, meaning we can't relocate them. As a
result, cma_alloc() is more likely to fail.

To fix the problem above, we add one PCP list for THP, which will not
introduce a new cacheline for struct per_cpu_pages. THP will have 2 PCP
lists, one PCP list is used by MOVABLE allocation, and the other PCP list
is used by UNMOVABLE allocation. MOVABLE allocation contains GPF_MOVABLE,
and UNMOVABLE allocation contains GFP_UNMOVABLE and GFP_RECLAIMABLE.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
Signed-off-by: yangge <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4
# 212c5c07 13-Apr-2024 Pasha Tatashin <[email protected]>

iommu: account IOMMU allocated memory

In order to be able to limit the amount of memory that is allocated
by IOMMU subsystem, the memory must be accounted.

Account IOMMU as part of the secondary pa

iommu: account IOMMU allocated memory

In order to be able to limit the amount of memory that is allocated
by IOMMU subsystem, the memory must be accounted.

Account IOMMU as part of the secondary pagetables as it was discussed
at LPC.

The value of SecPageTables now contains mmeory allocation by IOMMU
and KVM.

There is a difference between GFP_ACCOUNT and what NR_IOMMU_PAGES shows.
GFP_ACCOUNT is set only where it makes sense to charge to user
processes, i.e. IOMMU Page Tables, but there more IOMMU shared data
that should not really be charged to a specific process.

Signed-off-by: Pasha Tatashin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Tested-by: Bagas Sanjaya <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joerg Roedel <[email protected]>

show more ...


# bd3520a9 13-Apr-2024 Pasha Tatashin <[email protected]>

iommu: observability of the IOMMU allocations

Add NR_IOMMU_PAGES into node_stat_item that counts number of pages
that are allocated by the IOMMU subsystem.

The allocations can be view per-node via:

iommu: observability of the IOMMU allocations

Add NR_IOMMU_PAGES into node_stat_item that counts number of pages
that are allocated by the IOMMU subsystem.

The allocations can be view per-node via:
/sys/devices/system/node/nodeN/vmstat.

For example:

$ grep iommu /sys/devices/system/node/node*/vmstat
/sys/devices/system/node/node0/vmstat:nr_iommu_pages 106025
/sys/devices/system/node/node1/vmstat:nr_iommu_pages 3464

The value is in page-count, therefore, in the above example
the iommu allocations amount to ~428M.

Signed-off-by: Pasha Tatashin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Tested-by: Bagas Sanjaya <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joerg Roedel <[email protected]>

show more ...


12345678910>>...23