History log of /linux-6.15/include/linux/mm.h (Results 1 – 25 of 1487)
Revision   Date   Author   Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6
# 0f518255 07-May-2025 Florent Revest <[email protected]>

mm: fix VM_UFFD_MINOR == VM_SHADOW_STACK on USERFAULTFD=y && ARM64_GCS=y

On configs with CONFIG_ARM64_GCS=y, VM_SHADOW_STACK is bit 38. On configs
with CONFIG_HAVE_ARCH_USERFAULTFD_MINOR=y (selected by CONFIG_ARM64 when
CONFIG_USERFAULTFD=y), VM_UFFD_MINOR is _also_ bit 38.

This bit being shared by two different VMA flags could lead to all sorts
of unintended behaviors. Presumably, a process could maybe call into
userfaultfd in a way that disables the shadow stack vma flag. I can't
think of any attack where this would help (presumably, if an attacker
tries to disable shadow stacks, they are trying to hijack control flow so
can't arbitrarily call into userfaultfd yet anyway) but this still feels
somewhat scary.
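
A minimal userspace sketch of the collision (illustrative only; the macro
names below are stand-ins, only the shared bit position comes from this
report):

  #include <stdio.h>

  /* Both "flags" decode to bit 38, so setting or clearing one is
   * indistinguishable from setting or clearing the other. */
  #define VM_UFFD_MINOR_ILLUSTRATIVE   (1ULL << 38)
  #define VM_SHADOW_STACK_ILLUSTRATIVE (1ULL << 38)

  int main(void)
  {
          unsigned long long vm_flags = 0;

          vm_flags |= VM_UFFD_MINOR_ILLUSTRATIVE;  /* register for minor faults */

          /* ... and the VMA now also looks like a shadow stack. */
          printf("shadow stack flag set: %d\n",
                 !!(vm_flags & VM_SHADOW_STACK_ILLUSTRATIVE));
          return 0;
  }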

Link: https://lkml.kernel.org/r/[email protected]
Fixes: ae80e1629aea ("mm: Define VM_SHADOW_STACK for arm64 when we support GCS")
Signed-off-by: Florent Revest <[email protected]>
Reviewed-by: Mark Brown <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Florent Revest <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thiago Jung Bauermann <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Revision tags: v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2
# 98b1917c 10-Apr-2025 David Hildenbrand <[email protected]>

fs/dax: fix folio splitting issue by resetting old folio order + _nr_pages

Alison reports an issue with fsdax when large extents end up using large
ZONE_DEVICE folios:

[ 417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
[ 417.796982] #PF: supervisor read access in kernel mode
[ 417.797540] #PF: error_code(0x0000) - not-present page
[ 417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
[ 417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
[ 417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: ...
[ 417.800150] Tainted: [O]=OOT_MODULE
[ 417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
[ 417.801948] Code: ...
[ 417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
[ 417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
[ 417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
[ 417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
[ 417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
[ 417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
[ 417.807801] FS: 00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
[ 417.808570] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
[ 417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 417.811353] Call Trace:
[ 417.811709] <TASK>
[ 417.812038] folio_add_file_rmap_ptes+0x143/0x230
[ 417.812566] insert_page_into_pte_locked+0x1ee/0x3c0
[ 417.813132] insert_page+0x78/0xf0
[ 417.813558] vmf_insert_page_mkwrite+0x55/0xa0
[ 417.814088] dax_fault_iter+0x484/0x7b0
[ 417.814542] dax_iomap_pte_fault+0x1ca/0x620
[ 417.815055] dax_iomap_fault+0x39/0x40
[ 417.815499] __xfs_write_fault+0x139/0x380
[ 417.815995] ? __handle_mm_fault+0x5e5/0x1a60
[ 417.816483] xfs_write_fault+0x41/0x50
[ 417.816966] xfs_filemap_fault+0x3b/0xe0
[ 417.817424] __do_fault+0x31/0x180
[ 417.817859] __handle_mm_fault+0xee1/0x1a60
[ 417.818325] ? debug_smp_processor_id+0x17/0x20
[ 417.818844] handle_mm_fault+0xe1/0x2b0
[...]

The issue is that when we split a large ZONE_DEVICE folio to order-0 ones,
we don't reset the order/_nr_pages. As folio->_nr_pages overlays
page[1]->memcg_data, once page[1] is a folio, it suddenly looks like it
has folio->memcg_data set. And we never manually initialize
folio->memcg_data in fsdax code, because we never expect it to be set at
all.

When __lruvec_stat_mod_folio() then stumbles over such a folio, it tries
to use folio->memcg_data (because it's non-NULL) but it does not actually
point at a memcg, resulting in the problem.

Alison also observed that these folios sometimes have "locked" set, which
is rather concerning (folios locked from the beginning ...). The reason
is that the order for large folios is stored in page[1]->flags, which
become the folio->flags of a new small folio.

Let's fix it by adding a folio helper to clear order/_nr_pages for
splitting purposes.
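
A hedged sketch of what such a helper could look like (field and helper
names are assumptions based on the text above, not the exact upstream
code):

  /*
   * Sketch only: when splitting a large folio down to order-0, clear the
   * order stored in the first tail page and the _nr_pages field that
   * overlays page[1]->memcg_data, so the resulting small folios do not
   * appear to carry stale memcg/flags state.
   */
  static inline void folio_reset_order(struct folio *folio)
  {
          if (WARN_ON_ONCE(!folio_test_large(folio)))
                  return;
  #ifdef NR_PAGES_IN_LARGE_FOLIO
          folio->_nr_pages = 0;
  #endif
          folio->_flags_1 &= ~0xffUL;     /* the order lives in the low byte */
  }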

Maybe we should reinitialize other large folio flags / folio members as
well when splitting, because they might similarly cause harm once page[1]
becomes a folio? At least other flags in PAGE_FLAGS_SECOND should not be
set for fsdax, so at least page[1]->flags might be as expected with this
fix.

From a quick glimpse, initializing ->mapping, ->pgmap and ->share should
re-initialize most things from a previous page[1] used by large folios
that fsdax cares about. For example, folio->private might not get
reinitialized, but maybe that's not relevant -- no traces of its use in
fsdax code. Needs a closer look.

Another thing that should be considered in the future is performing
similar checks as we perform in free_tail_page_prepare()
-- checking pincount etc.
-- when freeing a large fsdax folio.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages")
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: Alison Schofield <[email protected]>
Closes: https://lkml.kernel.org/r/[email protected]
Tested-by: Alison Schofield <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Tested-by: "Darrick J. Wong" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Revision tags: v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6
# 5796d396 05-Mar-2025 Jeff Xu <[email protected]>

mseal sysmap: kernel config and header change

Patch series "mseal system mappings", v9.

As discussed during mseal() upstream process [1], mseal() protects the
VMAs of a given virtual memory range against modifications, such as the
read/write (RW) and no-execute (NX) bits. For complete descriptions of
memory sealing, please see mseal.rst [2].

mseal() is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped.

System mappings are read-only; memory sealing can protect them from ever
being made writable, or from being unmapped or remapped with different
attributes.

System mappings such as vdso, vvar, vvar_vclock, vectors (arm
compat-mode), sigpage (arm compat-mode), are created by the kernel during
program initialization, and could be sealed after creation.

Unlike the aforementioned mappings, the uprobe mapping is not established
during program startup. However, its lifetime is the same as the
process's lifetime [3]. It could be sealed from creation.

The vsyscall on x86-64 uses a special address (0xffffffffff600000), which
is outside the mm managed range. This means mprotect, munmap, and mremap
won't work on the vsyscall. Since sealing doesn't enhance the vsyscall's
security, it is skipped in this patch. If we ever seal the vsyscall, it
is probably only for decorative purpose, i.e. showing the 'sl' flag in
the /proc/pid/smaps. For this patch, it is ignored.

It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
alter the system mappings during restore operations. UML (User Mode
Linux), gVisor and rr are also known to change the vdso/vvar mappings.
Consequently, this feature cannot be universally enabled across all
systems. As such, CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default.

To support mseal of system mappings, architectures must define
CONFIG_ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS and update their special
mappings calls to pass the mseal flag. Additionally, architectures must
confirm they do not unmap/remap system mappings during the process
lifetime. The existence of this flag for an architecture implies that it
does not require the remapping of these system mappings during the process
lifetime, so sealing these mappings is safe from a kernel perspective.

This version covers x86-64 and arm64 architectures as the minimum viable feature.

While no specific CPU hardware features are required to enable this
feature on an architecture, memory sealing requires a 64-bit kernel. Other
architectures can choose whether or not to adopt this feature. Currently,
I'm not aware of any instances in the kernel code that actively
munmap/mremap a system mapping without a request from userspace. The PPC
does call munmap when _install_special_mapping fails for vdso; however,
it's uncertain if this will ever fail for PPC - this needs to be
investigated by PPC in the future [4]. The UML kernel can add this
support when KUnit tests require it [5].

In this version, we've improved the handling of system mapping sealing
over previous versions: instead of modifying the _install_special_mapping
function itself, which would affect all architectures, we now call
_install_special_mapping with a sealing flag only within the specific
architecture that requires it. This targeted approach offers two key
advantages: 1) It limits the code change's impact to the necessary
architectures, and 2) It aligns with the software architecture by keeping
the core memory management within the mm layer, while delegating the
decision of sealing system mappings to the individual architecture, which
is particularly relevant since 32-bit architectures never require sealing.

Prior to this patch series, we explored sealing special mappings from
userspace using glibc's dynamic linker. This approach revealed several
issues:

- The PT_LOAD header may report an incorrect length for vdso (smaller
than its actual size). The dynamic linker, which relies on PT_LOAD
information to determine mapping size, would then split and partially
seal the vdso mapping. Since each architecture has its own vdso/vvar
code, fixing this in the kernel would require going through each
architecture. Our initial goal was to enable sealing read-only mappings,
e.g. .text, across all architectures; sealing the vdso from the kernel at
creation appears to be simpler than sealing it in glibc.

- The [vvar] mapping header only contains address information, not
length information. Similar issues might exist for other special
mappings.

- Mappings like uprobe are not covered by the dynamic linker, and there
is no effective solution for them.

This feature's security enhancements will benefit ChromeOS, Android, and
other high security systems.

Testing:
This feature was tested on ChromeOS and Android for both x86-64 and ARM64.
- Enable sealing and verify vdso/vvar, sigpage, vector are sealed properly,
i.e. "sl" shown in the smaps for those mappings, and mremap is blocked.
- Passing various automation tests (e.g. pre-checkin) on ChromeOS and
Android to ensure the sealing doesn't affect the functionality of
Chromebook and Android phone.

I also tested the feature on Ubuntu on x86-64:
- With config disabled, vdso/vvar is not sealed,
- with config enabled, vdso/vvar is sealed, and booting up Ubuntu is OK,
normal operations such as browsing the web, open/edit doc are OK.

Link: https://lore.kernel.org/all/[email protected]/ [1]
Link: Documentation/userspace-api/mseal.rst [2]
Link: https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ [3]
Link: https://lore.kernel.org/all/CABi2SkV6JJwJeviDLsq9N4ONvQ=EFANsiWkgiEOjyT9TQSt+HA@mail.gmail.com/ [4]
Link: https://lore.kernel.org/all/202502251035.239B85A93@keescook/ [5]


This patch (of 7):

Provide infrastructure to mseal system mappings. Establish two kernel
configs (CONFIG_MSEAL_SYSTEM_MAPPINGS,
ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS) and VM_SEALED_SYSMAP macro for future
patches.
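
A hedged sketch of the shape such a macro can take (the exact upstream
definition may differ):

  /* Expands to the sealing VMA flag only when the opt-in config is set, so
   * architecture code can pass it unconditionally when installing
   * vdso/vvar-style special mappings. */
  #ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
  #define VM_SEALED_SYSMAP        VM_SEALED
  #else
  #define VM_SEALED_SYSMAP        VM_NONE
  #endif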

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jeff Xu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Adhemerval Zanella <[email protected]>
Cc: Alexander Mikhalitsyn <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Anna-Maria Behnsen <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Benjamin Berg <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Elliot Hughes <[email protected]>
Cc: Florian Faineli <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jorge Lucangeli Obes <[email protected]>
Cc: Linus Waleij <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pedro Falcato <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Stephen Röttger <[email protected]>
Cc: Thomas Weißschuh <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Revision tags: v6.14-rc5, v6.14-rc4
# 74e2498c 18-Feb-2025 Masami Hiramatsu (Google) <[email protected]>

mm/memblock: Add reserved memory release function

Add reserve_mem_release_by_name() to release a reserved memory region
with a given name. This allows us to release reserved memory which is
defined by kernel cmdline, after boot.
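
A hedged usage sketch (the region name is hypothetical, and the lookup
helper and return conventions are assumptions based on the existing
reserve_mem interface):

  /* Sketch only: find a region created via the reserve_mem= command line,
   * consume its contents, then hand the memory back after boot. */
  static int example_consume_and_release(void)
  {
          phys_addr_t start, size;

          if (!reserve_mem_find_by_name("trace_buf", &start, &size))
                  return -ENODEV;

          /* ... copy out or process the data at [start, start + size) ... */

          return reserve_mem_release_by_name("trace_buf");
  }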

Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/173989133862.230693.14094993331347437600.stgit@devnote2
Signed-off-by: Steven Rostedt (Google) <[email protected]>

# 0d98484e 13-Mar-2025 Mike Rapoport (Microsoft) <[email protected]>

arch, mm: introduce arch_mm_preinit

Currently, implementation of mem_init() in every architecture consists of
one or more of the following:

* initializations that must run before page allocator is active, for
instance swiotlb_init()
* a call to memblock_free_all() to release all the memory to the buddy
allocator
* initializations that must run after page allocator is ready and there is
no arch-specific hook other than mem_init() for that, like for example
register_page_bootmem_info() in x86 and sparc64 or simple setting of
mem_init_done = 1 in several architectures
* a bunch of semi-related stuff that apparently had no better place to
live, for example a ton of BUILD_BUG_ON()s in parisc.

Introduce arch_mm_preinit() that will be the first thing called from
mm_core_init(). On architectures that have initializations that must happen
before the page allocator is ready, move those into arch_mm_preinit() along
with the code that does not depend on ordering with page allocator setup.

On several architectures this results in reduction of mem_init() to a
single call to memblock_free_all() that allows its consolidation next.
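
A hedged sketch of the resulting split on a hypothetical architecture
(illustrative only, not any specific arch's code):

  /* Allocator-independent setup moves into arch_mm_preinit(), which
   * mm_core_init() calls first; mem_init() shrinks to releasing memory to
   * the buddy allocator. */
  void __init arch_mm_preinit(void)
  {
          /* must run before the page allocator is active */
          swiotlb_init(true, SWIOTLB_VERBOSE);
  }

  void __init mem_init(void)
  {
          memblock_free_all();
  }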

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: Dave Hansen <[email protected]> [x86]
Tested-by: Mark Brown <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Guo Ren (csky) <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russel King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# 6faea342 13-Mar-2025 Mike Rapoport (Microsoft) <[email protected]>

arch, mm: streamline HIGHMEM freeing

All architectures that support HIGHMEM have their own code that frees high
memory pages to the buddy allocator, while __free_memory_core() is limited
to freeing only low memory.

There is no actual reason for that. The memory map is completely ready by
the time memblock_free_all() is called and high pages can be released to
the buddy allocator along with low memory.

Remove low memory limit from __free_memory_core() and drop per-architecture
code that frees high memory pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: Dave Hansen <[email protected]> [x86]
Tested-by: Mark Brown <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Guo Ren (csky) <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russel King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# 8268af30 13-Mar-2025 Mike Rapoport (Microsoft) <[email protected]>

arch, mm: set max_mapnr when allocating memory map for FLATMEM

max_mapnr is essentially the size of the memory map for systems that use
FLATMEM. There is no reason to calculate it in each and every architecture
when it's anyway calculated in alloc_node_mem_map().

Drop setting of max_mapnr from architecture code and set it once in
alloc_node_mem_map().

While on it, move definition of mem_map and max_mapnr to mm/mm_init.c so
there won't be two copies for MMU and !MMU variants.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: Dave Hansen <[email protected]> [x86]
Tested-by: Mark Brown <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Guo Ren (csky) <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russel King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# 003fde44 03-Mar-2025 David Hildenbrand <[email protected]>

mm: convert folio_likely_mapped_shared() to folio_maybe_mapped_shared()

Let's reuse our new MM ownership tracking infrastructure for large folios
to make folio_likely_mapped_shared() never return false negatives -- never
indicating "not mapped shared" although the folio *is* mapped shared.
With that, we can rename it to folio_maybe_mapped_shared() and get rid of
the dependency on the mapcount of the first folio page.

The semantics are now arguably clearer: no mixture of "false negatives"
and "false positives", only the remaining possibility for "false
positives".

Thoroughly document the new semantics. We might now detect that a large
folio is "maybe mapped shared" although it *no longer* is -- but once was.
Now, if more than two MMs mapped a folio at the same time, and the MM
mapping the folio exclusively at the end is not one tracked in the two
folio MM slots, we will detect the folio as "maybe mapped shared".

For anonymous folios, usually (except weird corner cases) all PTEs that
target a "maybe mapped shared" folio are R/O. As soon as a child process
would write to them (iow, actively use them), we would CoW and effectively
replace these PTEs. Most cases (below) are not expected to really matter
with large anonymous folios for this reason.

Most importantly, there will be no change at all for:
* small folios
* hugetlb folios
* PMD-mapped PMD-sized THPs (single mapping)

This change has the potential to affect existing callers of
folio_likely_mapped_shared() -> folio_maybe_mapped_shared():

(1) fs/proc/task_mmu.c: no change (hugetlb)

(2) khugepaged counts PTEs that target shared folios towards
max_ptes_shared (default: HPAGE_PMD_NR / 2), meaning we could skip a
collapse where we would have previously collapsed. This only applies
to anonymous folios and is not expected to matter in practice.

Worth noting that this change sorts out case (A) documented in
commit 1bafe96e89f0 ("mm/khugepaged: replace page_mapcount() check by
folio_likely_mapped_shared()") by removing the possibility for "false
negatives".

(3) MADV_COLD / MADV_PAGEOUT / MADV_FREE will not try splitting
PTE-mapped THPs that are considered shared but not fully covered by
the requested range, consequently not processing them.

PMD-mapped PMD-sized THP are not affected, or when all PTEs are
covered. These functions are usually only called on anon/file folios
that are exclusively mapped most of the time (no other file mappings
or no fork()), so the "false negatives" are not expected to matter in
practice.

(4) mbind() / migrate_pages() / move_pages() will refuse to migrate
shared folios unless MPOL_MF_MOVE_ALL is effective (requires
CAP_SYS_NICE). We will now reject some folios that could be migrated.

Similar to (3), especially with MPOL_MF_MOVE_ALL, so this is not
expected to matter in practice.

Note that cpuset_migrate_mm_workfn() calls do_migrate_pages() with
MPOL_MF_MOVE_ALL.

(5) NUMA hinting

mm/migrate.c:migrate_misplaced_folio_prepare() will skip file
folios that are probably shared libraries (-> "mapped shared" and
executable). This check would have detected it as a shared library at
some point (at least 3 MMs mapping it), so detecting it afterwards
does not sound wrong (still a shared library). Not expected to
matter.

mm/memory.c:numa_migrate_check() will indicate TNF_SHARED in
MAP_SHARED file mappings when encountering a shared folio. Similar
reasoning, not expected to matter.

mm/mprotect.c:change_pte_range() will skip folios detected as
shared in CoW mappings. Similarly, this is not expected to matter in
practice, but if it would ever be a problem we could relax that check
a bit (e.g., basing it on the average page-mapcount in a folio),
because it was only an optimization when many (e.g., 288) processes
were mapping the same folios -- see commit 859d4adc3415 ("mm: numa: do
not trap faults on shared data section pages.")

(6) mm/rmap.c:folio_referenced_one() will skip exclusive swapbacked
folios in dying processes. Applies to anonymous folios only. Without
"false negatives", we'll now skip all actually shared ones. Skipping
ones that are actually exclusive won't really matter, it's a pure
optimization, and is not expected to matter in practice.

In theory, one can detect the problematic scenario: folio_mapcount() > 0
and no folio MM slot is occupied ("state unknown"). One could reset the
MM slots while doing an rmap walk, which migration / folio split already
do when setting everything up. Further, when batching PTEs we might
naturally learn about an owner (e.g., folio_mapcount() == nr_ptes) and
could update the owner. However, we'll defer that until the scenarios
where it would really matter are clear.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Andy Lutomirks^H^Hski <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michal Koutn <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: tejun heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# 845d2be6 03-Mar-2025 David Hildenbrand <[email protected]>

mm: move _entire_mapcount in folio to page[2] on 32bit

Let's free up some space on 32bit in page[1] by moving the _entire_mapcount
to page[2].

Ordinary folios only use the entire mapcount with PMD mappings, so order-1
folios don't apply. Similarly, hugetlb folios are always larger than
order-1, leaving the entire mapcount essentially unused for all order-1
folios. Moving it to page[2] therefore changes nothing for order-1 folios.

On 32bit, simply check in folio_entire_mapcount() whether we have an
order-1 folio, and return 0 in that case.
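
A hedged sketch of that check (close to, but not necessarily identical
with, the upstream helper):

  static inline int folio_entire_mapcount(const struct folio *folio)
  {
          VM_WARN_ON_FOLIO(!folio_test_large(folio), folio);
          /* On 32bit, order-1 folios have no page[2] to hold the counter. */
          if (!IS_ENABLED(CONFIG_64BIT) && unlikely(folio_large_order(folio) == 1))
                  return 0;
          /* stored -1 based, like the other mapcounts */
          return atomic_read(&folio->_entire_mapcount) + 1;
  }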

Note that THPs on 32bit are not particularly common (and we don't care too
much about performance), but we want to keep it working reliably, because
likely we want to use large folios there as well in the future,
independent of PMD leaf support.

Once we dynamically allocate "struct folio", the 32bit specifics will go
away again; even small folios could then have a pincount.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Andy Lutomirks^H^Hski <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michal Koutn <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: tejun heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# 31a31da8 03-Mar-2025 David Hildenbrand <[email protected]>

mm: move _pincount in folio to page[2] on 32bit

Let's free up some space on 32bit in page[1] by moving the _pincount to
page[2].

For order-1 folios (never anon folios!) on 32bit, we will now also use the
GUP_PIN_COUNTING_BIAS approach. A fully-mapped order-1 folio requires 2
references. With GUP_PIN_COUNTING_BIAS being 1024, we'd detect such
folios as "maybe pinned" with 512 full mappings, instead of 1024 for
order-0. As anon folios are out of the picture (which are the most
relevant users of checking for pinnings on *mapped* pages) and we are
talking about 32bit, this is not expected to cause any trouble.
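
A small worked example of that arithmetic (plain C model, not kernel
code; the bias value and the "refcount >= bias" reading follow the text
above):

  #include <stdbool.h>
  #include <stdio.h>

  /* Pins add a large bias to the refcount, and a refcount at or above the
   * bias is read as "maybe pinned". Ordinary references also count, so
   * enough plain mappings can trip the check (a false positive). */
  #define GUP_PIN_COUNTING_BIAS 1024

  static bool maybe_pinned(long refcount)
  {
          return refcount >= GUP_PIN_COUNTING_BIAS;
  }

  int main(void)
  {
          /* A full mapping takes 1 reference for order-0, 2 for order-1. */
          printf("order-0, 512 mappings: %d\n", maybe_pinned(512 * 1));
          printf("order-1, 512 mappings: %d\n", maybe_pinned(512 * 2));
          return 0;
  }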

In __dump_page(), copy one additional folio page if we detect a folio with
an order > 1, so we can dump the pincount on order > 1 folios reliably.

Note that THPs on 32bit are not particularly common (and we don't care too
much about performance), but we want to keep it working reliably, because
likely we want to use large folios there as well in the future,
independent of PMD leaf support.

Once we dynamically allocate "struct folio", fortunately the 32bit
specifics will likely go away again; even small folios could then have a
pincount and folio_has_pincount() would essentially always return "true".

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Andy Lutomirks^H^Hski <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michal Koutn <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: tejun heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# 4996fc54 03-Mar-2025 David Hildenbrand <[email protected]>

mm: let _folio_nr_pages overlay memcg_data in first tail page

Let's free up some more of the "unconditionally available on 64BIT" space
in order-1 folios by letting _folio_nr_pages overlay memcg_data in the
first tail page (second folio page). Consequently, we have the
optimization now whenever we have CONFIG_MEMCG, independent of 64BIT.

We have to make sure that page->memcg on tail pages does not return
"surprises". page_memcg_check() already properly refuses PageTail().
Let's do that earlier in print_page_owner_memcg() to avoid printing wrong
"Slab cache page" information. No other code should touch that field on
tail pages of compound pages.

Reset the "_nr_pages" to 0 when splitting folios, or when freeing them
back to the buddy (to avoid false page->memcg_data "bad page" reports).
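
A plain C model of why that reset matters (illustrative only; the real
struct layout is more involved):

  #include <stdio.h>

  /* The same word in the first tail page is bookkeeping while the page
   * belongs to a large folio, and memcg_data once it is an order-0 page
   * again; a stale value left behind by a split is misread as a memcg
   * pointer. */
  struct first_tail_word {
          union {
                  unsigned long nr_pages;    /* while part of a large folio */
                  unsigned long memcg_data;  /* once it is an order-0 page  */
          };
  };

  int main(void)
  {
          struct first_tail_word w = { .nr_pages = 512 };

          printf("split without reset: memcg_data = %#lx\n", w.memcg_data);
          w.nr_pages = 0;                    /* the reset described above */
          printf("split with reset:    memcg_data = %#lx\n", w.memcg_data);
          return 0;
  }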

Note that in __split_huge_page(), folio_nr_pages() would stop working
already as soon as we start messing with the subpages.

Most kernel configs should have at least CONFIG_MEMCG enabled, even if
disabled at runtime. 64-byte memmap entries ("struct page") are what we
usually have on 64BIT.

While at it, rename "_folio_nr_pages" to "_nr_pages".

Hopefully memdescs / dynamically allocating "struct folio" in the future
will further clean this up, e.g., making _nr_pages available in all
configs and maybe even in small folios. Doing that should be fairly easy
on top of this change.

[[email protected]: make "make htmldoc" happy]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Cc: Andy Lutomirks^H^Hski <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michal Koutn <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: tejun heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# 1ea5212a 03-Mar-2025 David Hildenbrand <[email protected]>

mm: factor out large folio handling from folio_nr_pages() into folio_large_nr_pages()

Let's factor it out into a simple helper function. This helper will also
come in handy when working with code where we know that our folio is
large.

While at it, let's consistently return a "long" value from all these
similar functions. Note that we cannot use "unsigned int" (even though
_folio_nr_pages is of that type), because it would break some callers that
do stuff like "-folio_nr_pages()". Both "int" or "unsigned long" would
work as well.

Maybe in the future we'll have the nr_pages readily available for all
large folios, maybe even for small folios, or maybe for none.
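
A hedged sketch of the factored-out helpers (field names follow the
series description; the exact upstream code may differ):

  /* For large folios the order lives in the first tail page; the page
   * count is derived from it. */
  static inline unsigned int folio_large_order(const struct folio *folio)
  {
          return folio->_flags_1 & 0xff;
  }

  static inline long folio_large_nr_pages(const struct folio *folio)
  {
          return 1L << folio_large_order(folio);
  }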

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Cc: Andy Lutomirks^H^Hski <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michal Koutn <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: tejun heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# 6220ea55 03-Mar-2025 David Hildenbrand <[email protected]>

mm: factor out large folio handling from folio_order() into folio_large_order()

Patch series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT", v3.

Let's add an "easy" way to decide -- without false positives, without
page-mapcounts and without page table/rmap scanning -- whether a large
folio is "certainly mapped exclusively" into a single MM, or whether it
"maybe mapped shared" into multiple MMs.

Use that information to implement Copy-on-Write reuse, to convert
folio_likely_mapped_shared() to folio_maybe_mapped_shared(), and to
introduce a kernel config option that lets us not use+maintain per-page
mapcounts in large folios anymore.

The bigger picture was presented at LSF/MM [1].

This series is effectively a follow-up on my early work [2], which
implemented a more precise, but also more complicated, way to identify
whether a large folio is "mapped shared" into multiple MMs or "mapped
exclusively" into a single MM.


1 Patch Organization
====================

Patch #1 -> #6: make more room in order-1 folios, so we have two
"unsigned long" available for our purposes

Patch #7 -> #11: preparations

Patch #12: MM owner tracking for large folios

Patch #13: COW reuse for PTE-mapped anon THP

Patch #14: folio_maybe_mapped_shared()

Patch #15 -> #20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT


2 MM owner tracking
===================

We assign each MM a unique ID ("MM ID"), to be able to squeeze more
information in our folios. On 32bit we use 15-bit IDs, on 64bit we use
31-bit IDs.

For each large folio, we now store two MM-ID+mapcount ("slot")
combinations:
* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit
mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for
both slots.

Paired with the large mapcount, we can reliably identify whether one of
these MMs is the current owner (-> owns all mappings) or even holds all
folio references (-> owns all mappings, and all references are from
mappings).

As long as only two MMs map folio pages at a time, we can reliably and
precisely identify whether a large folio is "mapped shared" or "mapped
exclusively".

Any additional MM that starts mapping the folio while there are no free
slots becomes an "untracked MM". If one such "untracked MM" is the last
one mapping a folio exclusively, we will not detect the folio as "mapped
exclusively" but instead as "maybe mapped shared". (exception: only a
single mapping remains)

So that's where the approach gets imprecise.
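
A plain C model of the two-slot scheme (illustrative only; the kernel
packs the IDs, mapcounts and lock bit far more compactly):

  #include <stdbool.h>
  #include <stdio.h>

  /* Each large folio remembers up to two (mm id, per-MM mapcount) pairs
   * plus a total ("large") mapcount; any further MM stays untracked. */
  #define MM_ID_NONE 0

  struct folio_model {
          long large_mapcount;
          int  mm_id[2];
          long mm_mapcount[2];
  };

  static void track_map(struct folio_model *f, int mm_id)
  {
          f->large_mapcount++;
          for (int i = 0; i < 2; i++) {
                  if (f->mm_id[i] == mm_id || f->mm_id[i] == MM_ID_NONE) {
                          f->mm_id[i] = mm_id;
                          f->mm_mapcount[i]++;
                          return;
                  }
          }
          /* no free slot: this MM remains untracked */
  }

  static bool maybe_mapped_shared(const struct folio_model *f)
  {
          for (int i = 0; i < 2; i++)
                  if (f->mm_id[i] != MM_ID_NONE &&
                      f->mm_mapcount[i] == f->large_mapcount)
                          return false;  /* one tracked MM owns every mapping */
          return true;
  }

  int main(void)
  {
          struct folio_model f = { 0 };

          track_map(&f, 1);
          track_map(&f, 1);
          printf("one MM, two mappings  -> maybe shared? %d\n",
                 maybe_mapped_shared(&f));
          track_map(&f, 2);
          printf("second MM maps a page -> maybe shared? %d\n",
                 maybe_mapped_shared(&f));
          return 0;
  }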

For now, we use a bit-spinlock to sync the large mapcount + slots, and
make sure we do keep the machinery fast, to not degrade (un)map
performance drastically: for example, we make sure to only use a single
atomic (when grabbing the bit-spinlock), like we would already perform
when updating the large mapcount.


3 CONFIG_NO_PAGE_MAPCOUNT
=========================

patch #15 -> #20 spell out and document what exactly is affected when not
maintaining the per-page mapcounts in large folios anymore.

Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore
when (un)mapping pages, we'll account a complete folio as mapped if a
single page is mapped. In addition, we'll not detect partially mapped
anonymous folios as such in all cases yet.

Likely less relevant changes include that we might now under-estimate the
USS (Unique Set Size) of a process, but never over-estimate it.

The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to
then slowly make it the only option, as we learn about real-life impacts
and possible ways to mitigate them.


4 Performance
=============

Detailed performance numbers were included in v1 [3], and not that much
changed between v1 and v2.

I did plenty of measurements on different systems in the meantime, that
all revealed slightly different results.

The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code
layout changes on some systems. Especially the fork() benchmark started
being more-shaky-than-before on recent kernels for some reason.

In summary, with my micro-benchmarks:

* Small folios are not impacted.

* CoW performance seems to be mostly unchanged across all folios sizes.

* CoW reuse performance of large folios now matches CoW reuse
performance of small folios, because we now actually implement the CoW
reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65%
reduction in runtime, on an arm64 system I measured ~54% reduction.

* munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw
double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and
up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the
folios, the larger the performance improvement.

* munmap() performance degrades very slightly (a couple of percent) without
CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there
seems to be no change at all.

* fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw
double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and
up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the
folios, the larger the performance improvement.

* While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be
almost unchanged on some systems, I saw some degradation for smaller
folios on the AmpereOne A192-32X. I did not investigate the details
yet, but I suspect code layout changes or suboptimal code placement /
inlining.

I'm not too worried about the fork() micro-benchmarks for smaller folios
given how shaky the results are lately and by how much we improved fork()
performance recently.

I also ran case-anon-cow-rand and case-anon-cow-seq part of
vm-scalability, to assess the scalability and the impact of the
bit-spinlock. My measurements on a 2-socket 10-core Intel Xeon Silver
4210R CPU revealed no significant changes.

Similarly, running these benchmarks with 2 MiB THPs enabled on the
AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev,
which is nice.

So far, I did not get my hands on a similarly large system with multiple
sockets.

I found no other fitting scalability benchmarks that seem to really hammer
on concurrent mapping/unmapping of large folio pages like
case-anon-cow-seq does.


5 Concerns
==========

5.1 Bit spinlock
----------------

I'm not quite happy about the bit-spinlock, but so far it does not seem to
affect scalability in my measurements.

If it ever becomes a problem we could either investigate improving the
locking, or simply stopping the MM tracking once there are "too many
mappings" and simply assume that the folio is "mapped shared" until it was
freed.

This would be similar (but slightly different) to the "0,1,2,stopped"
counting idea Willy had at some point. Adding that logic to "stop
tracking" adds more code to the hot path, so I avoided that for now.


5.2 folio_maybe_mapped_shared()
-------------------------------

I documented the change from folio_likely_mapped_shared() to
folio_maybe_mapped_shared() quite extensively. If we run into surprises,
I have some ideas on how to resolve them. For now, I think we should be
fine.


5.3 Added code to map/unmap hot path
------------------------------------

So far, it looks like the added code on the rmap hot path does not really
seem to matter much in the bigger picture. I'd like to further reduce it
(and possibly improve fork() performance further), but I don't easily see
how right now. Well, and I am out of puff.

Having that said, alternatives I considered (e.g., per-MM per-folio
mapcount) would add a lot more overhead to these hot paths.


6 Future Work
=============

6.1 Large mapcount
------------------

It would be very handy if the large mapcount would count how often folio
pages are actually mapped into page tables: a PMD on x86-64 would count
512 times. Calculating the average per-page mapcount will be easy, and
remapping (PMD->PTE) folios would get even faster.

That would also remove the need for the entire mapcount (except for
PMD-sized folios for memory statistics reasons ...), and allow for mapping
folios larger than PMDs (e.g., 4 MiB) easily.

We likely would also have to take the same number of folio references to
make our folio_mapcount() == folio_ref_count() work, and we'd want to be
able to avoid mapcount+refcount overflows: this could already become an
issue with pte-mapped PUD-sized folios (fsdax).

One approach we discussed in the THP cabal meeting is (1) extending the
mapcount for large folios to 64bit (at least on 64bit systems) and (2)
keeping the refcount at 32bit, but (3) having exactly one reference if
the mapcount != 0.

It should be doable, but there are some corner cases to consider on the
unmap path; it is something that I will be looking into next.


6.2 hugetlb
-----------

I'd love to make use of the same tracking also for hugetlb.

The real problem is PMD table sharing: getting a page mapped by MM X and
unmapped by MM Y will not work. With mshare, that problem should not
exist (all mapping/unmapping will be routed through the mshare MM).

[1] https://lwn.net/Articles/974223/
[2] https://lore.kernel.org/linux-mm/[email protected]/T/
[3] https://lkml.kernel.org/r/[email protected]
[4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c


This patch (of 20):

Let's factor it out into a simple helper function. This helper will also
come in handy when working with code where we know that our folio is
large.

Maybe in the future we'll have the order readily available for small and
large folios; in that case, folio_large_order() would simply translate to
folio_order().

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Lance Yang <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirks^H^Hski <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michal Koutn <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: tejun heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# 38607c62 28-Feb-2025 Alistair Popple <[email protected]>

fs/dax: properly refcount fs dax pages

Currently fs dax pages are considered free when the refcount drops to one
and their refcounts are not increased when mapped via PTEs or decreased
when unmapped. This requires special logic in mm paths to detect that
these pages should not be properly refcounted, and to detect when the
refcount drops to one instead of zero.

On the other hand get_user_pages(), etc. will properly refcount fs dax
pages by taking a reference and dropping it when the page is unpinned.

Tracking this special behaviour requires extra PTE bits (eg. pte_devmap)
and introduces rules that are potentially confusing and specific to FS DAX
pages. To fix this, and to possibly allow removal of the special PTE bits
in future, convert the fs dax page refcounts to be zero based and instead
take a reference on the page each time it is mapped as is currently the
case for normal pages.

This may also allow a future clean-up to remove the pgmap refcounting that
is currently done in mm/gup.c.

Link: https://lkml.kernel.org/r/c7d886ad7468a20452ef6e0ddab6cfe220874e7c.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Tested-by: Alison Schofield <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Asahi Lina <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Chunyan Zhang <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: linmiaohe <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michael "Camp Drill Sergeant" Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Ted Ts'o <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# e5cb2325 28-Feb-2025 Alistair Popple <[email protected]>

mm/gup: don't allow FOLL_LONGTERM pinning of FS DAX pages

Longterm pinning of FS DAX pages should already be disallowed by various
pXX_devmap checks. However a future change will cause these checks to be
invalid for FS DAX pages so make folio_is_longterm_pinnable() return false
for FS DAX pages.
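
A hedged sketch of the added check (the fsdax test helper name and the
surrounding structure are assumptions; the other existing checks are
elided):

  /* FS DAX folios must never be pinned long-term: a longterm pin could
   * block truncate/hole-punch on the backing filesystem indefinitely. */
  static inline bool folio_is_longterm_pinnable(struct folio *folio)
  {
          if (folio_is_fsdax(folio))      /* assumed helper name */
                  return false;

          /* (existing device-coherent, CMA and zone checks go here) */
          return true;
  }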

Link: https://lkml.kernel.org/r/250a31876704b79f7c65b159f3c835e547f052df.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Tested-by: Alison Schofield <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Asahi Lina <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Chunyan Zhang <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: linmiaohe <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michael "Camp Drill Sergeant" Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Ted Ts'o <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# ec2e0cc6 28-Feb-2025 Alistair Popple <[email protected]>

mm/memory: add vmf_insert_page_mkwrite()

Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
creates a special devmap PTE entry for the pfn but does not take a
reference on the underlying struct page for the mapping. This is because
DAX page refcounts are treated specially, as indicated by the presence of
a devmap entry.

To allow DAX page refcounts to be managed the same as normal page
refcounts introduce vmf_insert_page_mkwrite(). This will take a reference
on the underlying page much the same as vmf_insert_page, except it also
permits upgrading an existing mapping to be writable if
requested/possible.
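
A hedged usage sketch from a fault handler's perspective (the surrounding
handler is illustrative; the signature follows the description above):

  /* Insert a refcounted DAX page into the faulting VMA, upgrading an
   * existing read-only mapping to writable on a write fault. */
  static vm_fault_t example_dax_insert(struct vm_fault *vmf, struct page *page)
  {
          bool write = vmf->flags & FAULT_FLAG_WRITE;

          return vmf_insert_page_mkwrite(vmf, page, write);
  }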

Link: https://lkml.kernel.org/r/4ce3aa984c060f370105e0bfef1035869578be47.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Tested-by: Alison Schofield <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Asahi Lina <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Chunyan Zhang <[email protected]>
Cc: Dan Wiliams <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: linmiaohe <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michael "Camp Drill Sergeant" Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Ted Ts'o <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# 9eb6207b 28-Feb-2025 Frank van der Linden <[email protected]>

mm/sparse: add vmemmap_*_hvo functions

Add a few functions to enable early HVO:

vmemmap_populate_hvo
vmemmap_undo_hvo
vmemmap_wrprotect_hvo

The populate and undo functions are expected to be used in early init,
from the sparse_init_nid_early() function. The wrprotect function is to
be used, potentially, later.

To implement these functions, mostly re-use the existing compound pages
vmemmap logic used by DAX. vmemmap_populate_address has its argument
changed a bit in this commit: the page structure passed in to be reused in
the mapping is replaced by a PFN and a flag. The flag indicates whether
an extra ref should be taken on the vmemmap page containing the head page
structure. Taking the ref is appropriate for DAX / ZONE_DEVICE, but
not for HugeTLB HVO.

The HugeTLB vmemmap optimization maps tail page structure pages read-only.
The vmemmap_wrprotect_hvo function that does this is implemented
separately, because it cannot be guaranteed that reserved page structures
will not be write accessed during memory initialization. Even with
CONFIG_DEFERRED_STRUCT_PAGE_INIT, they might still be written to (if they
are at the bottom of a zone). So, vmemmap_populate_hvo leaves the tail
page structure pages RW initially, and then later during initialization,
after memmap init is fully done, vmemmap_wrprotect_hvo must be called to
finish the job.

Subsequent commits will use these functions for early HugeTLB HVO.
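
The prototypes below are assumptions for illustration (the real ones live in the sparse-vmemmap code and may differ); they show when each entry point is meant to run:

    /* Assumed declarations, for illustration only. */

    /* Early init, from sparse_init_nid_early(): map the memmap for the range
     * in the HugeTLB-vmemmap-optimized layout right away. */
    int vmemmap_populate_hvo(unsigned long start, unsigned long end,
                             int node, unsigned long headsize);

    /* Early init: revert a range back to the regular vmemmap layout, e.g. if
     * a pre-HVO'd bootmem page does not end up being used as a hugetlb page. */
    int vmemmap_undo_hvo(unsigned long start, unsigned long end,
                         int node, unsigned long headsize);

    /* Much later, only after memmap init has fully completed: map the tail
     * page structure pages read-only, since they may still be written to
     * during zone initialization. */
    void vmemmap_wrprotect_hvo(unsigned long start, unsigned long end,
                               int node, unsigned long headsize);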

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Frank van der Linden <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Roman Gushchin (Cruise) <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# d65917c4 28-Feb-2025 Frank van der Linden <[email protected]>

mm/sparse: allow for alternate vmemmap section init at boot

Add functions that are called just before the per-section memmap is
initialized and just before the memmap page structures are initialized.
They are called sparse_vmemmap_init_nid_early and
sparse_vmemmap_init_nid_late, respectively.

This allows for mm subsystems to add calls to initialize memmap and page
structures in a specific way, if using SPARSEMEM_VMEMMAP. Specifically,
hugetlb can pre-HVO bootmem allocated pages that way, so that no time and
resources are wasted on allocating vmemmap pages, only to free them later
(and possibly unnecessarily running the system out of memory in the
process).

Refactor some code and export a few convenience functions for external
use.

In sparse_init_nid, skip any sections that are already initialized, e.g.
they have been initialized by sparse_vmemmap_init_nid_early already.

The hugetlb code to use these functions will be added in a later commit.

Export section_map_size, as any alternate memmap init code will want to
use it.

The internal config option to enable this is SPARSEMEM_VMEMMAP_PREINIT,
which is selected if an architecture-specific option,
ARCH_WANT_HUGETLB_VMEMMAP_PREINIT, is set. In the future, if other
subsystems want to do preinit too, they can do it in a similar fashion.

The internal config option is there because a section flag is used, and
the number of flags available is architecture-dependent (see mmzone.h).
Architectures can decide if there is room for the flag when enabling
options that select SPARSEMEM_VMEMMAP_PREINIT.

Fortunately, as of right now, all sparse vmemmap using architectures do
have room.
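
A rough sketch of where the two hooks sit in per-node boot, assuming they take a node id; the wrapper name and exact placement are illustrative, not the real sparse_init_nid() code:

    static void __init sparse_init_nid_sketch(int nid)
    {
        /*
         * Hook 1: runs before the per-section memmap is set up, so a
         * subsystem (e.g. hugetlb pre-HVO) can install its own memmap for
         * some sections; sparse_init_nid() then skips sections that are
         * already initialized.
         */
        sparse_vmemmap_init_nid_early(nid);

        /* ... regular per-section memmap setup for the node ... */

        /*
         * Hook 2: runs just before the memmap page structures are
         * initialized, for work that has to wait until that point.
         */
        sparse_vmemmap_init_nid_late(nid);
    }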

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Frank van der Linden <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Roman Gushchin (Cruise) <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# 243a75e2 28-Feb-2025 Frank van der Linden <[email protected]>

mm/bootmem_info: export register_page_bootmem_memmap

If other mm code wants to use this function for early memmap initialization
(on the platforms that have it), it should be made available properly, not
just unconditionally in mm.h.

Make this function available for such cases.
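
For reference, the declaration being exported looks like this; the config guard shown is an assumption about how "made available properly" is implemented:

    /* Sketch: expose the declaration only where the facility exists, rather
     * than unconditionally in mm.h. The guard below is an assumption. */
    #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
    void register_page_bootmem_memmap(unsigned long section_nr,
                                      struct page *map, unsigned long nr_pages);
    #endif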

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Frank van der Linden <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Roman Gushchin (Cruise) <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v6.14-rc3
# 31041385 13-Feb-2025 Suren Baghdasaryan <[email protected]>

mm: make vma cache SLAB_TYPESAFE_BY_RCU

To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
object reuse before RCU grace period is over will be detected by
lock_vma_under_rcu().

Current checks are sufficient as long as vma is detached before it is
freed. The only place this is not currently happening is in exit_mmap().
Add the missing vma_mark_detached() in exit_mmap().

Another issue which might trick lock_vma_under_rcu() during vma reuse is
vm_area_dup(), which copies the entire content of the vma into a new one,
overriding new vma's vm_refcnt and temporarily making it appear as
attached. This might trick a racing lock_vma_under_rcu() to operate on a
reused vma if it found the vma before it got reused. To prevent this
situation, we should ensure that vm_refcnt stays at detached state (0)
when it is copied and advances to attached state only after it is added
into the vma tree. Introduce vm_area_init_from() which preserves new
vma's vm_refcnt and use it in vm_area_dup(). Since all vmas are in
detached state with no current readers when they are freed,
lock_vma_under_rcu() will not be able to take vm_refcnt after vma got
detached even if vma is reused. vma_mark_attached() is modified to
include a release fence to ensure all stores to the vma happen before
vm_refcnt gets initialized.

Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
vm_area_struct reuse and will minimize the number of call_rcu() calls.
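
A minimal sketch of the vm_area_init_from() idea; the fields copied below are a small illustrative subset, and the key point is that vm_refcnt is never touched:

    /* Sketch: duplicate a vma without copying vm_refcnt, so the copy stays
     * in detached state (refcount 0) until vma_mark_attached() publishes it. */
    static void vm_area_init_from_sketch(const struct vm_area_struct *src,
                                         struct vm_area_struct *dest)
    {
        dest->vm_mm    = src->vm_mm;
        dest->vm_start = src->vm_start;
        dest->vm_end   = src->vm_end;
        /* ... remaining members copied the same way ... */

        /*
         * vm_refcnt is deliberately left alone: a racing lock_vma_under_rcu()
         * cannot mistake this vma for an attached one until it is added to
         * the vma tree and vma_mark_attached() sets the refcount.
         */
    }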

[[email protected]: remove atomic_set_release() usage in tools/]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Shivank Garg <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Howells <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# e49510bf 13-Feb-2025 Suren Baghdasaryan <[email protected]>

mm: prepare lock_vma_under_rcu() for vma reuse possibility

Once we make vma cache SLAB_TYPESAFE_BY_RCU, it will be possible for a vma
to be reused and attached to another mm after lock_vma_under_rcu() locks
the vma. lock_vma_under_rcu() should ensure that vma_start_read() is
using the original mm and after locking the vma it should ensure that
vma->vm_mm has not changed from under us.
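
A hedged sketch of the extra check; find_vma_under_rcu(), vma_read_trylock() and vma_read_unlock() are hypothetical placeholders standing in for the real lookup and per-vma lock helpers:

    static struct vm_area_struct *lookup_sketch(struct mm_struct *mm,
                                                unsigned long address)
    {
        struct vm_area_struct *vma = find_vma_under_rcu(mm, address);

        if (!vma || !vma_read_trylock(vma))
            return NULL;

        /*
         * The vma may have been freed and reused for another mm between the
         * lookup and taking the lock; detect that and bail out.
         */
        if (unlikely(vma->vm_mm != mm)) {
            vma_read_unlock(vma);
            return NULL;
        }
        return vma;
    }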

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Shivank Garg <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Howells <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# e218d9fe 13-Feb-2025 Suren Baghdasaryan <[email protected]>

mm: remove extra vma_numab_state_init() call

vma_init() already memsets the whole vm_area_struct to 0, so there is no
need for an additional vma_numab_state_init().
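
For context, a hedged sketch of why the call is redundant (simplified, not the exact vma_init() body):

    static void vma_init_sketch(struct vm_area_struct *vma, struct mm_struct *mm)
    {
        /* Zeroing the whole struct already leaves the NUMA balancing state
         * (numab_state) NULL, which is all vma_numab_state_init() would do. */
        memset(vma, 0, sizeof(*vma));
        vma->vm_mm = mm;
        /* ... other field setup; no separate vma_numab_state_init() needed. */
    }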

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Tested-by: Shivank Garg <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Howells <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# f35ab95c 13-Feb-2025 Suren Baghdasaryan <[email protected]>

mm: replace vm_lock and detached flag with a reference count

rw_semaphore is a sizable structure of 40 bytes and consumes considerable
space for each vm_area_struct. However vma_lock has two important
specifics which can be used to replace rw_semaphore with a simpler
structure:

1. Readers never wait. They try to take the vma_lock and fall back to
mmap_lock if that fails.

2. Only one writer at a time will ever try to write-lock a vma_lock
because writers first take mmap_lock in write mode.

Because of these requirements, full rw_semaphore functionality is not
needed and we can replace rw_semaphore and the vma->detached flag with a
refcount (vm_refcnt).

When vma is in detached state, vm_refcnt is 0 and only a call to
vma_mark_attached() can take it out of this state. Note that unlike
before, now we enforce both vma_mark_attached() and vma_mark_detached() to
be done only after vma has been write-locked. vma_mark_attached() changes
vm_refcnt to 1 to indicate that it has been attached to the vma tree.
When a reader takes read lock, it increments vm_refcnt, unless the top
usable bit of vm_refcnt (0x40000000) is set, indicating presence of a
writer. When writer takes write lock, it sets the top usable bit to
indicate its presence. If there are readers, writer will wait using newly
introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
mode first, there can be only one writer at a time. The last reader to
release the lock will signal the writer to wake up. The refcount might
overflow if there are many competing readers, in which case read-locking
will fail. Readers are expected to handle such failures.

In summary:
1. all readers increment the vm_refcnt;
2. writer sets top usable (writer) bit of vm_refcnt;
3. readers cannot increment the vm_refcnt if the writer bit is set;
4. in the presence of readers, writer must wait for the vm_refcnt to drop
to 1 (plus the VMA_LOCK_OFFSET writer bit), indicating an attached vma
with no readers;
5. vm_refcnt overflow is handled by the readers.

While this vm_lock replacement does not yet result in a smaller
vm_area_struct (it stays at 256 bytes due to cacheline alignment), it
allows for further size optimization by structure member regrouping to
bring the size of vm_area_struct below 192 bytes.
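
A minimal sketch of the reader-side rule, written against a plain atomic_t so it stays self-contained (the real vm_refcnt uses the kernel refcount machinery, and the exact overflow cutoff here is an assumption):

    #define VMA_LOCK_OFFSET 0x40000000  /* writer-present bit, from the text */

    static bool vma_read_trylock_sketch(atomic_t *vm_refcnt)
    {
        int old = atomic_read(vm_refcnt);

        do {
            if (old == 0)                   /* detached vma */
                return false;
            if (old & VMA_LOCK_OFFSET)      /* writer present */
                return false;
            if (old > VMA_LOCK_OFFSET / 2)  /* treat near-overflow as failure */
                return false;
        } while (!atomic_try_cmpxchg_acquire(vm_refcnt, &old, old + 1));

        return true;
    }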

[[email protected]: fix a crash due to vma_end_read() that should have been removed]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Tested-by: Shivank Garg <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Howells <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# 45ad9f52 13-Feb-2025 Suren Baghdasaryan <[email protected]>

mm: uninline the main body of vma_start_write()

vma_start_write() is used in many places and will grow in size very soon.
It is not used in performance critical paths and uninlining it should
limit the future code size growth. No functional changes.
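
The uninlining follows the usual pattern sketched below; the helper names (__vma_start_write(), __is_vma_write_locked()) are assumptions about how the split is made:

    /* In the header: only a thin wrapper stays inline. */
    void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq);

    static inline void vma_start_write(struct vm_area_struct *vma)
    {
        unsigned int mm_lock_seq;

        /* Cheap "already write-locked?" check stays inline... */
        if (__is_vma_write_locked(vma, &mm_lock_seq))
            return;

        /* ...while the growing slow path moves out of line into mm/. */
        __vma_start_write(vma, mm_lock_seq);
    }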

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Tested-by: Shivank Garg <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Howells <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# 7440adb4 13-Feb-2025 Suren Baghdasaryan <[email protected]>

mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail

With upcoming replacement of vm_lock with vm_refcnt, we need to handle a
possibility of vma_start_read_locked/vma_start_read_locked_nested failing
due to refcount overflow. Prepare for such possibility by changing these
APIs and adjusting their users.
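
A hedged sketch of the caller-side adjustment, assuming the changed helper now returns a bool; the wrapper and its fallback policy below are illustrative:

    static struct vm_area_struct *lock_vma_sketch(struct mm_struct *mm,
                                                  unsigned long address)
    {
        struct vm_area_struct *vma;

        mmap_read_lock(mm);
        vma = find_vma(mm, address);
        /* vma_start_read_locked() can now fail on refcount overflow. */
        if (vma && !vma_start_read_locked(vma))
            vma = NULL;     /* caller falls back, e.g. to mmap_lock */
        mmap_read_unlock(mm);
        return vma;
    }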

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Tested-by: Shivank Garg <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Howells <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


