Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4

# 4b7c0857 | 25-Apr-2025 | Kairui Song <[email protected]>
mm/memory: fix mapcount / refcount sanity check for mTHP reuse
The following WARNING was triggered during swap stress test with mTHP enabled:
[ 6609.335758] ------------[ cut here ]------------
[ 6609.337758] WARNING: CPU: 82 PID: 755116 at mm/memory.c:3794 do_wp_page+0x1084/0x10e0
[ 6609.340922] Modules linked in: zram virtiofs
[ 6609.342699] CPU: 82 UID: 0 PID: 755116 Comm: sh Kdump: loaded Not tainted 6.15.0-rc1+ #1429 PREEMPT(voluntary)
[ 6609.347620] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
[ 6609.349909] RIP: 0010:do_wp_page+0x1084/0x10e0
[ 6609.351532] Code: ff ff 48 c7 c6 80 ba 49 82 4c 89 ef e8 95 fd fe ff 0f 0b bd f5 ff ff ff e9 43 fb ff ff 41 83 a9 bc 12 00 00 01 e9 5c fb ff ff <0f> 0b e9 a6 fc ff ff 65 ff 00 f0 48 0f ba 6d 00 1f 0f 83 82 fc ff
[ 6609.357959] RSP: 0000:ffffc90002273d40 EFLAGS: 00010287
[ 6609.359915] RAX: 000000000000000f RBX: 0000000000000000 RCX: 000fffffffe00000
[ 6609.362606] RDX: 0000000000000010 RSI: 000055a119ac1000 RDI: ffffea000ae6ec00
[ 6609.365143] RBP: ffffea000ae6ec68 R08: 84000002b9bb1025 R09: 000055a119ab6000
[ 6609.367569] R10: ffff8881caa2ad80 R11: 0000000000000000 R12: ffff8881caa2ad80
[ 6609.370255] R13: ffffea000ae6ec00 R14: 000055a119ac1c9c R15: ffffc90002273dd8
[ 6609.373007] FS: 00007f08e467f740(0000) GS:ffff88a07c214000(0000) knlGS:0000000000000000
[ 6609.375999] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6609.377946] CR2: 000055a119ac1c9c CR3: 00000001adfd6005 CR4: 0000000000770eb0
[ 6609.380376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6609.382853] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6609.385216] PKRU: 55555554
[ 6609.386141] Call Trace:
[ 6609.387017]  <TASK>
[ 6609.387718]  ? ___pte_offset_map+0x1b/0x110
[ 6609.389056]  __handle_mm_fault+0xa51/0xf00
[ 6609.390363]  ? exc_page_fault+0x6a/0x140
[ 6609.391629]  handle_mm_fault+0x13d/0x360
[ 6609.392856]  do_user_addr_fault+0x2f2/0x7f0
[ 6609.394160]  ? sigprocmask+0x77/0xa0
[ 6609.395375]  exc_page_fault+0x6a/0x140
[ 6609.396735]  asm_exc_page_fault+0x26/0x30
[ 6609.398224] RIP: 0033:0x55a1050bc18b
[ 6609.399567] Code: 8b 3f 4d 85 ff 74 40 41 39 5f 18 75 f2 49 8b 7f 08 44 38 27 75 e9 4c 89 c6 4c 89 45 c8 e8 bd 83 fa ff 4c 8b 45 c8 85 c0 75 d5 <41> 83 47 1c 01 48 83 c4 28 4c 89 f8 5b 41 5c 41 5d 41 5e 41 5f 5d
[ 6609.405971] RSP: 002b:00007ffcf5f37d90 EFLAGS: 00010246
[ 6609.407737] RAX: 0000000000000000 RBX: 00000000182768fa RCX: 0000000000000000
[ 6609.410151] RDX: 00000000000000fa RSI: 000055a105175c7b RDI: 000055a119ac1c60
[ 6609.412606] RBP: 00007ffcf5f37de0 R08: 000055a105175c7b R09: 0000000000000000
[ 6609.414998] R10: 000000004d2dfb5a R11: 0000000000000246 R12: 0000000000000050
[ 6609.417193] R13: 00000000000000fa R14: 000055a119abaf60 R15: 000055a119ac1c80
[ 6609.419268]  </TASK>
[ 6609.419928] ---[ end trace 0000000000000000 ]---
The WARN_ON here is simply incorrect. The refcount here must be at least the mapcount, not the opposite. Each mapcount must have a corresponding refcount, but the refcount may increase if other components grab the folio, which is acceptable. Meanwhile, having a mapcount larger than refcount is a real problem.
So fix the WARN_ON condition.
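A minimal sketch of the corrected invariant (illustrative only, not the literal diff; the exact macro and its placement in do_wp_page() may differ):

    /*
     * Every mapping pins a reference, so mapcount <= refcount must always
     * hold; extra references (GUP, swap cache, ...) on top are fine.
     */
    VM_WARN_ON_ONCE_FOLIO(folio_mapcount(folio) > folio_ref_count(folio), folio);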
Link: https://lkml.kernel.org/r/[email protected] Fixes: 1da190f4d0a6 ("mm: Copy-on-Write (COW) reuse support for PTE-mapped THP") Signed-off-by: Kairui Song <[email protected]> Reported-by: Kairui Song <[email protected]> Closes: https://lore.kernel.org/all/CAMgjq7D+ea3eg9gRCVvRnto3Sv3_H3WVhupX4e=k8T5QAfBHbw@mail.gmail.com/ Suggested-by: David Hildenbrand <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.15-rc3

# 8bdea2fc | 15-Apr-2025 | David Hildenbrand <[email protected]>
mm/memory: move sanity checks in do_wp_page() after mapcount vs. refcount stabilization
In __folio_remove_rmap() for RMAP_LEVEL_PMD/RMAP_LEVEL_PUD and with CONFIG_PAGE_MAPCOUNT we first decrement the folio mapcount (and recompute mapped shared vs. mapped exclusively) to then adjust the entire mapcount.
This means that another process might stumble in do_wp_page() over a PTE-mapped PMD folio that is indicated as "exclusively mapped", but still has an entire mapcount (PMD mapping), because it is racing with the process that is unmapping the folio (PMD mapping). Note that do_wp_page() will back off once it detects the remaining folio reference from the process that is in the process of unmapping the folio.
This will trigger the early VM_WARN_ON_ONCE(folio_entire_mapcount(folio)) check in do_wp_page(), that can easily be reproduced by looping a couple of times over allocating a PMD THP, forking a child where we immediately unmap it again, and writing in the parent concurrently to the THP.
[ 252.738129][T16470] ------------[ cut here ]------------
[ 252.739267][T16470] WARNING: CPU: 3 PID: 16470 at mm/memory.c:3738 do_wp_page+0x2a75/0x2c00
[ 252.740968][T16470] Modules linked in:
[ 252.741958][T16470] CPU: 3 UID: 0 PID: 16470 Comm: ...
...
[ 252.765841][T16470]  <TASK>
[ 252.766419][T16470]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 252.767558][T16470]  ? rcu_is_watching+0x12/0x60
[ 252.768525][T16470]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 252.769645][T16470]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 252.770778][T16470]  ? lock_acquire+0x33/0x80
[ 252.771697][T16470]  ? __handle_mm_fault+0x5e8/0x3e40
[ 252.772735][T16470]  ? __handle_mm_fault+0x5e8/0x3e40
[ 252.773781][T16470]  __handle_mm_fault+0x1869/0x3e40
[ 252.774839][T16470]  handle_mm_fault+0x22a/0x640
[ 252.775808][T16470]  do_user_addr_fault+0x618/0x1000
[ 252.776847][T16470]  exc_page_fault+0x68/0xd0
[ 252.777775][T16470]  asm_exc_page_fault+0x26/0x30
While we could adjust the sequence in __folio_remove_rmap(), let's rather move the mapcount sanity checks after the mapcount vs. refcount stabilization phase. With this fix, a simple reproducer is happy.
While at it, convert the two VM_WARN_ON_ONCE() we are moving to VM_WARN_ON_ONCE_FOLIO().
Link: https://lkml.kernel.org/r/[email protected] Fixes: 1da190f4d0a6 ("mm: Copy-on-Write (COW) reuse support for PTE-mapped THP") Signed-off-by: David Hildenbrand <[email protected]> Reported-by: [email protected] Closes: https://lkml.kernel.org/r/[email protected] Reviewed-by: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.15-rc2

# a9951993 | 09-Apr-2025 | Kirill A. Shutemov <[email protected]>
mm: fix apply_to_existing_page_range()
In the case of apply_to_existing_page_range(), apply_to_pte_range() is reached with 'create' set to false. When !create, the loop over the PTE page table is broken.
apply_to_pte_range() will only move to the next PTE entry if 'create' is true or if the current entry is not pte_none().
This means that the user of apply_to_existing_page_range() will not have 'fn' called for any entries after the first pte_none() in the PTE page table.
Fix the loop logic in apply_to_pte_range().
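An illustrative sketch of the fixed walk (simplified; locking and the surrounding page-table levels are omitted, and this is not the exact kernel code):

    /*
     * Before the fix, the PTE pointer only advanced when 'fn' ran, so for
     * !create users the walk got stuck on the first pte_none() entry.
     */
    do {
        if (create || !pte_none(ptep_get(pte))) {
            err = fn(pte, addr, data);
            if (err)
                break;
        }
    } while (pte++, addr += PAGE_SIZE, addr != end);  /* always advance */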
There are no known runtime issues from this, but the fix is trivial enough for stable@ even without a known buggy user.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Fixes: be1db4753ee6 ("mm/memory.c: add apply_to_existing_page_range() helper") Cc: Daniel Axtens <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 8c56c5db | 08-Apr-2025 | David Hildenbrand <[email protected]>
mm: (un)track_pfn_copy() fix + doc improvements
We got a late smatch warning and some additional review feedback.
smatch warnings: mm/memory.c:1428 copy_page_range() error: uninitialized symbol 'pfn'.
We actually use the pfn only when it is properly initialized; however, we may pass an uninitialized value to a function -- and even though the callee will not use it, that likely still is UB in C.
So let's just fix it by always initializing pfn in the caller of track_pfn_copy(), and improving the documentation of track_pfn_copy().
While at it, clarify the doc of untrack_pfn_copy(), that internal checks make sure if we actually have to untrack anything.
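The shape of the fix in the caller, roughly (a sketch based on the description above, with error handling trimmed; argument names are assumptions):

    unsigned long pfn = 0;  /* always initialized, even if track_pfn_copy() leaves it untouched */

    if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
        ret = track_pfn_copy(dst_vma, src_vma, &pfn);
        if (ret)
            return ret;
    }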
Link: https://lkml.kernel.org/r/[email protected] Fixes: dc84bc2aba85 ("x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()") Signed-off-by: David Hildenbrand <[email protected]> Reported-by: kernel test robot <[email protected]> Reported-by: Dan Carpenter <[email protected]> Closes: https://lore.kernel.org/r/[email protected]/ Reviewed-by: Lorenzo Stoakes <[email protected]> Acked-by: Ingo Molnar <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Linus Torvalds <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.15-rc1, v6.14

# dc84bc2a | 21-Mar-2025 | David Hildenbrand <[email protected]>
x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
If track_pfn_copy() fails, we already added the dst VMA to the maple tree. As fork() fails, we'll cleanup the maple tree, and stumble over the dst VMA for which we neither performed any reservation nor copied any page tables.
Consequently untrack_pfn() will see VM_PAT and try obtaining the PAT information from the page table -- which fails because the page table was not copied.
The easiest fix would be to simply clear the VM_PAT flag of the dst VMA if track_pfn_copy() fails. However, simply clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy() and performed a reservation, but copying the page tables fails, we'll simply clear the VM_PAT flag, not properly undoing the reservation ... which is also wrong.
So let's fix it properly: set the VM_PAT flag only if the reservation succeeded (leaving it clear initially), and undo the reservation if anything goes wrong while copying the page tables: clearing the VM_PAT flag after undoing the reservation.
Note that any copied page table entries will get zapped when the VMA will get removed later, after copy_page_range() succeeded; as VM_PAT is not set then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be happy. Note that leaving these page tables in place without a reservation is not a problem, as we are aborting fork(); this process will never run.
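The resulting control flow, schematically (a sketch, not the actual diff; copy_page_tables() is a stand-in for the real copy loop, and track_pfn_copy() is assumed to set VM_PAT on the dst VMA only when the reservation succeeds):

    /* the dst VMA starts out without VM_PAT set */
    if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
        ret = track_pfn_copy(dst_vma, src_vma, &pfn);
        if (ret)
            return ret;                 /* nothing reserved, nothing to undo */
    }

    ret = copy_page_tables();           /* stand-in for the real copy loop */
    if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
        untrack_pfn_copy(dst_vma, pfn); /* undo the reservation, clear VM_PAT */
    return ret;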
A reproducer can trigger this usually at the first try:
https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c
WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110
Modules linked in: ...
CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
RIP: 0010:get_pat_info+0xf6/0x110
...
Call Trace:
 <TASK>
 ...
 untrack_pfn+0x52/0x110
 unmap_single_vma+0xa6/0xe0
 unmap_vmas+0x105/0x1f0
 exit_mmap+0xf6/0x460
 __mmput+0x4b/0x120
 copy_process+0x1bf6/0x2aa0
 kernel_clone+0xab/0x440
 __do_sys_clone+0x66/0x90
 do_syscall_64+0x95/0x180
Likely this case was missed in:
d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")
... and instead of undoing the reservation we simply cleared the VM_PAT flag.
Keep the documentation of these functions in include/linux/pgtable.h, one place is more than sufficient -- we should clean that up for the other functions like track_pfn_remap/untrack_pfn separately.
Fixes: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed") Fixes: 2ab640379a0a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3") Reported-by: xingwei lee <[email protected]> Reported-by: yuxin wang <[email protected]> Reported-by: Marius Fleischer <[email protected]> Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Closes: https://lore.kernel.org/lkml/CABOYnLx_dnqzpCW99G81DmOr+2UzdmZMk=T3uxwNxwz+R1RAwg@mail.gmail.com/ Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/
Revision tags: v6.14-rc7

# e120d1bc | 13-Mar-2025 | Mike Rapoport (Microsoft) <[email protected]>
arch, mm: set high_memory in free_area_init()
high_memory defines the upper bound on directly mapped memory. This bound is defined by the beginning of ZONE_HIGHMEM when a system has high memory and by the end of memory otherwise.
All this is known to generic memory management initialization code that can set high_memory while initializing core mm structures.
Add a generic calculation of high_memory to free_area_init() and remove per-architecture calculation except for the architectures that set and use high_memory earlier than that.
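Conceptually, the generic calculation boils down to something like this (a simplified sketch; helper and variable names such as zone_highmem_start_pfn are assumptions, and the real code leaves architectures alone that set high_memory earlier):

    phys_addr_t phys_end = memblock_end_of_DRAM();

    #ifdef CONFIG_HIGHMEM
    /* with highmem, the direct map ends where ZONE_HIGHMEM begins */
    phys_end = min(phys_end, PFN_PHYS(zone_highmem_start_pfn));
    #endif
    if (!high_memory)   /* don't override an arch that set it early */
        high_memory = phys_to_virt(phys_end - 1) + 1;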
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]> Acked-by: Dave Hansen <[email protected]> [x86] Tested-by: Mark Brown <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David S. Miller <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Guo Ren (csky) <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: Johannes Berg <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Madhavan Srinivasan <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Simek <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russel King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleinxer <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 8268af30 | 13-Mar-2025 | Mike Rapoport (Microsoft) <[email protected]>
arch, mm: set max_mapnr when allocating memory map for FLATMEM
max_mapnr is essentially the size of the memory map for systems that use FLATMEM. There is no reason to calculate it in each and every architecture when it's anyway calculated in alloc_node_mem_map().
Drop setting of max_mapnr from architecture code and set it once in alloc_node_mem_map().
While at it, move the definitions of mem_map and max_mapnr to mm/mm_init.c so there won't be two copies for the MMU and !MMU variants.
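A sketch of the single place max_mapnr is now set (illustrative; the names of the local bounds in alloc_node_mem_map() are assumptions):

    /* FLATMEM: the memory map spans [start_pfn, end_pfn), so set it once here */
    mem_map = map;
    max_mapnr = end_pfn - start_pfn;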
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]> Acked-by: Dave Hansen <[email protected]> [x86] Tested-by: Mark Brown <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David S. Miller <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Guo Ren (csky) <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: Johannes Berg <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Madhavan Srinivasan <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Simek <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russel King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleinxer <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.14-rc6

# 003fde44 | 03-Mar-2025 | David Hildenbrand <[email protected]>
mm: convert folio_likely_mapped_shared() to folio_maybe_mapped_shared()
Let's reuse our new MM ownership tracking infrastructure for large folios to make folio_likely_mapped_shared() never return false negatives -- never indicating "not mapped shared" although the folio *is* mapped shared. With that, we can rename it to folio_maybe_mapped_shared() and get rid of the dependency on the mapcount of the first folio page.
The semantics are now arguably clearer: no mixture of "false negatives" and "false positives", only the remaining possibility for "false positives".
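For callers, the resulting contract can be illustrated like this (usage sketch, not taken from the patch; be_conservative() is a hypothetical placeholder):

    if (folio_maybe_mapped_shared(folio)) {
        /*
         * Possibly a false positive (the folio *was* mapped by more than
         * one MM at some point), but never a false negative: a 'false'
         * return means the folio is mapped by a single MM only.
         */
        return be_conservative();
    }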
Thoroughly document the new semantics. We might now detect that a large folio is "maybe mapped shared" although it *no longer* is -- but once was. Now, if more than two MMs mapped a folio at the same time, and the MM mapping the folio exclusively at the end is not one tracked in the two folio MM slots, we will detect the folio as "maybe mapped shared".
For anonymous folios, usually (except weird corner cases) all PTEs that target a "maybe mapped shared" folio are R/O. As soon as a child process would write to them (iow, actively use them), we would CoW and effectively replace these PTEs. Most cases (below) are not expected to really matter with large anonymous folios for this reason.
Most importantly, there will be no change at all for:
 * small folios
 * hugetlb folios
 * PMD-mapped PMD-sized THPs (single mapping)
This change has the potential to affect existing callers of folio_likely_mapped_shared() -> folio_maybe_mapped_shared():
(1) fs/proc/task_mmu.c: no change (hugetlb)
(2) khugepaged counts PTEs that target shared folios towards max_ptes_shared (default: HPAGE_PMD_NR / 2), meaning we could skip a collapse where we would have previously collapsed. This only applies to anonymous folios and is not expected to matter in practice.
Worth noting that this change sorts out case (A) documented in commit 1bafe96e89f0 ("mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()") by removing the possibility for "false negatives".
(3) MADV_COLD / MADV_PAGEOUT / MADV_FREE will not try splitting PTE-mapped THPs that are considered shared but not fully covered by the requested range, consequently not processing them.
PMD-mapped PMD-sized THP are not affected, or when all PTEs are covered. These functions are usually only called on anon/file folios that are exclusively mapped most of the time (no other file mappings or no fork()), so the "false negatives" are not expected to matter in practice.
(4) mbind() / migrate_pages() / move_pages() will refuse to migrate shared folios unless MPOL_MF_MOVE_ALL is effective (requires CAP_SYS_NICE). We will now reject some folios that could be migrated.
Similar to (3), especially with MPOL_MF_MOVE_ALL, so this is not expected to matter in practice.
Note that cpuset_migrate_mm_workfn() calls do_migrate_pages() with MPOL_MF_MOVE_ALL.
(5) NUMA hinting
mm/migrate.c:migrate_misplaced_folio_prepare() will skip file folios that are probably shared libraries (-> "mapped shared" and executable). This check would have detected it as a shared library at some point (at least 3 MMs mapping it), so detecting it afterwards does not sound wrong (still a shared library). Not expected to matter.
mm/memory.c:numa_migrate_check() will indicate TNF_SHARED in MAP_SHARED file mappings when encountering a shared folio. Similar reasoning, not expected to matter.
mm/mprotect.c:change_pte_range() will skip folios detected as shared in CoW mappings. Similarly, this is not expected to matter in practice, but if it would ever be a problem we could relax that check a bit (e.g., basing it on the average page-mapcount in a folio), because it was only an optimization when many (e.g., 288) processes were mapping the same folios -- see commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section pages.")
(6) mm/rmap.c:folio_referenced_one() will skip exclusive swapbacked folios in dying processes. Applies to anonymous folios only. Without "false negatives", we'll now skip all actually shared ones. Skipping ones that are actually exclusive won't really matter, it's a pure optimization, and is not expected to matter in practice.
In theory, one can detect the problematic scenario: folio_mapcount() > 0 and no folio MM slot is occupied ("state unknown"). One could reset the MM slots while doing an rmap walk, which migration / folio split already do when setting everything up. Further, when batching PTEs we might naturally learn about a owner (e.g., folio_mapcount() == nr_ptes) and could update the owner. However, we'll defer that until the scenarios where it would really matter are clear.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 1da190f4 | 03-Mar-2025 | David Hildenbrand <[email protected]>
mm: Copy-on-Write (COW) reuse support for PTE-mapped THP
Currently, we never end up reusing PTE-mapped THPs after fork. This wasn't really a problem with PMD-sized THPs, because they would have to be PTE-mapped first, but it's becoming a problem with smaller THP sizes that are effectively always PTE-mapped.
With our new "mapped exclusively" vs "maybe mapped shared" logic for large folios, implementing CoW reuse for PTE-mapped THPs is straightforward: if exclusively mapped, make sure that all references are from these (our) mappings. Add some helpful comments to explain the details.
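Schematically, the reuse check boils down to something like this (a simplified sketch; the real code handles the swap cache, extra references, and races much more carefully):

    /* CoW reuse of a PTE-mapped large anon folio, schematically: */
    if (folio_test_anon(folio) && !folio_maybe_mapped_shared(folio) &&
        folio_ref_count(folio) == folio_mapcount(folio)) {
        /* all references come from our own mappings: reuse, map writable */
    }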
CONFIG_TRANSPARENT_HUGEPAGE selects CONFIG_MM_ID. If we spot an anon large folio without CONFIG_TRANSPARENT_HUGEPAGE in that code, something is seriously messed up.
There are plenty of things we can optimize in the future: For example, we could remember that the folio is fully exclusive so we could speedup the next fault further. Also, we could try "faulting around", turning surrounding PTEs that map the same folio writable. But especially the latter might increase COW latency, so it would need further investigation.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 405c4ef7 | 03-Mar-2025 | David Hildenbrand <[email protected]>
mm/rmap: pass dst_vma to folio_dup_file_rmap_pte() and friends
We'll need access to the destination MM when modifying the large mapcount of a non-hugetlb large folios next. So pass in the destination VMA.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.14-rc5

# 38607c62 | 28-Feb-2025 | Alistair Popple <[email protected]>
fs/dax: properly refcount fs dax pages
Currently fs dax pages are considered free when the refcount drops to one and their refcounts are not increased when mapped via PTEs or decreased when unmapped. This requires special logic in mm paths to detect that these pages should not be properly refcounted, and to detect when the refcount drops to one instead of zero.
On the other hand get_user_pages(), etc. will properly refcount fs dax pages by taking a reference and dropping it when the page is unpinned.
Tracking this special behaviour requires extra PTE bits (eg. pte_devmap) and introduces rules that are potentially confusing and specific to FS DAX pages. To fix this, and to possibly allow removal of the special PTE bits in future, convert the fs dax page refcounts to be zero based and instead take a reference on the page each time it is mapped as is currently the case for normal pages.
This may also allow a future clean-up to remove the pgmap refcounting that is currently done in mm/gup.c.
Link: https://lkml.kernel.org/r/c7d886ad7468a20452ef6e0ddab6cfe220874e7c.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <[email protected]> Reviewed-by: Dan Williams <[email protected]> Tested-by: Alison Schofield <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Asahi Lina <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Chunyan Zhang <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: linmiaohe <[email protected]> Cc: Logan Gunthorpe <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michael "Camp Drill Sergeant" Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Ted Ts'o <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: WANG Xuerui <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# ec2e0cc6 | 28-Feb-2025 | Alistair Popple <[email protected]>
mm/memory: add vmf_insert_page_mkwrite()
Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This creates a special devmap PTE entry for the pfn but does not take a reference on the underlying struct page for the mapping. This is because DAX page refcounts are treated specially, as indicated by the presence of a devmap entry.
To allow DAX page refcounts to be managed the same as normal page refcounts introduce vmf_insert_page_mkwrite(). This will take a reference on the underlying page much the same as vmf_insert_page, except it also permits upgrading an existing mapping to be writable if requested/possible.
Link: https://lkml.kernel.org/r/4ce3aa984c060f370105e0bfef1035869578be47.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <[email protected]> Acked-by: David Hildenbrand <[email protected]> Tested-by: Alison Schofield <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Asahi Lina <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Chunyan Zhang <[email protected]> Cc: Dan Wiliams <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: linmiaohe <[email protected]> Cc: Logan Gunthorpe <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michael "Camp Drill Sergeant" Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Ted Ts'o <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: WANG Xuerui <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 15a64311 | 28-Feb-2025 | Alistair Popple <[email protected]>
mm/memory: enhance insert_page_into_pte_locked() to create writable mappings
In preparation for using insert_page() for DAX, enhance insert_page_into_pte_locked() to handle establishing writable mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a PTE which bypasses the typical set_pte_range() in finish_fault.
Link: https://lkml.kernel.org/r/f7354fd9c2f5d0c2fa321733039f9f87e791023e.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <[email protected]> Suggested-by: Dan Williams <[email protected]> Reviewed-by: Dan Williams <[email protected]> Acked-by: David Hildenbrand <[email protected]> Tested-by: Alison Schofield <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Asahi Lina <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Chunyan Zhang <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: linmiaohe <[email protected]> Cc: Logan Gunthorpe <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michael "Camp Drill Sergeant" Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Ted Ts'o <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: WANG Xuerui <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 82ba975e | 28-Feb-2025 | Alistair Popple <[email protected]>
mm: allow compound zone device pages
Zone device pages are used to represent various types of device memory managed by device drivers. Currently compound zone device pages are not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only user of higher order zone device pages and have their own page reference counting.
A future change will unify FS DAX reference counting with normal page reference counting rules and remove the special FS DAX reference counting. Supporting that requires compound zone device pages.
Supporting compound zone device pages requires compound_head() to distinguish between head and tail pages whilst still preserving the special struct page fields that are specific to zone device pages.
A tail page is distinguished by having bit zero being set in page->compound_head, with the remaining bits pointing to the head page. For zone device pages page->compound_head is shared with page->pgmap.
The page->pgmap field must be common to all pages within a folio, even if the folio spans memory sections. Therefore pgmap is the same for both head and tail pages and can be moved into the folio and we can use the standard scheme to find compound_head from a tail page.
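The encoding described above is the standard compound-page scheme; schematically (the kernel's real helpers for this are PageTail() and compound_head()):

    /* tail pages have bit zero of page->compound_head set ... */
    bool is_tail = READ_ONCE(page->compound_head) & 1;

    /* ... and the remaining bits point at the head page */
    struct page *head = is_tail ?
        (struct page *)(READ_ONCE(page->compound_head) - 1) : page;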
Link: https://lkml.kernel.org/r/67055d772e6102accf85161d0b57b0b3944292bf.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <[email protected]> Signed-off-by: Balbir Singh <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Dan Williams <[email protected]> Acked-by: David Hildenbrand <[email protected]> Tested-by: Alison Schofield <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Asahi Lina <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Chunyan Zhang <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: linmiaohe <[email protected]> Cc: Logan Gunthorpe <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michael "Camp Drill Sergeant" Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Ted Ts'o <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: WANG Xuerui <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 720ba850 | 26-Feb-2025 | David Hildenbrand <[email protected]>
mm/mmu_notifier: use MMU_NOTIFY_CLEAR in remove_device_exclusive_entry()
Let's limit the use of MMU_NOTIFY_EXCLUSIVE to the case where we convert a present PTE to device-exclusive. For the other case, we can simply use MMU_NOTIFY_CLEAR, because it really is clearing the device-exclusive entry first, to then install the present entry.
Update the documentation of MMU_NOTIFY_EXCLUSIVE, to document the single use case more thoroughly.
If ever required, we could add a separate MMU_NOTIFY_CLEAR_EXCLUSIVE; for now using MMU_NOTIFY_CLEAR seems to be sufficient.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jérôme Glisse <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 2f95381f | 26-Feb-2025 | David Hildenbrand <[email protected]>
mm/memory: document restore_exclusive_pte()
Let's document how this function is to be used, and why the folio lock is involved.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jérôme Glisse <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 248624f9 | 26-Feb-2025 | David Hildenbrand <[email protected]>
mm/memory: pass folio and pte to restore_exclusive_pte()
Let's pass the folio and the pte to restore_exclusive_pte(), so we can avoid repeated page_folio() and ptep_get(). To do that, pass the pte to try_restore_exclusive_pte() and use a folio in there already.
While at it, just avoid the "swp_entry_t entry" variable in try_restore_exclusive_pte() and add a folio-locked check to restore_exclusive_pte().
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jérôme Glisse <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# db0f6e67 | 26-Feb-2025 | David Hildenbrand <[email protected]>
mm/memory: remove PageAnonExclusive sanity-check in restore_exclusive_pte()
In commit b832a354d787 ("mm/memory: page_add_anon_rmap() -> folio_add_anon_rmap_pte()") we accidentally changed the sanity check to essentially ignore anonymous folios by mis-placing the "!" ... but we really always only get anonymous folios in restore_exclusive_pte().
However, in the meantime we removed the separate "writable device-exclusive entries" and always detect if the PTE can be writable using can_change_pte_writable() -- which also consults PageAnonExclusive.
So let's just get rid of this sanity check completely.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jérôme Glisse <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.14-rc4

# 86758b50 | 18-Feb-2025 | Ryan Roberts <[email protected]>
mm/ioremap: pass pgprot_t to ioremap_prot() instead of unsigned long
ioremap_prot() currently accepts pgprot_val parameter as an unsigned long, thus implicitly assuming that pgprot_val and pgprot_t could never be bigger than unsigned long. But this assumption soon will not be true on arm64 when using D128 pgtables. In 128 bit page table configuration, unsigned long is 64 bit, but pgprot_t is 128 bit.
Passing platform abstracted pgprot_t argument is better as compared to size based data types. Let's change the parameter to directly pass pgprot_t like another similar helper generic_ioremap_prot().
Without this change in place, D128 configuration does not work on arm64 as the top 64 bits gets silently stripped when passing the protection value to this function.
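The prototype change, roughly (a sketch; attributes and per-arch variants omitted):

    /* before: the protection value is passed as a plain integer */
    void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
                               unsigned long prot_val);

    /* after: pass the abstracted type, so a 128-bit pgprot_t survives intact */
    void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
                               pgprot_t prot);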
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Co-developed-by: Anshuman Khandual <[email protected]> Signed-off-by: Anshuman Khandual <[email protected]> Acked-by: Catalin Marinas <[email protected]> [arm64] Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.14-rc3

# e49510bf | 13-Feb-2025 | Suren Baghdasaryan <[email protected]>
mm: prepare lock_vma_under_rcu() for vma reuse possibility
Once we make vma cache SLAB_TYPESAFE_BY_RCU, it will be possible for a vma to be reused and attached to another mm after lock_vma_under_rcu() locks the vma. lock_vma_under_rcu() should ensure that vma_start_read() is using the original mm and after locking the vma it should ensure that vma->vm_mm has not changed from under us.
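Schematically, the extra check looks like this (an illustrative sketch derived from the description above; the lookup call and the label name are placeholders):

    vma = mas_walk(&mas);               /* RCU lookup; the vma may get reused */
    if (!vma || !vma_start_read(mm, vma))
        goto fallback;                  /* fall back to mmap_lock */

    /* the vma could have been recycled and attached to another mm meanwhile */
    if (unlikely(vma->vm_mm != mm)) {
        vma_end_read(vma);
        goto fallback;
    }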
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Tested-by: Shivank Garg <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Klara Modin <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: "Paul E . McKenney" <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# f35ab95c | 13-Feb-2025 | Suren Baghdasaryan <[email protected]>
mm: replace vm_lock and detached flag with a reference count
rw_semaphore is a sizable structure of 40 bytes and consumes considerable space for each vm_area_struct. However vma_lock has two important specifics which can be used to replace rw_semaphore with a simpler structure:
1. Readers never wait. They try to take the vma_lock and fall back to mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock because writers first take mmap_lock in write mode. Because of these requirements, full rw_semaphore functionality is not needed and we can replace rw_semaphore and the vma->detached flag with a refcount (vm_refcnt).
When vma is in detached state, vm_refcnt is 0 and only a call to vma_mark_attached() can take it out of this state. Note that unlike before, now we enforce both vma_mark_attached() and vma_mark_detached() to be done only after vma has been write-locked. vma_mark_attached() changes vm_refcnt to 1 to indicate that it has been attached to the vma tree. When a reader takes read lock, it increments vm_refcnt, unless the top usable bit of vm_refcnt (0x40000000) is set, indicating presence of a writer. When writer takes write lock, it sets the top usable bit to indicate its presence. If there are readers, writer will wait using newly introduced mm->vma_writer_wait. Since all writers take mmap_lock in write mode first, there can be only one writer at a time. The last reader to release the lock will signal the writer to wake up. refcount might overflow if there are many competing readers, in which case read-locking will fail. Readers are expected to handle such failures.
In summary:
 1. all readers increment the vm_refcnt;
 2. writer sets top usable (writer) bit of vm_refcnt;
 3. readers cannot increment the vm_refcnt if the writer bit is set;
 4. in the presence of readers, writer must wait for the vm_refcnt to drop to 1 (plus the VMA_LOCK_OFFSET writer bit), indicating an attached vma with no readers;
 5. vm_refcnt overflow is handled by the readers.
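A schematic of the reader fast path under this scheme (pseudo-code sketch, treating vm_refcnt as a plain atomic here for clarity; the real implementation uses refcount_t helpers, memory-ordering annotations and overflow handling):

    /* reader: try to take the per-VMA read lock, never wait */
    int old = atomic_read(&vma->vm_refcnt);

    do {
        /* writer bit (VMA_LOCK_OFFSET) set: back off to mmap_lock */
        if (old & VMA_LOCK_OFFSET)
            return false;
    } while (!atomic_try_cmpxchg(&vma->vm_refcnt, &old, old + 1));

    return true;    /* unlock: decrement; the last reader wakes a waiting writer */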
While this vm_lock replacement does not yet result in a smaller vm_area_struct (it stays at 256 bytes due to cacheline alignment), it allows for further size optimization by structure member regrouping to bring the size of vm_area_struct below 192 bytes.
[[email protected]: fix a crash due to vma_end_read() that should have been removed] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Tested-by: Shivank Garg <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Reviewed-by: Vlastimil Babka <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Klara Modin <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: "Paul E . McKenney" <[email protected]> Cc: Peter Xu <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 45ad9f52 | 13-Feb-2025 | Suren Baghdasaryan <[email protected]>
mm: uninline the main body of vma_start_write()
vma_start_write() is used in many places and will grow in size very soon. It is not used in performance critical paths and uninlining it should limit the future code size growth. No functional changes.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Tested-by: Shivank Garg <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Klara Modin <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: "Paul E . McKenney" <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 8ef95d8f | 13-Feb-2025 | Suren Baghdasaryan <[email protected]>
mm: mark vma as detached until it's added into vma tree
The current implementation does not set the detached flag when a VMA is first allocated. This does not represent the real state of the VMA, which is detached until it is added into the mm's VMA tree. Fix this by marking new VMAs as detached and resetting the detached flag only after the VMA is added into a tree.
Introduce vma_mark_attached() to make the API more readable and to simplify possible future cleanup when vma->vm_mm might be used to indicate detached vma and vma_mark_attached() will need an additional mm parameter.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Tested-by: Shivank Garg <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Klara Modin <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: "Paul E . McKenney" <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# f495bd7e | 10-Feb-2025 | David Hildenbrand <[email protected]>
mm/rmap: keep mapcount untouched for device-exclusive entries
Now that conversion to device-exclusive no longer performs an rmap walk and all page_vma_mapped_walk() users were taught to properly handle device-exclusive entries, let's treat device-exclusive entries just as if they were present, similar to how we handle device-private entries already.
This fixes swapout/migration/split/hwpoison of folios with device-exclusive entries.
We only had to take care of page_vma_mapped_walk() users, because these traditionally assume pte_present(). Other page table walkers already have to handle !pte_present(), and some of them might simply skip them (e.g., MADV_PAGEOUT) if they are not specialized on them. This change doesn't modify the latter.
Note that while folios with device-exclusive PTEs can now get migrated, khugepaged will not collapse a THP if there is device-exclusive PTE. Doing so might also not be desired if the device frequently performs atomics to the same page. Similarly, KSM will never merge order-0 folios that are device-exclusive.
Link: https://lkml.kernel.org/r/[email protected] Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <[email protected]> Tested-by: Alistair Popple <[email protected]> Cc: Alex Shi <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Dave Airlie <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Karol Herbst <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Lyude <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Simona Vetter <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yanteng Si <[email protected]> Cc: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 9a914592 | 10-Feb-2025 | David Hildenbrand <[email protected]>
mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable()
Let's do it just like mprotect write-upgrade or during NUMA-hinting faults on PROT_NONE PTEs: detect if the PTE can be writable by using can_change_pte_writable().
Set the PTE only dirty if the folio is dirty: we might not necessarily have a write access, and setting the PTE writable doesn't require setting the PTE dirty.
From a CPU perspective, these entries are clean. So only set the PTE dirty if the folio is dirty.
With this change in place, there is no need to have separate readable and writable device-exclusive entry types, and we'll merge them next separately.
Note that, during fork(), we first convert the device-exclusive entries back to ordinary PTEs, and we only ever allow conversion of writable PTEs to device-exclusive -- only mprotect can currently change them to readable-device-exclusive. Consequently, we always expect PageAnonExclusive(page)==true and can_change_pte_writable()==true, unless we are dealing with soft-dirty tracking or uffd-wp. But reusing can_change_pte_writable() for now is cleaner.
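A sketch of the resulting PTE construction (illustrative, simplified from the description; not the exact kernel code):

    pte = mk_pte(page, vma->vm_page_prot);
    if (folio_test_dirty(folio))
        pte = pte_mkdirty(pte);             /* dirty only if the folio is dirty */
    if (can_change_pte_writable(vma, addr, pte))
        pte = pte_mkwrite(pte, vma);        /* writable only when known safe */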
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Tested-by: Alistair Popple <[email protected]> Cc: Alex Shi <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Dave Airlie <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Karol Herbst <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Lyude <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Simona Vetter <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yanteng Si <[email protected]> Cc: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>