Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3

# 2d900eff | 18-Apr-2025 | Davidlohr Bueso <[email protected]>
mm/migrate: fix sleep in atomic for large folios and buffer heads
The large folio + buffer head noref migration scenarios are being naughty and blocking while holding a spinlock.
As a consequence of the pagecache lookup path taking the folio lock this serializes against migration paths, so they can wait for each other. For the private_lock atomic case, a new BH_Migrate flag is introduced which enables the lookup to bail.
This allows the critical region of the private_lock on the migration path to be reduced to the way it was before ebdf4de5642fb6 ("mm: migrate: fix reference check race between __find_get_block() and migration"), that is covering the count checks.
The scope is always noref migration.
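For illustration, a minimal hypothetical sketch of the bail-out idea described above. The BH_Migrate flag name comes from this commit, but the helper below and its call site are assumptions, not quotes of the patch:

  #include <linux/buffer_head.h>

  /*
   * Hypothetical sketch: the pagecache lookup, which runs under the bdev's
   * private_lock, checks the new BH_Migrate bit and bails out instead of
   * blocking on a buffer that noref migration currently owns.
   */
  static bool bh_claimed_by_migration(struct buffer_head *bh)
  {
          return test_bit(BH_Migrate, &bh->b_state);
  }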
Reported-by: kernel test robot <[email protected]> Reported-by: [email protected] Closes: https://lore.kernel.org/oe-lkp/[email protected] Fixes: 3c20917120ce61 ("block/bdev: enable large folio support for large logical block sizes") Reviewed-by: Jan Kara <[email protected]> Co-developed-by: Luis Chamberlain <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Link: https://kdevops.org/ext4/v6.15-rc2.html # [0] Link: https://lore.kernel.org/all/[email protected]/ # [1] Link: https://lore.kernel.org/[email protected] Tested-by: [email protected] # [0] [1] Reviewed-by: Luis Chamberlain <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Revision tags: v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7

# 5d89666b | 10-Mar-2025 | Ryan Roberts <[email protected]>
mm: use ptep_get() instead of directly dereferencing pte_t*
It is best practice for all pte accesses to go via the arch helpers, to ensure non-torn values and to allow the arch to intervene where needed (contpte for arm64 for example). While in this case it was probably safe to directly dereference, let's tidy it up for consistency.
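As a hedged illustration of the pattern (not the exact hunk being changed), a helper that reads a PTE the recommended way:

  #include <linux/pgtable.h>

  /* Read the PTE through the arch helper: non-torn, contpte-aware. */
  static inline pte_t read_pte_example(pte_t *ptep)
  {
          return ptep_get(ptep);  /* rather than: *ptep */
  }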
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Qi Zheng <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Reviewed-by: Dev Jain <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.14-rc6

# 003fde44 | 03-Mar-2025 | David Hildenbrand <[email protected]>
mm: convert folio_likely_mapped_shared() to folio_maybe_mapped_shared()
Let's reuse our new MM ownership tracking infrastructure for large folios to make folio_likely_mapped_shared() never return false negatives -- never indicating "not mapped shared" although the folio *is* mapped shared. With that, we can rename it to folio_maybe_mapped_shared() and get rid of the dependency on the mapcount of the first folio page.
The semantics are now arguably clearer: no mixture of "false negatives" and "false positives", only the remaining possibility for "false positives".
Thoroughly document the new semantics. We might now detect that a large folio is "maybe mapped shared" although it *no longer* is -- but once was. Now, if more than two MMs mapped a folio at the same time, and the MM mapping the folio exclusively at the end is not one tracked in the two folio MM slots, we will detect the folio as "maybe mapped shared".
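A hedged usage sketch of the new semantics (illustrative caller, not code from the patch):

  #include <linux/mm.h>

  static bool folio_certainly_not_shared(struct folio *folio)
  {
          /*
           * false -> possibly, but not necessarily, mapped by multiple MMs;
           * true  -> certainly not mapped by multiple MMs.
           */
          return !folio_maybe_mapped_shared(folio);
  }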
For anonymous folios, usually (except weird corner cases) all PTEs that target a "maybe mapped shared" folio are R/O. As soon as a child process would write to them (iow, actively use them), we would CoW and effectively replace these PTEs. Most cases (below) are not expected to really matter with large anonymous folios for this reason.
Most importantly, there will be no change at all for:
 * small folios
 * hugetlb folios
 * PMD-mapped PMD-sized THPs (single mapping)
This change has the potential to affect existing callers of folio_likely_mapped_shared() -> folio_maybe_mapped_shared():
(1) fs/proc/task_mmu.c: no change (hugetlb)
(2) khugepaged counts PTEs that target shared folios towards max_ptes_shared (default: HPAGE_PMD_NR / 2), meaning we could skip a collapse where we would have previously collapsed. This only applies to anonymous folios and is not expected to matter in practice.
Worth noting that this change sorts out case (A) documented in commit 1bafe96e89f0 ("mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()") by removing the possibility for "false negatives".
(3) MADV_COLD / MADV_PAGEOUT / MADV_FREE will not try splitting PTE-mapped THPs that are considered shared but not fully covered by the requested range, consequently not processing them.
PMD-mapped PMD-sized THPs are not affected, nor are cases where all PTEs are covered by the requested range. These functions are usually only called on anon/file folios that are exclusively mapped most of the time (no other file mappings or no fork()), so the "false negatives" are not expected to matter in practice.
(4) mbind() / migrate_pages() / move_pages() will refuse to migrate shared folios unless MPOL_MF_MOVE_ALL is effective (requires CAP_SYS_NICE). We will now reject some folios that could be migrated.
Similar to (3), especially with MPOL_MF_MOVE_ALL, so this is not expected to matter in practice.
Note that cpuset_migrate_mm_workfn() calls do_migrate_pages() with MPOL_MF_MOVE_ALL.
(5) NUMA hinting
mm/migrate.c:migrate_misplaced_folio_prepare() will skip file folios that are probably shared libraries (-> "mapped shared" and executable). This check would have detected it as a shared library at some point (at least 3 MMs mapping it), so detecting it afterwards does not sound wrong (still a shared library). Not expected to matter.
mm/memory.c:numa_migrate_check() will indicate TNF_SHARED in MAP_SHARED file mappings when encountering a shared folio. Similar reasoning, not expected to matter.
mm/mprotect.c:change_pte_range() will skip folios detected as shared in CoW mappings. Similarly, this is not expected to matter in practice, but if it would ever be a problem we could relax that check a bit (e.g., basing it on the average page-mapcount in a folio), because it was only an optimization when many (e.g., 288) processes were mapping the same folios -- see commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section pages.")
(6) mm/rmap.c:folio_referenced_one() will skip exclusive swapbacked folios in dying processes. Applies to anonymous folios only. Without "false negatives", we'll now skip all actually shared ones. Skipping ones that are actually exclusive won't really matter, it's a pure optimization, and is not expected to matter in practice.
In theory, one can detect the problematic scenario: folio_mapcount() > 0 and no folio MM slot is occupied ("state unknown"). One could reset the MM slots while doing an rmap walk, which migration / folio split already do when setting everything up. Further, when batching PTEs we might naturally learn about an owner (e.g., folio_mapcount() == nr_ptes) and could update the owner. However, we'll defer that until the scenarios where it would really matter are clear.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2

# e92b6e7b | 07-Feb-2025 | Lorenzo Stoakes <[email protected]>
mm: use READ/WRITE_ONCE() for vma->vm_flags on migrate, mprotect
According to the syzbot report referenced here, it is possible to encounter a race between mprotect() writing to the vma->vm_flags field and migration checking whether the VMA is locked.
There is no real problem with timing here per se, only that torn reads/writes may occur. Therefore, as a proximate fix, ensure both operations use READ_ONCE() and WRITE_ONCE() to avoid this.
This race is possible due to the ability to look up VMAs via the rmap, which migration does in this case, which takes no mmap or VMA lock and therefore does not preclude an operation to modify a VMA.
When the final update of VMA flags is performed by mprotect, this will cause the rmap lock to be taken while the VMA is inserted on split/merge.
However the means by which we perform splits/merges in the kernel is that we perform the split/merge operation on the VMA, acquiring/releasing locks as needed, and only then, after having done so, modifying fields.
We should carefully examine and determine whether we can combine the two operations so as to avoid such races, and whether it might be possible to otherwise annotate these rmap field accesses.
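As a generic, hypothetical illustration of the torn-access class being addressed (a made-up structure rather than the real vma field, whose updates go through their own helpers in mainline), the reader and writer pair the *_ONCE() accessors:

  #include <linux/compiler.h>

  struct demo {                           /* hypothetical structure */
          unsigned long flags;
  };

  static bool demo_test_flag(struct demo *d, unsigned long f)
  {
          return READ_ONCE(d->flags) & f;         /* lockless reader: no torn read */
  }

  static void demo_publish_flags(struct demo *d, unsigned long newflags)
  {
          WRITE_ONCE(d->flags, newflags);         /* writer: no torn write */
  }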
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Lorenzo Stoakes <[email protected]> Reported-by: [email protected] Closes: https://lore.kernel.org/all/[email protected]/ Cc: Jann Horn <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 60cf233b | 05-Mar-2025 | Zi Yan <[email protected]>
mm/migrate: fix shmem xarray update during migration
A shmem folio can be either in page cache or in swap cache, but not at the same time. Namely, once it is in swap cache, folio->mapping should be NULL, and the folio is no longer in a shmem mapping.
In __folio_migrate_mapping(), to determine the number of xarray entries to update, folio_test_swapbacked() is used, but that conflates shmem in page cache case and shmem in swap cache case. It leads to xarray multi-index entry corruption, since it turns a sibling entry to a normal entry during xas_store() (see [1] for a userspace reproduction). Fix it by only using folio_test_swapcache() to determine whether xarray is storing swap cache entries or not to choose the right number of xarray entries to update.
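A hedged sketch of the decision (an illustrative helper, not the exact hunk): how many xarray slots the migration update has to touch.

  /*
   * Swap cache stores one entry per page; the page cache stores a single
   * multi-index entry for the whole folio.
   */
  static long nr_xa_entries_to_update(struct folio *folio)
  {
          return folio_test_swapcache(folio) ? folio_nr_pages(folio) : 1;
  }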
[1] https://lore.kernel.org/linux-mm/[email protected]/
Note: In __split_huge_page(), folio_test_anon() && folio_test_swapcache() is used to get the swap_cache address space, but that ignores the shmem folio in swap cache case. It could lead to NULL pointer dereferencing when an in-swap-cache shmem folio is split at __xa_store(), since !folio_test_anon() is true and folio->mapping is NULL. But fortunately, its caller split_huge_page_to_list_to_order() bails out early with EBUSY when folio->mapping is NULL. So no need to take care of it here.
Link: https://lkml.kernel.org/r/[email protected] Fixes: fc346d0a70a1 ("mm: migrate high-order folios in swap cache correctly") Signed-off-by: Zi Yan <[email protected]> Reported-by: Liu Shixin <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Suggested-by: Hugh Dickins <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Cc: Barry Song <[email protected]> Cc: Charan Teja Kalla <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Lance Yang <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3

# f752e677 | 08-Aug-2024 | Byungchul Park <[email protected]>
mm: separate move/undo parts from migrate_pages_batch()
Functionally, no change. This is preparation for the luf mechanism, which requires separate folio lists for its own handling during migration. Refactor migrate_pages_batch() so as to separate the move/undo parts from it.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Byungchul Park <[email protected]> Reviewed-by: Shivank Garg <[email protected]> Reviewed-by: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# b235448e | 13-Jan-2025 | David Hildenbrand <[email protected]>
mm/hugetlb: rename folio_putback_active_hugetlb() to folio_putback_hugetlb()
Now that folio_putback_hugetlb() is only called on folios that were previously isolated through folio_isolate_hugetlb(), let's rename it to match folio_putback_lru().
Add some kernel doc to clarify how this function is supposed to be used.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muchun Song <[email protected]> Cc: Sidhartha Kumar <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# ba23f58d | 13-Jan-2025 | David Hildenbrand <[email protected]>
mm/migrate: don't call folio_putback_active_hugetlb() on dst hugetlb folio
We replaced a simple put_page() by a putback_active_hugepage() call in commit 3aaa76e125c1 ("mm: migrate: hugetlb: putback destination hugepage to active list"), to set the "active" flag on the dst hugetlb folio.
Nowadays, we decoupled the "active" list from the flag, by calling the flag "migratable".
Calling "putback" on something that wasn't allocated is weird and not future proof, especially if we might reach that path when migration failed and we just want to free the freshly allocated hugetlb folio.
Let's simply handle the migratable flag and the active list flag in move_hugetlb_state(), where we know that allocation succeeded and already handle the temporary flag; use a simple folio_put() to return our reference.
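A hedged sketch of the resulting flow (the two callees are real, the wrapper is purely illustrative):

  static void hugetlb_migration_finish_sketch(struct folio *src,
                                              struct folio *dst, int reason)
  {
          /* transfers temporary/migratable state now that allocation succeeded */
          move_hugetlb_state(src, dst, reason);
          folio_put(dst);         /* was: folio_putback_active_hugetlb(dst) */
  }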
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muchun Song <[email protected]> Cc: Sidhartha Kumar <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 4c640f12 | 13-Jan-2025 | David Hildenbrand <[email protected]>
mm/hugetlb: rename isolate_hugetlb() to folio_isolate_hugetlb()
Let's make the function name match "folio_isolate_lru()", and add some kernel doc.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Cc: Muchun Song <[email protected]> Cc: Sidhartha Kumar <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 8c6e2d12 | 10-Dec-2024 | Hyeonggon Yoo <[email protected]>
mm/migrate: remove slab checks in isolate_movable_page()
Commit 8b8817630ae8 ("mm/migrate: make isolate_movable_page() skip slab pages") introduced slab checks to prevent mis-identification of slab pages as movable kernel pages.
However, after Matthew's frozen folio series, these slab checks became unnecessary as the migration logic fails to increase the reference count for frozen slab folios. Remove these redundant slab checks and associated memory barriers.
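A hedged sketch of why the explicit slab test is redundant now (illustrative guard, not the full isolate_movable_page()): isolation only proceeds if it can take a reference, and a frozen (refcount == 0) slab folio refuses that.

  if (unlikely(!folio_try_get(folio)))
          return false;           /* frozen folio (e.g. slab): cannot be isolated */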
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hyeonggon Yoo <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# bfc1d178 | 26-Nov-2024 | Donet Tom <[email protected]>
mm: migrate: remove unused argument vma from migrate_misplaced_folio()
Commit ee86814b0562 ("mm/migrate: move NUMA hinting fault folio isolation + checks under PTL") removed the code that had used the vma argument in migrate_misplaced_folio.
Since the vma argument was no longer used in migrate_misplaced_folio, this patch removes it.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Donet Tom <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Reviewed-by: Zi Yan <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Ritesh Harjani (IBM) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 51f43d5d | 29-Nov-2024 | David Wang <[email protected]>
mm/codetag: swap tags when migrate pages
Current solution to adjust codetag references during page migration is done in 3 steps:
1. sets the codetag reference of the old page as empty (not pointing to any codetag);
2. subtracts counters of the new page to compensate for its own allocation;
3. sets codetag reference of the new page to point to the codetag of the old page.
This does not work if CONFIG_MEM_ALLOC_PROFILING_DEBUG=n because set_codetag_empty() becomes NOOP. Instead, let's simply swap codetag references so that the new page is referencing the old codetag and the old page is referencing the new codetag. This way accounting stays valid and the logic makes more sense.
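A hypothetical sketch of the "swap" idea (the wrapper name is made up; the real helper added by this patch operates on the two pages' tag references):

  #include <linux/alloc_tag.h>

  static void codetag_refs_swap_sketch(union codetag_ref *old_ref,
                                       union codetag_ref *new_ref)
  {
          union codetag_ref tmp = *old_ref;

          *old_ref = *new_ref;    /* old page now carries the new page's tag */
          *new_ref = tmp;         /* new page now carries the original tag */
  }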
Link: https://lkml.kernel.org/r/[email protected] Fixes: e0a955bf7f61 ("mm/codetag: add pgalloc_tag_copy()") Signed-off-by: David Wang <[email protected]> Closes: https://lore.kernel.org/lkml/[email protected]/ Acked-by: Suren Baghdasaryan <[email protected]> Suggested-by: Suren Baghdasaryan <[email protected]> Acked-by: Yu Zhao <[email protected]> Cc: Kent Overstreet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 5bb6345c | 18-Oct-2024 | Dev Jain <[email protected]>
mm: remove redundant condition for THP folio
folio_test_pmd_mappable() implies folio_test_large(), therefore, simplify the expression for is_thp.
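The simplification, as a hedged fragment (not the literal diff):

  /* folio_test_pmd_mappable() implies folio_test_large() */
  bool is_thp = folio_test_pmd_mappable(folio);
  /* was: folio_test_large(folio) && folio_test_pmd_mappable(folio) */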
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Dev Jain <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Cc: "Huang, Ying" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 473c3712 | 26-Sep-2024 | Zhaoyang Huang <[email protected]>
mm: migrate LRU_REFS_MASK bits in folio_migrate_flags
The LRU_REFS_MASK bits are not inherited during migration, which leads the new folio to start from tier 0 when MGLRU is enabled. Try to bring over as many bits of folio->flags as possible, since compaction and alloc_contig_range, which introduce migration, do happen at times.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhaoyang Huang <[email protected]> Suggested-by: Yu Zhao <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# f8f931bb | 27-Oct-2024 | Hugh Dickins <[email protected]>
mm/thp: fix deferred split unqueue naming and locking
Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting.
Before fixing locking: rename misleading folio_undo_large_rmappable(), which does not undo large_rmappable, to folio_unqueue_deferred_split(), which is what it does. But that and its out-of-line __callee are mm internals of very limited usability: add comment and WARN_ON_ONCEs to check usage; and return a bool to say if a deferred split was unqueued, which can then be used in WARN_ON_ONCEs around safety checks (sparing callers the arcane conditionals in __folio_unqueue_deferred_split()).
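A hedged usage sketch of the new bool return, as described above (illustrative call site, not a quote of the patch):

  /* caller believes the folio cannot be on a deferred split queue */
  WARN_ON_ONCE(folio_unqueue_deferred_split(folio));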
Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all of whose callers now call it beforehand (and if any forget then bad_page() will tell) - except for its caller put_pages_list(), which itself no longer has any callers (and will be deleted separately).
Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0 without checking and unqueueing a THP folio from deferred split list; which is unfortunate, since the split_queue_lock depends on the memcg (when memcg is enabled); so swapout has been unqueueing such THPs later, when freeing the folio, using the pgdat's lock instead: potentially corrupting the memcg's list. __remove_mapping() has frozen refcount to 0 here, so no problem with calling folio_unqueue_deferred_split() before resetting memcg_data.
That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware"): which included a check on swapcache before adding to deferred queue, but no check on deferred queue before adding THP to swapcache. That worked fine with the usual sequence of events in reclaim (though there were a couple of rare ways in which a THP on deferred queue could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split underused THPs") avoids splitting underused THPs in reclaim, which makes swapcache THPs on deferred queue commonplace.
Keep the check on swapcache before adding to deferred queue? Yes: it is no longer essential, but preserves the existing behaviour, and is likely to be a worthwhile optimization (vmstat showed much more traffic on the queue under swapping load if the check was removed); update its comment.
Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing folio->memcg_data without checking and unqueueing a THP folio from the deferred list, sometimes corrupting "from" memcg's list, like swapout. Refcount is non-zero here, so folio_unqueue_deferred_split() can only be used in a WARN_ON_ONCE to validate the fix, which must be done earlier: mem_cgroup_move_charge_pte_range() first try to split the THP (splitting of course unqueues), or skip it if that fails. Not ideal, but moving charge has been requested, and khugepaged should repair the THP later: nobody wants new custom unqueueing code just for this deprecated case.
The 87eaceb3faa5 commit did have the code to move from one deferred list to another (but was not conscious of its unsafety while refcount non-0); but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care deferred split queue in memcg charge move path"), which argued that the existence of a PMD mapping guarantees that the THP cannot be on a deferred list. As above, false in rare cases, and now commonly false.
Backport to 6.11 should be straightforward. Earlier backports must take care that other _deferred_list fixes and dependencies are included. There is not a strong case for backports, but they can fix cornercases.
Link: https://lkml.kernel.org/r/[email protected] Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Usama Arif <[email protected]> Cc: Wei Yang <[email protected]> Cc: Zi Yan <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 35e41024 | 25-Oct-2024 | Gregory Price <[email protected]>
vmscan,migrate: fix page count imbalance on node stats when demoting pages
When numa balancing is enabled with demotion, vmscan will call migrate_pages when shrinking LRUs. migrate_pages will decrement the node's isolated page count, leading to an imbalanced count when invoked from (MG)LRU code.
The result is dmesg output like such:
$ cat /proc/sys/vm/stat_refresh
[77383.088417] vmstat_refresh: nr_isolated_anon -103212
[77383.088417] vmstat_refresh: nr_isolated_file -899642
This negative value may impact compaction and reclaim throttling.
The following path produces the decrement:
shrink_folio_list
  demote_folio_list
    migrate_pages
      migrate_pages_batch
        migrate_folio_move
          migrate_folio_done
            mod_node_page_state(-ve) <- decrement
This path happens for SUCCESSFUL migrations, not failures. Typically callers to migrate_pages are required to handle putback/accounting for failures, but this is already handled in the shrink code.
When accounting for migrations, instead do not decrement the count when the migration reason is MR_DEMOTION. As of v6.11, this demotion logic is the only source of MR_DEMOTION.
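A hedged sketch of the accounting rule, wrapped in an illustrative helper (close to the described fix, but not a verbatim hunk):

  static void isolated_stat_drop_sketch(struct folio *src,
                                        enum migrate_reason reason)
  {
          /* reclaim already accounts for demoted pages itself */
          if (likely(!__folio_test_movable(src)) && reason != MR_DEMOTION)
                  mod_node_page_state(folio_pgdat(src),
                                      NR_ISOLATED_ANON + folio_is_file_lru(src),
                                      -folio_nr_pages(src));
  }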
Link: https://lkml.kernel.org/r/[email protected] Fixes: 26aa2d199d6f ("mm/migrate: demote pages during reclaim") Signed-off-by: Gregory Price <[email protected]> Reviewed-by: Yang Shi <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Wei Xu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# e0fc2037 | 23-Oct-2024 | Zi Yan <[email protected]>
mm: avoid VM_BUG_ON when try to map an anon large folio to zero page.
An anonymous large folio can be split into non order-0 folios, try_to_map_unused_to_zeropage() should not VM_BUG_ON compound pages but just return false. This fixes the crash when splitting anonymous large folios to non order-0 folios.
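The shape of the change, as a hedged fragment:

  /* was: VM_BUG_ON_PAGE(PageCompound(page), page); */
  if (PageCompound(page))
          return false;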
Link: https://lkml.kernel.org/r/[email protected] Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp") Signed-off-by: Zi Yan <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Usama Arif <[email protected]> Cc: Barry Song <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Rapoport (Microsoft) <[email protected]> Cc: Nico Pache <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 7735348d | 02-Oct-2024 | Matthew Wilcox (Oracle) <[email protected]>
migrate: Remove references to Private2
These comments are now stale; rewrite them.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
# 8001070c | 24-Sep-2024 | Jeongjun Park <[email protected]>
mm: migrate: annotate data-race in migrate_folio_unmap()
I found a report from syzbot [1]
This report shows that the value can change, but in reality the value set by __folio_set_movable() cannot change here, because the folio refcount is held.
Therefore, it is appropriate to add an annotation to make KCSAN ignore that data-race.
[1]
================================================================== BUG: KCSAN: data-race in __filemap_remove_folio / migrate_pages_batch
write to 0xffffea0004b81dd8 of 8 bytes by task 6348 on cpu 0:
 page_cache_delete mm/filemap.c:153 [inline]
 __filemap_remove_folio+0x1ac/0x2c0 mm/filemap.c:233
 filemap_remove_folio+0x6b/0x1f0 mm/filemap.c:265
 truncate_inode_folio+0x42/0x50 mm/truncate.c:178
 shmem_undo_range+0x25b/0xa70 mm/shmem.c:1028
 shmem_truncate_range mm/shmem.c:1144 [inline]
 shmem_evict_inode+0x14d/0x530 mm/shmem.c:1272
 evict+0x2f0/0x580 fs/inode.c:731
 iput_final fs/inode.c:1883 [inline]
 iput+0x42a/0x5b0 fs/inode.c:1909
 dentry_unlink_inode+0x24f/0x260 fs/dcache.c:412
 __dentry_kill+0x18b/0x4c0 fs/dcache.c:615
 dput+0x5c/0xd0 fs/dcache.c:857
 __fput+0x3fb/0x6d0 fs/file_table.c:439
 ____fput+0x1c/0x30 fs/file_table.c:459
 task_work_run+0x13a/0x1a0 kernel/task_work.c:228
 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
 exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
 __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
 syscall_exit_to_user_mode+0xbe/0x130 kernel/entry/common.c:218
 do_syscall_64+0xd6/0x1c0 arch/x86/entry/common.c:89
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffffea0004b81dd8 of 8 bytes by task 6342 on cpu 1:
 __folio_test_movable include/linux/page-flags.h:699 [inline]
 migrate_folio_unmap mm/migrate.c:1199 [inline]
 migrate_pages_batch+0x24c/0x1940 mm/migrate.c:1797
 migrate_pages_sync mm/migrate.c:1963 [inline]
 migrate_pages+0xff1/0x1820 mm/migrate.c:2072
 do_mbind mm/mempolicy.c:1390 [inline]
 kernel_mbind mm/mempolicy.c:1533 [inline]
 __do_sys_mbind mm/mempolicy.c:1607 [inline]
 __se_sys_mbind+0xf76/0x1160 mm/mempolicy.c:1603
 __x64_sys_mbind+0x78/0x90 mm/mempolicy.c:1603
 x64_sys_call+0x2b4d/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:238
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
value changed: 0xffff888127601078 -> 0x0000000000000000
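A hedged sketch of the annotation pattern applied here (illustrative line, not a quote of the hunk):

  /* benign race: the movable state cannot change while we hold a reference */
  bool is_lru = data_race(!__folio_test_movable(src));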
Link: https://lkml.kernel.org/r/[email protected] Fixes: 7e2a5e5ab217 ("mm: migrate: use __folio_test_movable()") Signed-off-by: Jeongjun Park <[email protected]> Reported-by: syzbot <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Zi Yan <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# e0a955bf | 06-Sep-2024 | Yu Zhao <[email protected]>
mm/codetag: add pgalloc_tag_copy()
Add pgalloc_tag_copy() to transfer the codetag from the old folio to the new one during migration. This makes the original allocation sites persist across migration rather than being lumped into the get_new_folio callbacks passed into migrate_pages(), e.g., compaction_alloc():
  # echo 1 >/proc/sys/vm/compact_memory
  # grep compaction_alloc /proc/allocinfo

Before this patch:
  132968448 32463 mm/compaction.c:1880 func:compaction_alloc

After this patch:
  0 0 mm/compaction.c:1880 func:compaction_alloc
Link: https://lkml.kernel.org/r/[email protected] Fixes: dcfe378c81f7 ("lib: introduce support for page allocation tagging") Signed-off-by: Yu Zhao <[email protected]> Acked-by: Suren Baghdasaryan <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Muchun Song <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# cfc81938 | 05-Sep-2024 | Kefeng Wang <[email protected]>
mm: migrate: remove unused includes
random.h is not needed since commit 6c542ab75714 ("mm/demotion: build demotion targets based on explicit memory tiers"), all functions moved into memory-tiers.
nsproxy.h is not needed since commit 228ebcbe634a ("Uninline find_task_by_xxx set of functions"), no nsproxy, we only call find_task_by_vpid() now.
hugetlb_cgroup.h is not needed since commit ab5ac90aecf5 ("mm, hugetlb: do not rely on overcommit limit during migration"), move_hugetlb_state() is called and it belongs to hugetlb.h, which is already included.
balloon_compaction.h: we now have the more general movable_operations for non-lru movable page migration, so it can be dropped.
memremap.h, userfaultfd_k.h and oom.h were introduced for zone device page migration, but all of those functions have been moved into migrate_device.c, so they are no longer needed either.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 46dcc7c9 | 05-Sep-2024 | Nanyong Sun <[email protected]>
mm: migrate: simplify find_mm_struct()
Use find_get_task_by_vpid() to replace the task_struct lookup logic in find_mm_struct(). Note that this patch moves the ptrace_may_access() call out of the rcu_read_lock() scope; this is OK because it does not actually need it: find_get_task_by_vpid() already gets the pid and task safely, so ptrace_may_access() can use the task safely, similar to what sched_core_share_pid() does.
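A hedged sketch of the simplified lookup (not the exact mainline function; special cases and error handling trimmed):

  #include <linux/sched.h>
  #include <linux/sched/mm.h>
  #include <linux/ptrace.h>

  static struct mm_struct *find_mm_struct_sketch(pid_t pid)
  {
          struct task_struct *task = find_get_task_by_vpid(pid);
          struct mm_struct *mm;

          if (!task)
                  return ERR_PTR(-ESRCH);

          /* safe outside rcu_read_lock(): we hold a task reference */
          if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
                  put_task_struct(task);
                  return ERR_PTR(-EPERM);
          }

          mm = get_task_mm(task);
          put_task_struct(task);
          return mm ? mm : ERR_PTR(-EINVAL);
  }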
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nanyong Sun <[email protected]> Cc: Kefeng Wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 8422acdc | 30-Aug-2024 | Usama Arif <[email protected]>
mm: introduce a pageflag for partially mapped folios
Currently folio->_deferred_list is used to keep track of partially_mapped folios that are going to be split under memory pressure. In the next patch, all THPs that are faulted in and collapsed by khugepaged are also going to be tracked using _deferred_list.
This patch introduces a pageflag to be able to distinguish between partially mapped folios and others in the deferred_list at split time in deferred_split_scan. It's needed because __folio_remove_rmap decrements _mapcount, _large_mapcount and _entire_mapcount, so it would otherwise not be possible to distinguish between partially mapped folios and others in deferred_split_scan.
Even though it introduces an extra flag to track whether the folio is partially mapped, there is no functional change intended with this patch. The flag is not useful in this patch itself; it will become useful in the next patch, when _deferred_list also holds non-partially-mapped folios.
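A hedged usage sketch of the new flag at scan time (the accessor name follows the series' naming convention but should be treated as an assumption; call site illustrative):

  /* distinguish queued folios at split time in deferred_split_scan() */
  bool split_candidate = folio_test_partially_mapped(folio);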
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Usama Arif <[email protected]> Cc: Alexander Zhu <[email protected]> Cc: Barry Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kairui Song <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nico Pache <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuang Zhai <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Shuang Zhai <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# b1f20206 | 30-Aug-2024 | Yu Zhao <[email protected]>
mm: remap unused subpages to shared zeropage when splitting isolated thp
Patch series "mm: split underused THPs", v5.
The current upstream default policy for THP is always. However, Meta uses madvise in production as the current THP=always policy vastly overprovisions THPs in sparsely accessed memory areas, resulting in excessive memory pressure and premature OOM killing. Using madvise + relying on khugepaged has certain drawbacks over THP=always. Using madvise hints means THPs aren't "transparent" and require userspace changes. Waiting for khugepaged to scan memory and collapse pages into THP can be slow and unpredictable in terms of performance (i.e. you don't know when the collapse will happen), while production environments require predictable performance. If there is enough memory available, it's better for both performance and predictability to have a THP from fault time, i.e. THP=always rather than wait for khugepaged to collapse it, and deal with sparsely populated THPs when the system is running out of memory.
This patch series is an attempt to mitigate the issue of running out of memory when THP is always enabled. During runtime whenever a THP is being faulted in or collapsed by khugepaged, the THP is added to a list. Whenever memory reclaim happens, the kernel runs the deferred_split shrinker which goes through the list and checks if the THP was underused, i.e. how many of the base 4K pages of the entire THP were zero-filled. If this number goes above a certain threshold, the shrinker will attempt to split that THP. Then at remap time, the pages that were zero-filled are mapped to the shared zeropage, hence saving memory. This method avoids the downside of wasting memory in areas where THP is sparsely filled when THP is always enabled, while still providing the upside THPs like reduced TLB misses without having to use madvise.
Meta production workloads that were CPU bound (>99% CPU utilization) were tested with THP shrinker. The results after 2 hours are as follows:

                        | THP=madvise |  THP=always   | THP=always
                        |             |               | + shrinker series
                        |             |               | + max_ptes_none=409
-----------------------------------------------------------------------------
Performance improvement |      -      |    +1.8%      |     +1.7%
(over THP=madvise)      |             |               |
-----------------------------------------------------------------------------
Memory usage            |    54.6G    | 58.8G (+7.7%) |   55.9G (+2.4%)
-----------------------------------------------------------------------------

max_ptes_none=409 means that any THP that has more than 409 out of 512 (80%) zero-filled pages will be split.
To test out the patches, the below commands without the shrinker will invoke OOM killer immediately and kill stress, but will not fail with the shrinker:
echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
mkdir /sys/fs/cgroup/test
echo $$ > /sys/fs/cgroup/test/cgroup.procs
echo 20M > /sys/fs/cgroup/test/memory.max
echo 0 > /sys/fs/cgroup/test/memory.swap.max
# allocate twice memory.max for each stress worker and touch 40/512 of
# each THP, i.e. vm-stride 50K.
# With the shrinker, max_ptes_none of 470 and below won't invoke OOM
# killer.
# Without the shrinker, OOM killer is invoked immediately irrespective
# of max_ptes_none value and kills stress.
stress --vm 1 --vm-bytes 40M --vm-stride 50K
This patch (of 5):
Here being unused means containing only zeros and inaccessible to userspace. When splitting an isolated thp under reclaim or migration, the unused subpages can be mapped to the shared zeropage, hence saving memory. This is particularly helpful when the internal fragmentation of a thp is high, i.e. it has many untouched subpages.
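A hedged sketch of the "unused" test only (the remap-to-zeropage step itself is more involved and is not shown; the helper is illustrative):

  /* a subpage qualifies if it contains nothing but zeroes */
  static bool subpage_is_all_zeroes(struct page *page)
  {
          void *addr = kmap_local_page(page);
          bool zero = !memchr_inv(addr, 0, PAGE_SIZE);

          kunmap_local(addr);
          return zero;
  }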
This is also a prerequisite for THP low utilization shrinker which will be introduced in later patches, where underutilized THPs are split, and the zero-filled pages are freed saving memory.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yu Zhao <[email protected]> Signed-off-by: Usama Arif <[email protected]> Tested-by: Shuang Zhai <[email protected]> Cc: Alexander Zhu <[email protected]> Cc: Barry Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kairui Song <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nico Pache <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuang Zhai <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
# 5d65c8d7 | 24-Aug-2024 | Barry Song <[email protected]>
mm: count the number of anonymous THPs per size
Patch series "mm: count the number of anonymous THPs per size", v4.
Knowing the number of transparent anon THPs in the system is crucial for performance analysis. It helps in understanding the ratio and distribution of THPs versus small folios throughout the system.
Additionally, partial unmapping by userspace can lead to significant waste of THPs over time and increase memory reclamation pressure. We need this information for comprehensive system tuning.
This patch (of 2):
Let's track for each anonymous THP size, how many of them are currently allocated. We'll track the complete lifespan of an anon THP, starting when it becomes an anon THP ("large anon folio") (->mapping gets set), until it gets freed (->mapping gets cleared).
Introduce a new "nr_anon" counter per THP size and adjust the corresponding counter in the following cases: * We allocate a new THP and call folio_add_new_anon_rmap() to map it the first time and turn it into an anon THP. * We split an anon THP into multiple smaller ones. * We migrate an anon THP, when we prepare the destination. * We free an anon THP back to the buddy.
Note that AnonPages in /proc/meminfo currently tracks the total number of *mapped* anonymous *pages*, and therefore has slightly different semantics. In the future, we might also want to track "nr_anon_mapped" for each THP size, which might be helpful when comparing it to the number of allocated anon THPs (long-term pinning, stuck in swapcache, memory leaks, ...).
Further note that for now, we only track anon THPs after they got their ->mapping set, for example via folio_add_new_anon_rmap(). If we would allocate some in the swapcache, they will only show up in the statistics for now after they have been mapped to user space the first time, where we call folio_add_new_anon_rmap().
[[email protected]: documentation fixups, per David] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Chris Li <[email protected]> Cc: Chuanhua Han <[email protected]> Cc: Kairui Song <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Lance Yang <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shuai Yuan <[email protected]> Cc: Usama Arif <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>