|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6 |
|
| #
74949222 |
| 03-Mar-2025 |
David Hildenbrand <[email protected]> |
mm: stop maintaining the per-page mapcount of large folios (CONFIG_NO_PAGE_MAPCOUNT)
Everything is in place to stop using the per-page mapcounts in large folios: the mapcount of tail pages will always be logically 0 (-1 value), just like it currently is for hugetlb folios already, and the page mapcount of the head page is either 0 (-1 value) or contains a page type (e.g., hugetlb).
Maintaining _nr_pages_mapped without per-page mapcounts is impossible, so that one also has to go with CONFIG_NO_PAGE_MAPCOUNT.
There are two remaining implications:
(1) Per-node, per-cgroup and per-lruvec stats of "NR_ANON_MAPPED" ("mapped anonymous memory") and "NR_FILE_MAPPED" ("mapped file memory"):
As soon as any page of the folio is mapped -- folio_mapped() -- we now account the complete folio as mapped. Once the last page is unmapped -- !folio_mapped() -- we account the complete folio as unmapped.
This implies that ...
* "AnonPages" and "Mapped" in /proc/meminfo and /sys/devices/system/node/*/meminfo * cgroup v2: "anon" and "file_mapped" in "memory.stat" and "memory.numa_stat" * cgroup v1: "rss" and "mapped_file" in "memory.stat" and "memory.numa_stat
... can now appear higher than before. But note that these folios do consume that memory, simply not all pages are actually currently mapped.
It's worth noting that other accounting in the kernel (esp. cgroup charging on allocation) is not affected by this change.
[why oh why is "anon" called "rss" in cgroup v1]
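To make the accounting shift in (1) concrete, here is a minimal, hedged sketch -- a hypothetical helper, not the actual rmap accounting code -- of adjusting NR_ANON_MAPPED by the whole folio on the mapped/unmapped transitions rather than per page (NR_FILE_MAPPED would be handled analogously):

#include <linux/mm.h>
#include <linux/vmstat.h>

/* Hypothetical helper: whole-folio accounting under CONFIG_NO_PAGE_MAPCOUNT. */
static void sketch_account_anon_mapped(struct folio *folio, bool mapped)
{
        long nr = folio_nr_pages(folio);

        /*
         * Adjust by the folio size only on the !folio_mapped() <->
         * folio_mapped() transitions, no longer page by page.
         */
        __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, mapped ? nr : -nr);
}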
(2) Detecting partial mappings
Detecting whether anon THPs are partially mapped gets a bit more unreliable. As long as a single MM maps such a large folio ("exclusively mapped"), we can reliably detect it. Especially before fork() / after a short-lived child process quit, we will detect partial mappings reliably, which is the common case.
In essence, if the average per-page mapcount in an anon THP is < 1, we know for sure that we have a partial mapping.
However, as soon as multiple MMs are involved, we might miss detecting partial mappings: this might be relevant with long-lived child processes. If we have a fully-mapped anon folio before fork(), once our child processes and our parent all unmap (zap/COW) the same pages (but not the complete folio), we might not detect the partial mapping. However, once the child processes quit we would detect the partial mapping.
How relevant this case is in practice remains to be seen. Swapout/migration will likely mitigate this.
In the future, RMAP walkers could check for that case (e.g., when collecting access bits during reclaim) and simply flag such folios for deferred-splitting.
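The "average per-page mapcount < 1" heuristic from (2) can be sketched as follows; this is a hedged illustration with a hypothetical helper name, not the kernel's actual deferred-split logic:

#include <linux/mm.h>

/* Hypothetical helper: detect a partial mapping of an anon THP without
 * per-page mapcounts. */
static bool sketch_anon_thp_partially_mapped(struct folio *folio)
{
        if (!folio_test_large(folio) || !folio_test_anon(folio))
                return false;
        /*
         * Reliable while a single MM maps the folio ("exclusively mapped");
         * once multiple MMs are involved, partial mappings may be missed.
         */
        return folio_mapcount(folio) < folio_nr_pages(folio);
}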
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
003fde44 |
| 03-Mar-2025 |
David Hildenbrand <[email protected]> |
mm: convert folio_likely_mapped_shared() to folio_maybe_mapped_shared()
Let's reuse our new MM ownership tracking infrastructure for large folios to make folio_likely_mapped_shared() never return false negatives -- never indicating "not mapped shared" although the folio *is* mapped shared. With that, we can rename it to folio_maybe_mapped_shared() and get rid of the dependency on the mapcount of the first folio page.
The semantics are now arguably clearer: no mixture of "false negatives" and "false positives", only the remaining possibility for "false positives".
Thoroughly document the new semantics. We might now detect that a large folio is "maybe mapped shared" although it *no longer* is -- but once was. Now, if more than two MMs mapped a folio at the same time, and the MM mapping the folio exclusively at the end is not one tracked in the two folio MM slots, we will detect the folio as "maybe mapped shared".
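A hedged caller-side sketch of the new semantics (the wrapper is hypothetical and not taken from any call site): a "false" result is now authoritative, while "true" only means "maybe shared", possibly because the folio once was shared:

#include <linux/mm.h>

/* Hypothetical wrapper: may an exclusive-only optimization be applied? */
static bool sketch_may_treat_as_exclusive(struct folio *folio)
{
        /* "false" is authoritative: certainly not mapped by multiple MMs. */
        if (!folio_maybe_mapped_shared(folio))
                return true;
        /*
         * "true" only means "maybe shared" (possibly stale), so take the
         * conservative path.
         */
        return false;
}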
For anonymous folios, usually (except weird corner cases) all PTEs that target a "maybe mapped shared" folio are R/O. As soon as a child process would write to them (iow, actively use them), we would CoW and effectively replace these PTEs. Most cases (below) are not expected to really matter with large anonymous folios for this reason.
Most importantly, there will be no change at all for:
* small folios
* hugetlb folios
* PMD-mapped PMD-sized THPs (single mapping)
This change has the potential to affect existing callers of folio_likely_mapped_shared() -> folio_maybe_mapped_shared():
(1) fs/proc/task_mmu.c: no change (hugetlb)
(2) khugepaged counts PTEs that target shared folios towards max_ptes_shared (default: HPAGE_PMD_NR / 2), meaning we could skip a collapse where we would have previously collapsed. This only applies to anonymous folios and is not expected to matter in practice.
Worth noting that this change sorts out case (A) documented in commit 1bafe96e89f0 ("mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()") by removing the possibility for "false negatives".
(3) MADV_COLD / MADV_PAGEOUT / MADV_FREE will not try splitting PTE-mapped THPs that are considered shared but not fully covered by the requested range, consequently not processing them.
PMD-mapped PMD-sized THPs are not affected, nor are folios whose PTEs are all covered by the requested range. These functions are usually only called on anon/file folios that are exclusively mapped most of the time (no other file mappings or no fork()), so the "false negatives" are not expected to matter in practice.
(4) mbind() / migrate_pages() / move_pages() will refuse to migrate shared folios unless MPOL_MF_MOVE_ALL is effective (requires CAP_SYS_NICE). We will now reject some folios that could be migrated.
Similar to (3), especially with MPOL_MF_MOVE_ALL, so this is not expected to matter in practice.
Note that cpuset_migrate_mm_workfn() calls do_migrate_pages() with MPOL_MF_MOVE_ALL.
(5) NUMA hinting
mm/migrate.c:migrate_misplaced_folio_prepare() will skip file folios that are probably shared libraries (-> "mapped shared" and executable). This check would have detected it as a shared library at some point (at least 3 MMs mapping it), so detecting it afterwards does not sound wrong (still a shared library). Not expected to matter.
mm/memory.c:numa_migrate_check() will indicate TNF_SHARED in MAP_SHARED file mappings when encountering a shared folio. Similar reasoning, not expected to matter.
mm/mprotect.c:change_pte_range() will skip folios detected as shared in CoW mappings. Similarly, this is not expected to matter in practice, but if it would ever be a problem we could relax that check a bit (e.g., basing it on the average page-mapcount in a folio), because it was only an optimization when many (e.g., 288) processes were mapping the same folios -- see commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section pages.")
(6) mm/rmap.c:folio_referenced_one() will skip exclusive swapbacked folios in dying processes. Applies to anonymous folios only. Without "false negatives", we'll now skip all actually shared ones. Skipping ones that are actually exclusive won't really matter, it's a pure optimization, and is not expected to matter in practice.
In theory, one can detect the problematic scenario: folio_mapcount() > 0 and no folio MM slot is occupied ("state unknown"). One could reset the MM slots while doing an rmap walk, which migration / folio split already do when setting everything up. Further, when batching PTEs we might naturally learn about an owner (e.g., folio_mapcount() == nr_ptes) and could update the owner. However, we'll defer that until the scenarios where it would really matter are clear.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
44885447 |
| 03-Mar-2025 |
David Hildenbrand <[email protected]> |
mm/rmap: use folio_large_nr_pages() in add/remove functions
Let's just use the "large" variant in code where we are sure that we have a large folio in our hands: this way we are sure that we don't p
mm/rmap: use folio_large_nr_pages() in add/remove functions
Let's just use the "large" variant in code where we are sure that we have a large folio in our hands: this way we are sure that we don't perform any unnecessary "large" checks.
While at it, convert the VM_BUG_ON_VMA to a VM_WARN_ON_ONCE.
Maybe in the future there will not be a difference in that regard between large and small folios; in that case, unifying the handling again will be easy. E.g., folio_large_nr_pages() will simply translate to folio_nr_pages() until we replace all instances.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
932961c4 |
| 03-Mar-2025 |
David Hildenbrand <[email protected]> |
mm/rmap: abstract large mapcount operations for large folios (!hugetlb)
Let's abstract the operations so we can extend these operations easily.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
1862a4af |
| 03-Mar-2025 |
David Hildenbrand <[email protected]> |
mm/rmap: pass vma to __folio_add_rmap()
We'll need access to the destination MM when modifying the mapcount of large folios next. So pass in the VMA.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.14-rc5 |
|
| #
349994cf |
| 28-Feb-2025 |
Alistair Popple <[email protected]> |
mm/rmap: add support for PUD sized mappings to rmap
The rmap doesn't currently support adding a PUD mapping of a folio. This patch adds support for entire PUD mappings of folios, primarily to allow for more standard refcounting of device DAX folios. Currently DAX is the only user of this, and it doesn't require support for partially mapped PUD-sized folios, so we don't add that for now.
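A hedged illustration of the pairing this adds; the function names are inferred from the existing PMD-level rmap helpers and should be treated as an assumption rather than verbatim API:

#include <linux/rmap.h>

/* Hypothetical DAX-style usage: account an entire PUD-sized folio as
 * mapped or unmapped in the rmap, mirroring the PMD-level helpers. */
static void sketch_rmap_pud_folio(struct folio *folio, struct page *page,
                                  struct vm_area_struct *vma, bool map)
{
        if (map)
                folio_add_file_rmap_pud(folio, page, vma);
        else
                folio_remove_rmap_pud(folio, page, vma);
}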
Link: https://lkml.kernel.org/r/248582c07896e30627d1aeaeebc6949cfd91b851.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Dan Williams <[email protected]> Tested-by: Alison Schofield <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Asahi Lina <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Chunyan Zhang <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: linmiaohe <[email protected]> Cc: Logan Gunthorpe <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michael "Camp Drill Sergeant" Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Ted Ts'o <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: WANG Xuerui <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.14-rc4, v6.14-rc3 |
|
| #
2f9b43d6 |
| 14-Feb-2025 |
Barry Song <[email protected]> |
mm: avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
The try_to_unmap_one() function currently handles PMD-mapped THPs inefficiently. It first splits the PMD into PTEs, copies the dirty state from the PMD to the PTEs, iterates over the PTEs to locate the dirty state, and then marks the THP as swap-backed. This process involves unnecessary PMD splitting and redundant iteration. Instead, this functionality can be efficiently managed in __discard_anon_folio_pmd_locked(), avoiding the extra steps and improving performance.
The following microbenchmark redirties folios after invoking MADV_FREE, then measures the time taken to perform memory reclamation (actually setting those folios swapbacked again) on the redirtied folios.
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <time.h>

#define SIZE 128*1024*1024 // 128 MB

int main(int argc, char *argv[])
{
        while (1) {
                volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                memset((void *)p, 1, SIZE);
                madvise((void *)p, SIZE, MADV_FREE);
                /* redirty after MADV_FREE */
                memset((void *)p, 1, SIZE);

                clock_t start_time = clock();
                madvise((void *)p, SIZE, MADV_PAGEOUT);
                clock_t end_time = clock();

                double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
                printf("Time taken by reclamation: %f seconds\n", elapsed_time);

                munmap((void *)p, SIZE);
        }
        return 0;
}
Testing results are as below,

w/o patch:
~ # ./a.out
Time taken by reclamation: 0.007300 seconds
Time taken by reclamation: 0.007226 seconds
Time taken by reclamation: 0.007295 seconds
Time taken by reclamation: 0.007731 seconds
Time taken by reclamation: 0.007134 seconds
Time taken by reclamation: 0.007285 seconds
Time taken by reclamation: 0.007720 seconds
Time taken by reclamation: 0.007128 seconds
Time taken by reclamation: 0.007710 seconds
Time taken by reclamation: 0.007712 seconds
Time taken by reclamation: 0.007236 seconds
Time taken by reclamation: 0.007690 seconds
Time taken by reclamation: 0.007174 seconds
Time taken by reclamation: 0.007670 seconds
Time taken by reclamation: 0.007169 seconds
Time taken by reclamation: 0.007305 seconds
Time taken by reclamation: 0.007432 seconds
Time taken by reclamation: 0.007158 seconds
Time taken by reclamation: 0.007133 seconds
…
w/ patch
~ # ./a.out
Time taken by reclamation: 0.002124 seconds
Time taken by reclamation: 0.002116 seconds
Time taken by reclamation: 0.002150 seconds
Time taken by reclamation: 0.002261 seconds
Time taken by reclamation: 0.002137 seconds
Time taken by reclamation: 0.002173 seconds
Time taken by reclamation: 0.002063 seconds
Time taken by reclamation: 0.002088 seconds
Time taken by reclamation: 0.002169 seconds
Time taken by reclamation: 0.002124 seconds
Time taken by reclamation: 0.002111 seconds
Time taken by reclamation: 0.002224 seconds
Time taken by reclamation: 0.002297 seconds
Time taken by reclamation: 0.002260 seconds
Time taken by reclamation: 0.002246 seconds
Time taken by reclamation: 0.002272 seconds
Time taken by reclamation: 0.002277 seconds
Time taken by reclamation: 0.002462 seconds
…
This patch significantly speeds up try_to_unmap_one() by allowing it to skip redirtied THPs without splitting the PMD.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Suggested-by: Baolin Wang <[email protected]> Suggested-by: Lance Yang <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Reviewed-by: Lance Yang <[email protected]> Cc: Albert Ou <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chis Li <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Gavin Shan <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kairui Song <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mauricio Faria de Oliveira <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shaoqin Huang <[email protected]> Cc: Tangquan Zheng <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yicong Yang <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
354dffd2 |
| 14-Feb-2025 |
Barry Song <[email protected]> |
mm: support batched unmap for lazyfree large folios during reclamation
Currently, the PTEs and rmap of a large folio are removed one at a time. This is not only slow but also causes the large folio to be unnecessarily added to deferred_split, which can lead to races between the deferred_split shrinker callback and memory reclamation. This patch releases all PTEs and rmap entries in a batch. Currently, it only handles lazyfree large folios.
The below microbench tries to reclaim 128MB lazyfree large folios whose sizes are 64KiB:
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <time.h>

#define SIZE 128*1024*1024 // 128 MB

unsigned long read_split_deferred()
{
        FILE *file = fopen("/sys/kernel/mm/transparent_hugepage"
                        "/hugepages-64kB/stats/split_deferred", "r");
        if (!file) {
                perror("Error opening file");
                return 0;
        }

        unsigned long value;
        if (fscanf(file, "%lu", &value) != 1) {
                perror("Error reading value");
                fclose(file);
                return 0;
        }

        fclose(file);
        return value;
}

int main(int argc, char *argv[])
{
        while (1) {
                volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                memset((void *)p, 1, SIZE);

                madvise((void *)p, SIZE, MADV_FREE);

                clock_t start_time = clock();
                unsigned long start_split = read_split_deferred();
                madvise((void *)p, SIZE, MADV_PAGEOUT);
                clock_t end_time = clock();
                unsigned long end_split = read_split_deferred();

                double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
                printf("Time taken by reclamation: %f seconds, split_deferred: %ld\n",
                                elapsed_time, end_split - start_split);

                munmap((void *)p, SIZE);
        }
        return 0;
}
w/o patch:
~ # ./a.out
Time taken by reclamation: 0.177418 seconds, split_deferred: 2048
Time taken by reclamation: 0.178348 seconds, split_deferred: 2048
Time taken by reclamation: 0.174525 seconds, split_deferred: 2048
Time taken by reclamation: 0.171620 seconds, split_deferred: 2048
Time taken by reclamation: 0.172241 seconds, split_deferred: 2048
Time taken by reclamation: 0.174003 seconds, split_deferred: 2048
Time taken by reclamation: 0.171058 seconds, split_deferred: 2048
Time taken by reclamation: 0.171993 seconds, split_deferred: 2048
Time taken by reclamation: 0.169829 seconds, split_deferred: 2048
Time taken by reclamation: 0.172895 seconds, split_deferred: 2048
Time taken by reclamation: 0.176063 seconds, split_deferred: 2048
Time taken by reclamation: 0.172568 seconds, split_deferred: 2048
Time taken by reclamation: 0.171185 seconds, split_deferred: 2048
Time taken by reclamation: 0.170632 seconds, split_deferred: 2048
Time taken by reclamation: 0.170208 seconds, split_deferred: 2048
Time taken by reclamation: 0.174192 seconds, split_deferred: 2048
...
w/ patch:
~ # ./a.out
Time taken by reclamation: 0.074231 seconds, split_deferred: 0
Time taken by reclamation: 0.071026 seconds, split_deferred: 0
Time taken by reclamation: 0.072029 seconds, split_deferred: 0
Time taken by reclamation: 0.071873 seconds, split_deferred: 0
Time taken by reclamation: 0.073573 seconds, split_deferred: 0
Time taken by reclamation: 0.071906 seconds, split_deferred: 0
Time taken by reclamation: 0.073604 seconds, split_deferred: 0
Time taken by reclamation: 0.075903 seconds, split_deferred: 0
Time taken by reclamation: 0.073191 seconds, split_deferred: 0
Time taken by reclamation: 0.071228 seconds, split_deferred: 0
Time taken by reclamation: 0.071391 seconds, split_deferred: 0
Time taken by reclamation: 0.071468 seconds, split_deferred: 0
Time taken by reclamation: 0.071896 seconds, split_deferred: 0
Time taken by reclamation: 0.072508 seconds, split_deferred: 0
Time taken by reclamation: 0.071884 seconds, split_deferred: 0
Time taken by reclamation: 0.072433 seconds, split_deferred: 0
Time taken by reclamation: 0.071939 seconds, split_deferred: 0
...
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Cc: Albert Ou <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chis Li <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Gavin Shan <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kairui Song <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Lance Yang <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mauricio Faria de Oliveira <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shaoqin Huang <[email protected]> Cc: Tangquan Zheng <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yicong Yang <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
2f4ab3ac |
| 14-Feb-2025 |
Barry Song <[email protected]> |
mm: support tlbbatch flush for a range of PTEs
This patch lays the groundwork for supporting batch PTE unmapping in try_to_unmap_one(). It introduces range handling for TLB batch flushing, with the range currently set to the size of PAGE_SIZE.
The function __flush_tlb_range_nosync() is architecture-specific and is only used within arch/arm64. This function requires the mm structure instead of the vma structure. To allow its reuse by arch_tlbbatch_add_pending(), which operates with mm but not vma, this patch modifies the argument of __flush_tlb_range_nosync() to take mm as its parameter.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Acked-by: Will Deacon <[email protected]> Reviewed-by: Kefeng Wang <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shaoqin Huang <[email protected]> Cc: Gavin Shan <[email protected]> Cc: Mark Rutland <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Lance Yang <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Albert Ou <[email protected]> Cc: Yicong Yang <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Chis Li <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Kairui Song <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mauricio Faria de Oliveira <[email protected]> Cc: Tangquan Zheng <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
faeb2831 |
| 14-Feb-2025 |
Barry Song <[email protected]> |
mm: set folio swapbacked iff folios are dirty in try_to_unmap_one
Patch series "mm: batched unmap lazyfree large folios during reclamation", v4.
Commit 735ecdfaf4e8 ("mm/vmscan: avoid split lazyfre
mm: set folio swapbacked iff folios are dirty in try_to_unmap_one
Patch series "mm: batched unmap lazyfree large folios during reclamation", v4.
Commit 735ecdfaf4e8 ("mm/vmscan: avoid split lazyfree THP during shrink_folio_list()") prevents the splitting of MADV_FREE'd THP in madvise.c.
However, those folios are still added to the deferred_split list in try_to_unmap_one() because we are unmapping PTEs and removing rmap entries one by one.
Firstly, this has rendered the following counter somewhat confusing:

  /sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats/split_deferred

The split_deferred counter was originally designed to track operations such as partial unmap or madvise of large folios. However, in practice, most split_deferred cases arise from memory reclamation of aligned lazyfree mTHPs, as observed by Tangquan. This discrepancy has made the split_deferred counter highly misleading.
Secondly, this approach is slow because it requires iterating through each PTE and removing the rmap one by one for a large folio. In fact, all PTEs of a pte-mapped large folio should be unmapped at once, and the entire folio should be removed from the rmap as a whole.
Thirdly, it also increases the risk of a race condition where lazyfree folios are incorrectly set back to swapbacked, as a speculative folio_get may occur in the shrinker's callback.
deferred_split_scan() might call folio_try_get(folio) since we have added the folio to the split_deferred list while removing the rmap for the 1st subpage. While we are scanning the 2nd to nr_pages PTEs of this folio in try_to_unmap_one(), the entire mTHP could be transitioned back to swap-backed because the reference count is incremented, which can make "ref_count == 1 + map_count" within try_to_unmap_one() false.
        /*
         * The only page refs must be one from isolation
         * plus the rmap(s) (dropped by discard:).
         */
        if (ref_count == 1 + map_count &&
            (!folio_test_dirty(folio) ||
             ...
             (vma->vm_flags & VM_DROPPABLE))) {
                dec_mm_counter(mm, MM_ANONPAGES);
                goto discard;
        }
This patchset resolves the issue by marking only genuinely dirty folios as swap-backed, as suggested by David, and by transitioning to batched unmapping of entire folios in try_to_unmap_one(). Consequently, the deferred_split count drops to zero, and memory reclamation performance improves significantly: reclaiming 64KiB lazyfree large folios is now 2.5x faster (the specific data is embedded in the changelog of patch 3/4).
By the way, while the patchset is primarily aimed at PTE-mapped large folios, Baolin and Lance also found that try_to_unmap_one() handles lazyfree redirtied PMD-mapped large folios inefficiently — it splits the PMD into PTEs and iterates over them. This patchset removes the unnecessary splitting, enabling us to skip redirtied PMD-mapped large folios 3.5X faster during memory reclamation. (The specific data can be found in the changelog of patch 4/4).
This patch (of 4):
The refcount may be temporarily or long-term increased, but this does not change the fundamental nature of the folio already being lazy- freed. Therefore, we only reset 'swapbacked' when we are certain the folio is dirty and not droppable.
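A minimal, hedged sketch of that policy (a hypothetical helper, not the exact try_to_unmap_one() hunk): only a genuinely dirty folio in a non-VM_DROPPABLE mapping is moved back to the swap-backed state; an elevated refcount alone is no longer a reason.

#include <linux/mm.h>

/* Hypothetical helper: when is a lazyfree (MADV_FREE) anon folio
 * re-marked swapbacked during unmap? */
static void sketch_maybe_set_swapbacked(struct folio *folio,
                                        struct vm_area_struct *vma)
{
        if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE))
                folio_set_swapbacked(folio);
}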
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 6c8e2a256915 ("mm: fix race between MADV_FREE reclaim and blkdev direct IO read") Signed-off-by: Barry Song <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Reviewed-by: Lance Yang <[email protected]> Cc: Mauricio Faria de Oliveira <[email protected]> Cc: Chis Li <[email protected]> (Google) Cc: "Huang, Ying" <[email protected]> Cc: Kairui Song <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Tangquan Zheng <[email protected]> Cc: Albert Ou <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Gavin Shan <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Shaoqin Huang <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yicong Yang <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.14-rc2 |
|
| #
a4811f53 |
| 08-Feb-2025 |
Lorenzo Stoakes <[email protected]> |
mm: provide mapping_wrprotect_range() function
In the fb_defio video driver, page dirty state is used to determine when frame buffer pages have been changed, allowing for batched, deferred I/O to be performed for efficiency.
This implementation had only one means of doing so effectively - the use of the folio_mkclean() function.
However, this use of the function is inappropriate, as the fb_defio implementation allocates kernel memory to back the framebuffer, and is then forced to specify the page->index and mapping fields in order to permit the folio_mkclean() rmap traversal to proceed correctly.

It is not correct to specify these fields on kernel-allocated memory; moreover, since these are not folios, the page->index and mapping fields are deprecated and soon to be removed.
We therefore need to provide a means by which we can correctly traverse the reverse mapping and write-protect mappings for a page backing an address_space page cache object at a given offset.
This patch provides this - mapping_wrprotect_range() - which allows for this operation to be performed for a specified address_space, offset, PFN and size, without requiring a folio nor, of course, an inappropriate use of page->index, mapping.
With this provided, we can subsequently adjust the fb_defio implementation to make use of this function and avoid incorrect invocation of folio_mkclean() and more importantly, incorrect manipulation of page->index and mapping fields.
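A hedged usage sketch for an fb_defio-style driver; the parameter order and types are assumed from the description above (address_space, offset, PFN, size), and the wrapper is hypothetical rather than the actual fb_defio change:

#include <linux/rmap.h>

/* Hypothetical wrapper: write-protect all userspace mappings of one
 * framebuffer page identified by its page-cache offset and PFN, so that
 * the next write re-triggers the mkwrite fault handler. */
static void sketch_defio_wrprotect_page(struct address_space *mapping,
                                        pgoff_t pgoff, unsigned long pfn)
{
        mapping_wrprotect_range(mapping, pgoff, pfn, 1);
}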
Link: https://lkml.kernel.org/r/e5bf969d64e7f2f2ae944d42341fc8994b736a81.1739029358.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Helge Deller <[email protected]> Cc: Jaya Kumar <[email protected]> Cc: Kajtar Zsolt <[email protected]> Cc: Maíra Canal <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Simona Vetter <[email protected]> Cc: Thomas Zimemrmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
cedae194 |
| 08-Feb-2025 |
Lorenzo Stoakes <[email protected]> |
mm: refactor rmap_walk_file() to separate out traversal logic
Patch series "expose mapping wrprotect, fix fb_defio use", v3.
Right now the only means by which we can write-protect a range using the
mm: refactor rmap_walk_file() to separate out traversal logic
Patch series "expose mapping wrprotect, fix fb_defio use", v3.
Right now the only means by which we can write-protect a range using the reverse mapping is via folio_mkclean().
However this is not always the appropriate means of doing so, specifically in the case of the framebuffer deferred I/O logic (fb_defio, enabled by CONFIG_FB_DEFERRED_IO). There, kernel pages are mapped read-only and write-protect faults are used to batch up I/O operations.
Each time the deferred work is done, folio_mkclean() is used to mark the framebuffer page as having had I/O performed on it. However doing so requires the kernel page (perhaps allocated via vmalloc()) to have its page->mapping, index fields set so the rmap can find everything that maps it in order to write-protect.
This is problematic as firstly, these fields should not be set for kernel-allocated memory, and secondly these are not folios (it's not user memory) and page->index, mapping fields are now deprecated and soon to be removed.
The removal of these fields is imminent, rendering this series more urgent than it might first appear.
The implementers cannot be blamed for having used this however, as there is simply no other way of performing this operation correctly.
This series fixes this - we provide the mapping_wrprotect_range() function to allow the reverse mapping to be used to look up mappings from the page cache object (i.e. its address_space pointer) at a specific offset.
The fb_defio logic already stores this offset, and can simply be expanded to keep track of the page cache object, so the change then becomes straightforward.
This series should have no functional change.
This patch (of 3):
In order to permit the traversal of the reverse mapping at a specified mapping and offset rather than those specified by an input folio, we need to separate out the portion of the rmap file logic which deals with this traversal from those parts of the logic which interact with the folio.
This patch achieves this by adding a new static __rmap_walk_file() function which rmap_walk_file() invokes.
This function permits the ability to pass NULL folio, on the assumption that the caller has provided for this correctly in the callbacks specified in the rmap_walk_control object.
Though it provides for this, it also adds debug asserts to ensure that, should a folio be specified, its mapping and offset match those passed in; there should be no functional change as a result of this patch.

The reason for adding this is to enable future changes that permit users to traverse mappings of userland-mapped kernel memory, write-protecting those mappings so that page_mkwrite() or pfn_mkwrite() fault handlers are retriggered when the pages are subsequently dirtied.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/0d1acec0cba1e5a12f9b53efcabc397541c90517.1739029358.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Helge Deller <[email protected]> Cc: Jaya Kumar <[email protected]> Cc: Kajtar Zsolt <[email protected]> Cc: Maíra Canal <[email protected]> Cc: Matthew Wilcox <[email protected]> [English fixes] Cc: Simona Vetter <[email protected]> Cc: Thomas Zimemrmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
b8932ca8 |
| 10-Feb-2025 |
David Hildenbrand <[email protected]> |
mm/rmap: avoid -EBUSY from make_device_exclusive()
Failing to obtain the folio lock, for example because the folio is concurrently getting migrated or swapped out, can easily make the callers fail: for example, the hmm selftest can sometimes be observed to fail because of this. Instead of forcing the caller to retry, let's simply retry in this to-be-expected case.
Similarly, avoid spurious failures simply because we raced with someone (e.g., swapout) modifying the page table such that our folio_walk fails.
Simply unconditionally lock the folio, and retry GUP if our folio_walk fails. Note that the folio_walk repeatedly failing is not something we expect.
Note that we might want to avoid grabbing the folio lock at some point; for now, keep that as is and only unconditionally lock the folio.
With this change, the hmm selftests don't fail simply because the folio is already locked. While this fixes the selftests in some cases, it's likely not something that deserves a "Fixes:".
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Tested-by: Alistair Popple <[email protected]> Cc: Alex Shi <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Dave Airlie <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Karol Herbst <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Lyude <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Simona Vetter <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yanteng Si <[email protected]> Cc: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
f495bd7e |
| 10-Feb-2025 |
David Hildenbrand <[email protected]> |
mm/rmap: keep mapcount untouched for device-exclusive entries
Now that conversion to device-exclusive no longer performs an rmap walk and all page_vma_mapped_walk() users were taught to properly handle device-exclusive entries, let's treat device-exclusive entries just as if they were present, similar to how we already handle device-private entries.
This fixes swapout/migration/split/hwpoison of folios with device-exclusive entries.
We only had to take care of page_vma_mapped_walk() users, because these traditionally assume pte_present(). Other page table walkers already have to handle !pte_present(), and some of them might simply skip them (e.g., MADV_PAGEOUT) if they are not specialized on them. This change doesn't modify the latter.
Note that while folios with device-exclusive PTEs can now get migrated, khugepaged will not collapse a THP if there is device-exclusive PTE. Doing so might also not be desired if the device frequently performs atomics to the same page. Similarly, KSM will never merge order-0 folios that are device-exclusive.
Link: https://lkml.kernel.org/r/[email protected] Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <[email protected]> Tested-by: Alistair Popple <[email protected]> Cc: Alex Shi <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Dave Airlie <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Karol Herbst <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Lyude <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Simona Vetter <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yanteng Si <[email protected]> Cc: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
bd1f2b2a |
| 10-Feb-2025 |
David Hildenbrand <[email protected]> |
mm/rmap: handle device-exclusive entries correctly in page_vma_mkclean_one()
Ever since commit b756a3b5e7ea ("mm: device exclusive memory access") we can return with a device-exclusive entry from page_vma_mapped_walk().
page_vma_mkclean_one() is not prepared for that, so teach it about these PFN swap PTEs. Note that device-private entries are so far not applicable on that path, as we expect ZONE_DEVICE pages so far only in migration code when it comes to the RMAP.
Note that we could currently only run into this case with device-exclusive entries on THPs. We still adjust the mapcount on conversion to device-exclusive; this makes the rmap walk abort early for small folios, because we'll always have !folio_mapped() with a single device-exclusive entry. We'll adjust the mapcount logic once all page_vma_mapped_walk() users can properly handle device-exclusive entries.
Link: https://lkml.kernel.org/r/[email protected] Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <[email protected]> Tested-by: Alistair Popple <[email protected]> Cc: Alex Shi <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Dave Airlie <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Karol Herbst <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Lyude <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Simona Vetter <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yanteng Si <[email protected]> Cc: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
bf983108 |
| 10-Feb-2025 |
David Hildenbrand <[email protected]> |
mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one()
Ever since commit b756a3b5e7ea ("mm: device exclusive memory access") we can return with a device-exclusive entry from page_vma_mapped_walk().
try_to_migrate_one() is not prepared for that, so teach it about these PFN swap PTEs. We already handle device-private entries by specializing on the folio, so we can reshuffle that code to make it work on the PFN swap PTEs instead.
Get rid of the folio_is_device_private() handling. Note that we never currently expect device-private folios with HWPoison flag set at that point, so add a warning in case that ever changes and we can figure out what the right thing to do is.
Note that we could currently only run into this case with device-exclusive entries on THPs. We still adjust the mapcount on conversion to device-exclusive; this makes the rmap walk abort early for small folios, because we'll always have !folio_mapped() with a single device-exclusive entry. We'll adjust the mapcount logic once all page_vma_mapped_walk() users can properly handle device-exclusive entries.
Further note that try_to_migrate() calls MMU notifiers and holds the folio lock, so any device-exclusive users should be properly prepared for a device-exclusive PTE to "vanish".
Link: https://lkml.kernel.org/r/[email protected] Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <[email protected]> Tested-by: Alistair Popple <[email protected]> Cc: Alex Shi <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Dave Airlie <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Karol Herbst <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Lyude <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Simona Vetter <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yanteng Si <[email protected]> Cc: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
65529295 |
| 10-Feb-2025 |
David Hildenbrand <[email protected]> |
mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one()
Ever since commit b756a3b5e7ea ("mm: device exclusive memory access") we can return with a device-exclusive entry from page_vma_mapped_walk().
try_to_unmap_one() is not prepared for that, so teach it about these PFN swap PTEs. Note that device-private entries are so far not applicable on that path, as we expect ZONE_DEVICE pages so far only in migration code when it comes to the RMAP.
Note that we could currently only run into this case with device-exclusive entries on THPs. We still adjust the mapcount on conversion to device-exclusive; this makes the rmap walk abort early for small folios, because we'll always have !folio_mapped() with a single device-exclusive entry. We'll adjust the mapcount logic once all page_vma_mapped_walk() users can properly handle device-exclusive entries.
Further note that try_to_unmap() calls MMU notifiers and holds the folio lock, so any device-exclusive users should be properly prepared for a device-exclusive PTE to "vanish".
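A hedged sketch of the pattern this and the sibling patches teach the rmap walkers (a hypothetical helper, not the actual try_to_unmap_one() code): a device-exclusive entry is a non-present PFN swap PTE, and the PFN can be taken from it much like from a present PTE.

#include <linux/swapops.h>

/* Hypothetical helper: extract the PFN from a PTE handed out by
 * page_vma_mapped_walk(), which may now be either present or a
 * device-exclusive (PFN swap) entry. */
static unsigned long sketch_pvmw_pte_to_pfn(pte_t pteval)
{
        if (pte_present(pteval))
                return pte_pfn(pteval);
        return swp_offset_pfn(pte_to_swp_entry(pteval));
}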
Link: https://lkml.kernel.org/r/[email protected] Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <[email protected]> Tested-by: Alistair Popple <[email protected]> Cc: Alex Shi <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Dave Airlie <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Karol Herbst <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Lyude <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Simona Vetter <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yanteng Si <[email protected]> Cc: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
c25465eb |
| 10-Feb-2025 |
David Hildenbrand <[email protected]> |
mm: use single SWP_DEVICE_EXCLUSIVE entry type
There is no need for the distinction anymore; let's merge the readable and writable device-exclusive entries into a single device-exclusive entry type.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Simona Vetter <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Tested-by: Alistair Popple <[email protected]> Cc: Alex Shi <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Dave Airlie <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Karol Herbst <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Lyude <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yanteng Si <[email protected]> Cc: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
43835472 |
| 10-Feb-2025 |
David Hildenbrand <[email protected]> |
mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
We require a writable PTE and only support anonymous folios: we can only have exactly one PTE pointing at that page, which we can just look up using a folio walk, avoiding the rmap walk and the anon VMA lock.
So let's stop doing an rmap walk and perform a folio walk instead, so we can easily just modify a single PTE and avoid relying on rmap/mapcounts.
We now effectively work on a single PTE instead of multiple PTEs of a large folio, allowing for conversion of individual PTEs from non-exclusive to device-exclusive -- note that the opposite direction always works on single PTEs: restore_exclusive_pte().
With this change, device-exclusive handling is fully compatible with THPs / large folios. We still require PMD-sized THPs to get PTE-mapped, and supporting PMD-mapped THP (without the PTE-remapping) is a different endeavour that might not be worth it at this point: it might even have negative side-effects [1].
This gets rid of the "folio_mapcount()" usage and let's us fix ordinary rmap walks (migration/swapout) next. Spell out that messing with the mapcount is wrong and must be fixed.
[1] https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Tested-by: Alistair Popple <[email protected]> Cc: Alex Shi <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Dave Airlie <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Karol Herbst <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Lyude <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Simona Vetter <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yanteng Si <[email protected]> Cc: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
599b684a |
| 10-Feb-2025 |
David Hildenbrand <[email protected]> |
mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()
The single "real" user in the tree of make_device_exclusive_range() always requests making only a single address exclusive.
mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()
The single "real" user in the tree of make_device_exclusive_range() always requests making only a single address exclusive. The current implementation is hard to fix for properly supporting anonymous THP / large folios and for avoiding messing with rmap walks in weird ways.
So let's always process a single address/page and return folio + page to minimize page -> folio lookups. This is a preparation for further changes.
Reject any non-anonymous or hugetlb folios early, directly after GUP.
While at it, extend the documentation of make_device_exclusive() to clarify some things.
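A hedged caller sketch; the prototype is assumed from the description (one address in, folio and page out) and the unlock/put cleanup reflects the usual locked-and-referenced return convention, so treat the details as assumptions rather than verbatim driver code:

#include <linux/err.h>
#include <linux/pagemap.h>
#include <linux/rmap.h>

/* Hypothetical caller: make a single address exclusive to the device. */
static int sketch_make_addr_device_exclusive(struct mm_struct *mm,
                                             unsigned long addr, void *owner)
{
        struct folio *folio;
        struct page *page;

        page = make_device_exclusive(mm, addr, owner, &folio);
        if (IS_ERR(page))
                return PTR_ERR(page);

        /* ... program the device to access the page ... */

        folio_unlock(folio);
        folio_put(folio);
        return 0;
}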
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Simona Vetter <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Tested-by: Alistair Popple <[email protected]> Cc: Alex Shi <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Dave Airlie <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Karol Herbst <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Lyude <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yanteng Si <[email protected]> Cc: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
bc3fe680 |
| 10-Feb-2025 |
David Hildenbrand <[email protected]> |
mm/rmap: reject hugetlb folios in folio_make_device_exclusive()
Even though FOLL_SPLIT_PMD on hugetlb now always fails with -EOPNOTSUPP, let's add a safety net in case FOLL_SPLIT_PMD usage would ever be reworked.
In particular, before commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code"), GUP(FOLL_SPLIT_PMD) would just have returned a page. In particular, hugetlb folios that are not PMD-sized would never have been prone to FOLL_SPLIT_PMD.
hugetlb folios can be anonymous, and page_make_device_exclusive_one() is not really prepared for handling them at all. So let's spell that out.
Link: https://lkml.kernel.org/r/[email protected] Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Tested-by: Alistair Popple <[email protected]> Cc: Alex Shi <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Dave Airlie <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Karol Herbst <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Lyude <[email protected]> Cc: "Masami Hiramatsu (Google)" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Simona Vetter <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yanteng Si <[email protected]> Cc: Barry Song <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2 |
|
| #
68158bfa |
| 05-Oct-2024 |
Matthew Wilcox (Oracle) <[email protected]> |
mm: mass constification of folio/page pointers
Now that page_pgoff() takes const pointers, we can constify the pointers to a lot of functions.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
713da0b3 |
| 05-Oct-2024 |
Matthew Wilcox (Oracle) <[email protected]> |
mm: renovate page_address_in_vma()
This function doesn't modify any of its arguments, so if we make a few other functions take const pointers, we can make page_address_in_vma() take const pointers too. All of its callers have the containing folio already, so pass that in as an argument instead of recalculating it. Also add kernel-doc.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
7d3e93ec |
| 05-Oct-2024 |
Matthew Wilcox (Oracle) <[email protected]> |
mm: use page_pgoff() in more places
There are several places which currently open-code page_pgoff(), convert them to call it.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
f7470591 |
| 05-Oct-2024 |
Matthew Wilcox (Oracle) <[email protected]> |
mm: convert page_to_pgoff() to page_pgoff()
Patch series "page->index removals in mm", v2.
As part of shrinking struct page, we need to stop using page->index. This patchset gets rid of most of th
mm: convert page_to_pgoff() to page_pgoff()
Patch series "page->index removals in mm", v2.
As part of shrinking struct page, we need to stop using page->index. This patchset gets rid of most of the remaining references to page->index in mm, as well as increasing the number of functions which take a const folio/page pointer. It shrinks the text segment of mm by a few hundred bytes in my test config, probably mostly from removing calls to compound_head() in page_to_pgoff().
This patch (of 7):
Change the function signature to pass in the folio, as all three callers have it. This removes a reference to page->index, which we're trying to get rid of. It also adds kernel-doc.
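A hedged sketch of the converted helper, reconstructed from the description (folio passed in by the callers, page->index no longer touched); not copied from the header:

#include <linux/mm.h>

/* Sketch: offset of @page within its mapping, derived from the folio
 * rather than from the deprecated page->index. */
static inline pgoff_t sketch_page_pgoff(const struct folio *folio,
                                        const struct page *page)
{
        return folio->index + folio_page_idx(folio, page);
}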
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|