History log of /linux-6.15/include/linux/swapops.h (Results 1 – 25 of 67)
Revision    Date    Author    Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3
# c25465eb 10-Feb-2025 David Hildenbrand <[email protected]>

mm: use single SWP_DEVICE_EXCLUSIVE entry type

There is no need for the distinction anymore; let's merge the readable and
writable device-exclusive entries into a single device-exclusive entry
type.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Simona Vetter <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Tested-by: Alistair Popple <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Danilo Krummrich <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Karol Herbst <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Lyude <[email protected]>
Cc: "Masami Hiramatsu (Google)" <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yanteng Si <[email protected]>
Cc: Barry Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
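
As a rough sketch of what the merged entry type looks like, the helpers below follow the existing make_*/is_* pattern in swapops.h; names and details are illustrative assumptions, not necessarily the exact committed code:

static inline swp_entry_t make_device_exclusive_entry(pgoff_t offset)
{
	return swp_entry(SWP_DEVICE_EXCLUSIVE, offset);
}

static inline bool is_device_exclusive_entry(swp_entry_t entry)
{
	/* A single entry type now covers both the readable and writable cases. */
	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE;
}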



Revision tags: v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6
# 7c53dfbd 28-Oct-2024 Lorenzo Stoakes <[email protected]>

mm: add PTE_MARKER_GUARD PTE marker

Add a new PTE marker that results in any access causing the accessing
process to segfault.

This is preferable to PTE_MARKER_POISONED, which results in the same
handling as hardware poisoned memory, and is thus undesirable for cases
where we simply wish to 'soft' poison a range.

This is in preparation for implementing the ability to specify guard pages
at the page table level, i.e. ranges that, when accessed, should cause
process termination.

Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the
function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single
purpose was simply incorrect.

We then reuse the same logic to determine whether a zap should clear a
guard entry - this should only be performed on teardown and never on
MADV_DONTNEED or MADV_FREE.

We additionally add a WARN_ON_ONCE() in hugetlb logic should a guard
marker be encountered there, as we explicitly do not support this
operation and this should not occur.

Link: https://lkml.kernel.org/r/f47f3d5acca2dcf9bbf655b6d33f3dc713e4a4a0.1730123433.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Suggested-by: Vlastimil Babka <[email protected]>
Suggested-by: Jann Horn <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: James E.J. Bottomley <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
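
A minimal sketch of the kind of marker bit and helper this adds; the bit position and helper name are assumptions for illustration:

#define PTE_MARKER_GUARD	BIT(2)

static inline bool is_guard_swp_entry(swp_entry_t entry)
{
	/* A guard entry is a pte marker carrying the guard bit. */
	return is_pte_marker_entry(entry) &&
	       (pte_marker_get(entry) & PTE_MARKER_GUARD);
}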



Revision tags: v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7
# e6c0c032 02-Jul-2024 Christophe Leroy <[email protected]>

mm: provide mm_struct and address to huge_ptep_get()

On powerpc 8xx huge_ptep_get() will need to know whether the given ptep is
a PTE entry or a PMD entry. This cannot be known with the PMD entry
itself because there is no easy way to know it from the content of the
entry.

So huge_ptep_get() will need to know either the size of the page or get
the pmd.

In order to be consistent with huge_ptep_get_and_clear(), give mm and
address to huge_ptep_get().

Link: https://lkml.kernel.org/r/cc00c70dd384298796a4e1b25d6c4eb306d3af85.1719928057.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
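
A sketch of the resulting signature: the generic fallback can simply ignore the extra context, while powerpc 8xx can use mm/addr to tell a PTE entry from a PMD entry (illustrative, not the exact committed code):

static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr,
				  pte_t *ptep)
{
	/* Generic case: the extra mm/address context is unused. */
	return ptep_get(ptep);
}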



Revision tags: v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3
# 07a57a33 07-Apr-2024 Oscar Salvador <[email protected]>

mm,swapops: update check in is_pfn_swap_entry for hwpoison entries

Tony reported that the Machine check recovery was broken in v6.9-rc1, as
he was hitting a VM_BUG_ON when injecting uncorrectable memory errors to
DRAM.

After some more digging and debugging on his side, he realized that this
went back to v6.1, with the introduction of 'commit 0d206b5d2e0d
("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")'. That
commit, among other things, introduced swp_offset_pfn(), replacing
hwpoison_entry_to_pfn() in its favour.

The patch also introduced a VM_BUG_ON() check for is_pfn_swap_entry(), but
is_pfn_swap_entry() never got updated to cover hwpoison entries, which
means that we would hit the VM_BUG_ON whenever we would call
swp_offset_pfn() for such entries on environments with CONFIG_DEBUG_VM
set. Fix this by updating the check to cover hwpoison entries as well,
and update the comment while we are at it.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
Signed-off-by: Oscar Salvador <[email protected]>
Reported-by: Tony Luck <[email protected]>
Closes: https://lore.kernel.org/all/Zg8kLSl2yAlA3o5D@agluck-desk3/
Tested-by: Tony Luck <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Cc: <[email protected]> [6.1.x]
Signed-off-by: Andrew Morton <[email protected]>
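
The fix amounts to widening the check roughly as sketched below (a reconstruction from the description above, so details may differ):

static inline bool is_pfn_swap_entry(swp_entry_t entry)
{
	/* Make sure the swp offset can always store the needed fields. */
	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);

	return is_migration_entry(entry) || is_device_private_entry(entry) ||
	       is_device_exclusive_entry(entry) || is_hwpoison_entry(entry);
}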



Revision tags: v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1
# 5662400a 11-Jan-2024 Matthew Wilcox (Oracle) <[email protected]>

mm: add pfn_swap_entry_folio()

Patch series "mm: convert mm counter to take a folio", v3.

Make sure all mm_counter() and mm_counter_file() callers have a folio,
then convert mm counter functions to take a folio, which saves some
compound_head() calls.


This patch (of 10):

Thanks to the compound_head() hidden inside PageLocked(), this saves a
call to compound_head() over calling page_folio(pfn_swap_entry_to_page()).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kefeng Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
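
A minimal sketch of the new helper as described, going straight from the PFN to the folio so no compound_head() call is needed (the real code may differ in detail):

static inline struct folio *pfn_swap_entry_folio(swp_entry_t entry)
{
	/* Resolve the PFN directly to a folio instead of page_folio(page). */
	return pfn_folio(swp_offset_pfn(entry));
}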



Revision tags: v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1
# af19487f 07-Jul-2023 Axel Rasmussen <[email protected]>

mm: make PTE_MARKER_SWAPIN_ERROR more general

Patch series "add UFFDIO_POISON to simulate memory poisoning with UFFD",
v4.

This series adds a new userfaultfd feature, UFFDIO_POISON. See commit 4
for a detailed description of the feature.


This patch (of 8):

Future patches will reuse PTE_MARKER_SWAPIN_ERROR to implement
UFFDIO_POISON, so make various preparations for that:

First, rename it to just PTE_MARKER_POISONED. The "SWAPIN" can be
confusing since we're going to re-use it for something not really related
to swap. This can be particularly confusing for things like hugetlbfs,
which doesn't support swap whatsoever. Also rename various helper
functions.

Next, fix pte marker copying for hugetlbfs. Previously, it would WARN on
seeing a PTE_MARKER_SWAPIN_ERROR, since hugetlbfs doesn't support swap.
But, since we're going to re-use it, we want it to go ahead and copy it
just like non-hugetlbfs memory does today. Since the code to do this is
more complicated now, pull it out into a helper which can be re-used in
both places. While we're at it, also make it slightly more explicit in
its handling of e.g. uffd wp markers.

For non-hugetlbfs page faults, instead of returning VM_FAULT_SIGBUS for an
error entry, return VM_FAULT_HWPOISON. For most cases this change doesn't
matter, e.g. a userspace program would receive a SIGBUS either way. But
for UFFDIO_POISON, this change will let KVM guests get an MCE out of the
box, instead of giving a SIGBUS to the hypervisor and requiring it to
somehow inject an MCE.

Finally, for hugetlbfs faults, handle PTE_MARKER_POISONED, and return
VM_FAULT_HWPOISON_LARGE in such cases. Note that this can't happen today
because the lack of swap support means we'll never end up with such a PTE
anyway, but this behavior will be needed once such entries *can* show up
via UFFDIO_POISON.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Axel Rasmussen <[email protected]>
Acked-by: Peter Xu <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gaosheng Cui <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jan Alexander Steffens (heftig) <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: ZhangPeng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v6.4, v6.4-rc7, v6.4-rc6
# 0cb8fd4d 09-Jun-2023 Hugh Dickins <[email protected]>

mm/migrate: remove cruft from migration_entry_wait()s

migration_entry_wait_on_locked() does not need to take a mapped pte
pointer, its callers can do the unmap first. Annotate it with
__releases(ptl) to reduce sparse warnings.

Fold __migration_entry_wait_huge() into migration_entry_wait_huge(). Fold
__migration_entry_wait() into migration_entry_wait(), preferring the
tighter pte_offset_map_lock() to pte_offset_map() and pte_lockptr().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Hellström <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zack Rusin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1
# fcd48540 16-Dec-2022 Peter Xu <[email protected]>

mm/hugetlb: move swap entry handling into vma lock when faulted

In hugetlb_fault(), there used to be a special path to handle swap entries
at the entrance using huge_pte_offset(). That's unsafe because
huge_pte_offset() for a pmd sharable range can access freed pgtables if
without any lock to protect the pgtable from being freed after pmd
unshare.

Here the simplest solution to make it safe is to move the swap handling to
be after the vma lock is held. We may need to take the fault mutex on
either migration or hwpoison entries now (also the vma lock, but that's
needed anyway), however neither of them is a hot path.

Note that the vma lock cannot be released in hugetlb_fault() when the
migration entry is detected, because in migration_entry_wait_huge() the
pgtable page will be used again (by taking the pgtable lock), so that also
need to be protected by the vma lock. Modify migration_entry_wait_huge()
so that it must be called with vma read lock held, and properly release
the lock in __migration_entry_wait_huge().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v6.1
# 630dc25e 05-Dec-2022 David Hildenbrand <[email protected]>

mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit

We use "unsigned long" to store a PFN in the kernel and phys_addr_t to
store a physical address.

On a 64bit system, both are 64bit wide. However, on a 32bit system, the
latter might be 64bit wide. This is, for example, the case on x86 with
PAE: phys_addr_t and PTEs are 64bit wide, while "unsigned long" only spans
32bit.

The current definition of SWP_PFN_BITS without MAX_PHYSMEM_BITS misses
that case, and assumes that the maximum PFN is limited by an 32bit
phys_addr_t. This implies, that SWP_PFN_BITS will currently only be able
to cover 4 GiB - 1 on any 32bit system with 4k page size, which is wrong.

Let's rely on the number of bits in phys_addr_t instead, but make sure to
not exceed the maximum swap offset, to not make the BUILD_BUG_ON() in
is_pfn_swap_entry() unhappy. Note that swp_entry_t is effectively an
unsigned long and the maximum swap offset shares that value with the swap
type.

For example, on an 8 GiB x86 PAE system with a kernel config based on
Debian 11.5 (-> CONFIG_FLATMEM=y, CONFIG_X86_PAE=y), we will currently
fail removing migration entries (remove_migration_ptes()), because
mm/page_vma_mapped.c:check_pte() will fail to identify a PFN match as
swp_offset_pfn() wrongly masks off PFN bits. For example,
split_huge_page_to_list()->...->remap_page() will leave migration entries
in place and continue to unlock the page.

Later, when we stumble over these migration entries (e.g., via
/proc/self/pagemap), pfn_swap_entry_to_page() will BUG_ON() because these
migration entries shouldn't exist anymore and the page was unlocked.

[ 33.067591] kernel BUG at include/linux/swapops.h:497!
[ 33.067597] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 33.067602] CPU: 3 PID: 742 Comm: cow Tainted: G E 6.1.0-rc8+ #16
[ 33.067605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[ 33.067606] EIP: pagemap_pmd_range+0x644/0x650
[ 33.067612] Code: 00 00 00 00 66 90 89 ce b9 00 f0 ff ff e9 ff fb ff ff 89 d8 31 db e8 48 c6 52 00 e9 23 fb ff ff e8 61 83 56 00 e9 b6 fe ff ff <0f> 0b bf 00 f0 ff ff e9 38 fa ff ff 3e 8d 74 26 00 55 89 e5 57 31
[ 33.067615] EAX: ee394000 EBX: 00000002 ECX: ee394000 EDX: 00000000
[ 33.067617] ESI: c1b0ded4 EDI: 00024a00 EBP: c1b0ddb4 ESP: c1b0dd68
[ 33.067619] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010246
[ 33.067624] CR0: 80050033 CR2: b7a00000 CR3: 01bbbd20 CR4: 00350ef0
[ 33.067625] Call Trace:
[ 33.067628] ? madvise_free_pte_range+0x720/0x720
[ 33.067632] ? smaps_pte_range+0x4b0/0x4b0
[ 33.067634] walk_pgd_range+0x325/0x720
[ 33.067637] ? mt_find+0x1d6/0x3a0
[ 33.067641] ? mt_find+0x1d6/0x3a0
[ 33.067643] __walk_page_range+0x164/0x170
[ 33.067646] walk_page_range+0xf9/0x170
[ 33.067648] ? __kmem_cache_alloc_node+0x2a8/0x340
[ 33.067653] pagemap_read+0x124/0x280
[ 33.067658] ? default_llseek+0x101/0x160
[ 33.067662] ? smaps_account+0x1d0/0x1d0
[ 33.067664] vfs_read+0x90/0x290
[ 33.067667] ? do_madvise.part.0+0x24b/0x390
[ 33.067669] ? debug_smp_processor_id+0x12/0x20
[ 33.067673] ksys_pread64+0x58/0x90
[ 33.067675] __ia32_sys_ia32_pread64+0x1b/0x20
[ 33.067680] __do_fast_syscall_32+0x4c/0xc0
[ 33.067683] do_fast_syscall_32+0x29/0x60
[ 33.067686] do_SYSENTER_32+0x15/0x20
[ 33.067689] entry_SYSENTER_32+0x98/0xf1

Decrease the indentation level of SWP_PFN_BITS and SWP_PFN_MASK to keep it
readable and consistent.

[[email protected]: rely on sizeof(phys_addr_t) and min_t() instead]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: use "int" for comparison, as we're only comparing numbers < 64]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Peter Xu <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
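
The resulting definition is roughly as below: derive the PFN width from phys_addr_t but clamp it to the maximum swap offset (a sketch based on the changelog, not necessarily the exact macros):

/* Clamp the PFN width to what a swap offset can hold. */
#define SWP_PFN_BITS	min_t(int, \
			      sizeof(phys_addr_t) * 8 - PAGE_SHIFT, \
			      SWP_TYPE_SHIFT)
#define SWP_PFN_MASK	(BIT(SWP_PFN_BITS) - 1)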



Revision tags: v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3
# 15520a3f 30-Oct-2022 Peter Xu <[email protected]>

mm: use pte markers for swap errors

PTE markers are an ideal mechanism for things like SWP_SWAPIN_ERROR. Using a
whole swap entry type for this purpose can be an overkill, especially if
we already have PTE markers. Define a new bit for swapin error and
replace it with pte markers. Then we can safely drop SWP_SWAPIN_ERROR and
give one device slot back to swap.

We used to have SWP_SWAPIN_ERROR taking the page pfn as part of the swap
entry, but it's never used. Neither do I see how it can be useful, because
normally the swapin failure should not be caused by a bad page but by a bad
swap device. Drop it as well.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Huang Ying <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# ca92ea3d 30-Oct-2022 Peter Xu <[email protected]>

mm: always compile in pte markers

Patch series "mm: Use pte marker for swapin errors".

This series uses the pte marker to replace the swapin error swap entry,
then we save one more swap entry slot for swap devices. A new pte marker
bit is defined.


This patch (of 2):

The PTE markers code is tiny and is now enabled in most distributions.
It's fine to keep it as-is, but to make broader use of it (e.g. replacing
the read error swap entry) it needs to be there always, otherwise we need a
special code path to take care of the !PTE_MARKER case.

It'll be easier to just make pte markers always exist. Use this chance to
extend its usage to anonymous too by simply touching up some of the old
comments, because it'll be used for anonymous pages in the follow up
patches.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Huang Ying <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# d027122d 24-Oct-2022 Naoya Horiguchi <[email protected]>

mm/hwpoison: move definitions of num_poisoned_pages_* to memory-failure.c

These interfaces will be used by drivers/base/memory.c in a later patch, so
as preparatory work move them to a more common header file visible to that
file.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1
# 5154e607 11-Aug-2022 Peter Xu <[email protected]>

mm/swap: cache swap migration A/D bits support

Introduce a variable swap_migration_ad_supported to cache whether the arch
supports swap migration A/D bits.

Here one thing to mention is that SWP_MIG_TOTAL_BITS will internally
reference the other macro MAX_PHYSMEM_BITS, which is a function call on
x86 (constant on all the rest of archs).

It's safe to reference it in swapfile_init() because by the time we reach
it we're already at initcall level 4, so we must have initialized the 5-level
pgtable for x86_64 (right after early_identify_cpu() finishes).

- start_kernel
- setup_arch
- early_cpu_init
- get_cpu_cap --> fetch from CPUID (including X86_FEATURE_LA57)
- early_identify_cpu --> clear X86_FEATURE_LA57 (if early lvl5 not enabled (USE_EARLY_PGTABLE_L5))
- arch_call_rest_init
- rest_init
- kernel_init
- kernel_init_freeable
- do_basic_setup
- do_initcalls --> calls swapfile_init() (initcall level 4)

This should slightly speed up the migration swap entry handlings.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# be45a490 11-Aug-2022 Peter Xu <[email protected]>

mm/swap: cache maximum swapfile size when init swap

We used to have swapfile_maximum_size() fetching a maximum value of
swapfile size per-arch.

As the callers of max_swapfile_size() grow, this patch introduces a
variable "swapfile_maximum_size" and caches the value of the old
max_swapfile_size(), so that we don't need to calculate the value every
time.

Caching the value in swapfile_init() is safe because by the time we reach
that phase we should have initialized all the relevant information. Here the
major arch to take care of is x86, which defines the max swapfile size
based on L1TF mitigation.

Here both X86_BUG_L1TF or l1tf_mitigation should have been setup properly
when reaching swapfile_init(). As a reference, the code path looks like
this for x86:

- start_kernel
- setup_arch
- early_cpu_init
- early_identify_cpu --> setup X86_BUG_L1TF
- parse_early_param
- l1tf_cmdline --> set l1tf_mitigation
- check_bugs
- l1tf_select_mitigation --> set l1tf_mitigation
- arch_call_rest_init
- rest_init
- kernel_init
- kernel_init_freeable
- do_basic_setup
- do_initcalls --> calls swapfile_init() (initcall level 4)

The swapfile size only depends on swp pte format on non-x86 archs, so
caching it is safe too.

While at it, rename max_swapfile_size() to arch_max_swapfile_size()
because an arch can define its own function, so it's more straightforward to
have "arch_" as its prefix. Meanwhile, export swapfile_maximum_size
to replace the old usages of max_swapfile_size().

[[email protected]: declare arch_max_swapfile_size) in swapfile.h]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# 2e346877 11-Aug-2022 Peter Xu <[email protected]>

mm: remember young/dirty bit for page migrations

When page migration happens, we always ignore the young/dirty bit settings
in the old pgtable, and marking the page as old in the new page table
using either pte_mkold() or pmd_mkold(), and keeping the pte clean.

That's fine functionally, but it's not friendly to page reclaim
because the page being moved can be actively accessed during the procedure.
Not to mention that hardware setting the young bit can bring quite some
overhead on some systems, e.g. x86_64 needs a few hundred nanoseconds to
set the bit. The same slowdown applies to the dirty bit when the memory is
first written after the page migration happened.

Actually we can easily remember the A/D bit configuration and recover the
information after the page is migrated. To achieve it, define a new set
of bits in the migration swap offset field to cache the A/D bits for old
pte. Then when removing/recovering the migration entry, we can recover
the A/D bits even if the page changed.

One thing to mention is that here we used max_swapfile_size() to detect
how many swp offset bits we have, and we'll only enable this feature if we
know the swp offset is big enough to store both the PFN value and the A/D
bits. Otherwise the A/D bits are dropped like before.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
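
Conceptually the scheme looks like the sketch below: two bits above the PFN field of a migration entry's swp offset cache the young/dirty state. Bit positions and helper names are assumptions for illustration:

#define SWP_MIG_YOUNG_BIT	(SWP_PFN_BITS)
#define SWP_MIG_DIRTY_BIT	(SWP_PFN_BITS + 1)

static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
{
	if (migration_entry_supports_ad())
		return swp_entry(swp_type(entry),
				 swp_offset(entry) | BIT(SWP_MIG_YOUNG_BIT));
	return entry;
}

static inline bool is_migration_entry_young(swp_entry_t entry)
{
	if (migration_entry_supports_ad())
		return swp_offset(entry) & BIT(SWP_MIG_YOUNG_BIT);
	/* Without A/D support, keep the old behaviour: treat the page as old. */
	return false;
}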



# 0d206b5d 11-Aug-2022 Peter Xu <[email protected]>

mm/swap: add swp_offset_pfn() to fetch PFN from swap entry

We've got a bunch of special swap entries that store a PFN inside the swap
offset field. To fetch the PFN, normally the user just calls
swp_offset() assuming that'll be the PFN.

Add a helper swp_offset_pfn() to fetch the PFN instead. It fetches only the
maximum possible length of a PFN on the host, and does a proper check
against MAX_PHYSMEM_BITS to make sure the swap offsets can actually store
PFNs properly, enforced by the BUILD_BUG_ON() in is_pfn_swap_entry().

One reason to do so is that we never tried to sanitize whether the swap
offset can really fit a PFN. Meanwhile, this patch also prepares us
for the future possibility of storing more information inside the swp
offset field, so assuming "swp_offset(entry)" to be the PFN will not hold
for much longer.

Replace many of the swp_offset() callers to use swp_offset_pfn() where
proper. Note that many of the existing users are not candidates for the
replacement, e.g.:

(1) When the swap entry is not a pfn swap entry at all, or,
(2) when we wanna keep the whole swp_offset but only change the swp type.

For the latter, it can happen when fork() triggered on a write-migration
swap entry pte, we may want to only change the migration type from
write->read but keep the rest, so it's not "fetching PFN" but "changing
swap type only". They're left aside so that when there're more
information within the swp offset they'll be carried over naturally in
those cases.

While at it, drop hwpoison_entry_to_pfn() because that's exactly what
the new swp_offset_pfn() is about.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
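
The helper itself boils down to masking the offset to the PFN width while insisting the entry really carries a PFN (a sketch reconstructed from the description; details may differ):

static inline unsigned long swp_offset_pfn(swp_entry_t entry)
{
	VM_BUG_ON(!is_pfn_swap_entry(entry));
	return swp_offset(entry) & SWP_PFN_MASK;
}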



# eba4d770 11-Aug-2022 Peter Xu <[email protected]>

mm/swap: comment all the ifdef in swapops.h

swapops.h contains quite a few layers of ifdefs; some of the "else" and
"endif" lines don't get a proper comment naming the macro, so it's hard to
follow what they refer to. Add the comments.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Suggested-by: Nadav Amit <[email protected]>
Reviewed-by: Huang Ying <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
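
For example, the commented form looks like the sketch below, where the point is the "/* CONFIG_MIGRATION */" annotations on #else and #endif (the bodies are simplified placeholders):

#ifdef CONFIG_MIGRATION
static inline int is_migration_entry(swp_entry_t entry)
{
	return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
			swp_type(entry) == SWP_MIGRATION_WRITE);
}
#else /* CONFIG_MIGRATION */
static inline int is_migration_entry(swp_entry_t entry)
{
	return 0;
}
#endif /* CONFIG_MIGRATION */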



# 21c9e90a 30-Aug-2022 Miaohe Lin <[email protected]>

mm, hwpoison: use num_poisoned_pages_sub() to decrease num_poisoned_pages

Use num_poisoned_pages_sub() to combine multiple atomic ops into one. Also
num_poisoned_pages_dec() can be killed as there's no caller now.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v5.19, v5.19-rc8, v5.19-rc7
# ac5fcde0 14-Jul-2022 Naoya Horiguchi <[email protected]>

mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage

The raw error info list needs to be removed when a hwpoisoned hugetlb page
is unpoisoned. And the unpoison handler needs to know how many errors there
are in the target hugepage. So add them.

HPageVmemmapOptimized(hpage) and HPageRawHwpUnreliable(hpage) pages sometimes
can't be unpoisoned, so skip them.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Naoya Horiguchi <[email protected]>
Reported-by: kernel test robot <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Liu Shixin <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1
# ad1ac596 30-May-2022 Miaohe Lin <[email protected]>

mm/migration: fix potential pte_unmap on an not mapped pte

__migration_entry_wait and migration_entry_wait_on_locked assume the pte is
always mapped by the caller. But this is not the case when it's called from
migration_entry_wait_huge and follow_huge_pmd. Add a hugetlbfs variant
that calls hugetlb_migration_entry_wait(ptep == NULL) to fix this issue.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 30dad30922cc ("mm: migration: add migrate_entry_wait_huge()")
Signed-off-by: Miaohe Lin <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Howells <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v5.18
# 9f186f9e 19-May-2022 Miaohe Lin <[email protected]>

mm/swapfile: unuse_pte can map random data if swap read fails

Patch series "A few fixup patches for mm", v4.

This series contains a few patches to avoid mapping random data if swap
read fails and fix lost swap bits in unuse_pte. Also we free hwpoison and
swapin error entry in madvise_free_pte_range and so on. More details can
be found in the respective changelogs.


This patch (of 5):

There is a bug in unuse_pte(): when a swap page happens to be unreadable,
a page filled with random data is mapped into the user address space. In case
of error, a special swap entry indicating swap read fails is set to the
page table. So the swapcache page can be freed and the user won't end up
with a permanently mounted swap because a sector is bad. And if the page
is accessed later, the user process will be killed so that corrupted data
is never consumed. On the other hand, if the page is never accessed, the
user won't even notice it.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Howells <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# a930c210 19-May-2022 Miaohe Lin <[email protected]>

mm/swap: fix the obsolete comment for SWP_TYPE_SHIFT

Since commit 3159f943aafd ("xarray: Replace exceptional entries"), there
is only one bit of 'type' that can be shifted up. Update the corresponding
comment.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: David Howells <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v5.18-rc7
# 1db9dbc2 13-May-2022 Peter Xu <[email protected]>

mm/uffd: PTE_MARKER_UFFD_WP

This patch introduces the 1st user of pte marker: the uffd-wp marker.

When the pte marker is installed with the uffd-wp bit set, it means this
pte was wr-protected by uffd.

We will use this special pte to arm the ptes that got either unmapped or
swapped out for a file-backed region that was previously wr-protected.
This special pte could trigger a page fault just like swap entries.

This idea is greatly inspired by Hugh and Andrea in the discussion, which
is referenced in the links below.

Some helpers are introduced to detect whether a swap pte is uffd
wr-protected. After the pte marker introduced, one swap pte can be
wr-protected in two forms: either it is a normal swap pte and it has
_PAGE_SWP_UFFD_WP set, or it's a pte marker that has PTE_MARKER_UFFD_WP
set.

[[email protected]: fixup]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Suggested-by: Andrea Arcangeli <[email protected]>
Suggested-by: Hugh Dickins <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nadav Amit <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
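
The two-form check described above can be sketched roughly like this; the helper name and exact structure are assumptions:

static inline bool pte_swp_uffd_wp_any(pte_t pte)
{
	if (!is_swap_pte(pte))
		return false;
	/* Form 1: a normal swap pte carrying _PAGE_SWP_UFFD_WP. */
	if (pte_swp_uffd_wp(pte))
		return true;
	/* Form 2: a pte marker with PTE_MARKER_UFFD_WP set. */
	if (pte_marker_uffd_wp(pte))
		return true;
	return false;
}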



# 679d1033 13-May-2022 Peter Xu <[email protected]>

mm: introduce PTE_MARKER swap entry

Patch series "userfaultfd-wp: Support shmem and hugetlbfs", v8.


Overview
========

Userfaultfd-wp anonymous support was merged two years ago. There are quite
a few applications that have started to leverage this capability, either to take
snapshots of user-app memory or to use it for fully user-controlled swapping.

This series tries to complete the feature for uffd-wp so as to cover all
the RAM-based memory types. So far uffd-wp is the only missing piece of
the rest features (uffd-missing & uffd-minor mode).

One major reason to do so is that anonymous pages sometimes do not satisfy
the needs of applications, and there are growing users of shmem and
hugetlbfs, either for sharing purposes (e.g., sharing guest memory between a
hypervisor process and a device emulation process, shmem local live
migration for upgrades) or for performance from TLB hits.

All these mean that if a uffd-wp app wants to switch to any of the memory
types, it'll stop working. I think it's worthwhile to have the kernel to
cover all these aspects.

This series chose to protect pages at pte level, not page level.

One major reason is safety. I have no idea how we could make it safe if
any of the uffd-privileged app can wr-protect a page that any other
application can use. It means this app can block any process potentially
for any time it wants.

The other reason is that it aligns very well with not only the anonymous
uffd-wp solution, but also uffd as a whole. For example, userfaultfd is
implemented fundamentally based on VMAs. We set flags on VMAs showing the
status of uffd tracking. A per-page based protection solution, by contrast,
would cross the VMA-based foundation line, and could simply be too far away
already from what's called userfaultfd.

PTE markers
===========

The patchset is based on the idea called PTE markers. It was discussed in
one of the mm alignment sessions, proposed starting from v6, and this is
the 2nd version of it using PTE marker idea.

A PTE marker is a new type of swap entry that is only applicable to file
backed memories like shmem and hugetlbfs. It's used to persist some
pte-level information even if the original present ptes in pgtable are
zapped.

Logically pte markers can store more than uffd-wp information, but so far
only one bit is used for uffd-wp purpose. When the pte marker is
installed with uffd-wp bit set, it means this pte is wr-protected by uffd.

It solves the problem on e.g. file-backed memory mapped ptes got zapped
due to any reason (e.g. thp split, or swapped out), we can still keep the
wr-protect information in the ptes. Then when the page fault triggers
again, we'll know this pte is wr-protected so we can treat the pte the
same as a normal uffd wr-protected pte.

The extra information is encoded into the swap entry, or swp_offset to be
explicit, with the swp_type being PTE_MARKER. So far uffd-wp only uses
one bit out of the swap entry, the rest bits of swp_offset are still
reserved for other purposes.

There're two configs to enable/disable PTE markers:

CONFIG_PTE_MARKER
CONFIG_PTE_MARKER_UFFD_WP

We can set !PTE_MARKER to completely disable all the PTE markers, along
with uffd-wp support. I made two config so we can also enable PTE marker
but disable uffd-wp file-backed for other purposes. At the end of current
series, I'll enable CONFIG_PTE_MARKER by default, but that patch is
standalone and if anyone worries about having it by default, we can also
consider turn it off by dropping that oneliner patch. So far I don't see
a huge risk of doing so, so I kept that patch.

In most cases, PTE markers should be treated as none ptes. That is because,
unlike most of the other swap entry types, there's no PFN or block
offset information encoded into PTE markers, only some extra well-defined
bits showing the status of the pte. These bits should only be used as
extra data when servicing an upcoming page fault, and then we behave as if
it's a none pte.

I did spend a lot of time observing all the pte_none() users this time.
It is indeed a challenge because there are a lot, and I hope I didn't miss
a single one of them where we should take care of pte markers. Luckily, I
don't think it'll need to be considered in many cases, for example: boot
code, arch code (especially non-x86), kernel-only page handlings (e.g.
CPA), or device driver codes when we're tackling with pure PFN mappings.

I introduced pte_none_mostly() in this series when we need to handle pte
markers the same as none pte, the "mostly" is the other way to write
"either none pte or a pte marker".

I didn't replace pte_none() to cover pte markers for below reasons:

- Very rare case of pte_none() callers will handle pte markers. E.g., all
the kernel pages do not require knowledge of pte markers. So we don't
pollute the major use cases.

- Unconditionally change pte_none() semantics could confuse people, because
pte_none() existed for so long a time.

- Unconditionally change pte_none() semantics could make pte_none() slower
even if in many cases pte markers do not exist.

- There're cases where we'd like to handle pte markers differently from
pte_none(), so a full replace is also impossible. E.g. khugepaged should
still treat pte markers as normal swap ptes rather than none ptes, because
pte markers will always need a fault-in to merge the marker with a valid
pte. Or the smap code will need to parse PTE markers not none ptes.

Patch Layout
============

Introducing PTE marker and uffd-wp bit in PTE marker:

mm: Introduce PTE_MARKER swap entry
mm: Teach core mm about pte markers
mm: Check against orig_pte for finish_fault()
mm/uffd: PTE_MARKER_UFFD_WP

Adding support for shmem uffd-wp:

mm/shmem: Take care of UFFDIO_COPY_MODE_WP
mm/shmem: Handle uffd-wp special pte in page fault handler
mm/shmem: Persist uffd-wp bit across zapping for file-backed
mm/shmem: Allow uffd wr-protect none pte for file-backed mem
mm/shmem: Allows file-back mem to be uffd wr-protected on thps
mm/shmem: Handle uffd-wp during fork()

Adding support for hugetlbfs uffd-wp:

mm/hugetlb: Introduce huge pte version of uffd-wp helpers
mm/hugetlb: Hook page faults for uffd write protection
mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
mm/hugetlb: Handle UFFDIO_WRITEPROTECT
mm/hugetlb: Handle pte markers in page faults
mm/hugetlb: Allow uffd wr-protect none ptes
mm/hugetlb: Only drop uffd-wp special pte if required
mm/hugetlb: Handle uffd-wp during fork()

Misc handling on the rest mm for uffd-wp file-backed:

mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

Enabling of uffd-wp on file-backed memory:

mm/uffd: Enable write protection for shmem & hugetlbfs
mm: Enable PTE markers by default
selftests/uffd: Enable uffd-wp for shmem/hugetlbfs

Tests
=====

- Compile test on x86_64 and aarch64 on different configs
- Kernel selftests
- uffd-test [0]
- Umapsort [1,2] test for shmem/hugetlb, with swap on/off

[0] https://github.com/xzpeter/clibs/tree/master/uffd-test
[1] https://github.com/xzpeter/umap-apps/tree/peter
[2] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs


This patch (of 23):

Introduces a new swap entry type called PTE_MARKER. It can be installed
for any pte that maps file-backed memory when the pte is temporarily
zapped, so as to maintain per-pte information.

The information kept in the pte is called a "marker". Here we define
the marker as "unsigned long" just to match pgoff_t, however it will only
work if it still fits in swp_offset(), which is e.g. currently 58 bits on
x86_64.

A new config CONFIG_PTE_MARKER is introduced too; it's by default off. A
bunch of helpers are defined altogether to service the rest of the pte
marker code.

[[email protected]: fixup]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
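
The pte_none_mostly() idea mentioned above boils down to something like the sketch below (a hedged reconstruction, not necessarily the exact code):

static inline bool is_pte_marker(pte_t pte)
{
	return is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte));
}

/* "Mostly none": either a real none pte or a pte marker. */
static inline int pte_none_mostly(pte_t pte)
{
	return pte_none(pte) || is_pte_marker(pte);
}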



# b304c6f0 10-May-2022 Hongchen Zhang <[email protected]>

mm/swapops: make is_pmd_migration_entry more strict

A pmd migration entry should first be a swap pmd, so use is_swap_pmd(pmd)
instead of !pmd_present(pmd).

On the other hand, some architecture (MIPS for example) may misjudge a
pmd_none entry as a pmd migration entry.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hongchen Zhang <[email protected]>
Acked-by: Peter Xu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
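
The stricter check described amounts to roughly the following sketch:

static inline int is_pmd_migration_entry(pmd_t pmd)
{
	/* Require a genuine swap pmd before decoding it as a swap entry. */
	return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
}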


