History log of /linux-6.15/include/uapi/linux/kvm.h (Results 1 – 25 of 304)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5
# 3adaee78 25-Feb-2025 Sebastian Ott <[email protected]>

KVM: arm64: Allow userspace to change the implementation ID registers

KVM's treatment of the ID registers that describe the implementation
(MIDR, REVIDR, and AIDR) is interesting, to say the least.

KVM: arm64: Allow userspace to change the implementation ID registers

KVM's treatment of the ID registers that describe the implementation
(MIDR, REVIDR, and AIDR) is interesting, to say the least. On the
userspace-facing end of it, KVM presents the values of the boot CPU on
all vCPUs and treats them as invariant. On the guest side of things KVM
presents the hardware values of the local CPU, which can change during
CPU migration in a big-little system.

While one may call this fragile, there is at least some degree of
predictability around it. For example, if a VMM wanted to present
big-little to a guest, it could affine vCPUs accordingly to the correct
clusters.

All of this makes a giant mess out of adding support for making these
implementation ID registers writable. Avoid breaking the rather subtle
ABI around the old way of doing things by requiring opt-in from
userspace to make the registers writable.

When the cap is enabled, allow userspace to set MIDR, REVIDR, and AIDR
to any non-reserved value and present those values consistently across
all vCPUs.

Signed-off-by: Sebastian Ott <[email protected]>
[oliver: changelog, capability]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>

show more ...


Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1
# af5366be 28-Nov-2024 Sean Christopherson <[email protected]>

KVM: x86: Drop the now unused KVM_X86_DISABLE_VALID_EXITS

Drop the KVM_X86_DISABLE_VALID_EXITS definition, as it is misleading, and
unused in KVM *because* it is misleading. The set of exits that c

KVM: x86: Drop the now unused KVM_X86_DISABLE_VALID_EXITS

Drop the KVM_X86_DISABLE_VALID_EXITS definition, as it is misleading, and
unused in KVM *because* it is misleading. The set of exits that can be
disabled is dynamic, i.e. userspace (and KVM) must check KVM's actual
capabilities.

Suggested-by: Xiaoyao Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

show more ...


# 915d2f07 28-Nov-2024 Sean Christopherson <[email protected]>

KVM: Move KVM_REG_SIZE() definition to common uAPI header

Define KVM_REG_SIZE() in the common kvm.h header, and delete the arm64 and
RISC-V versions. As evidenced by the surrounding definitions, al

KVM: Move KVM_REG_SIZE() definition to common uAPI header

Define KVM_REG_SIZE() in the common kvm.h header, and delete the arm64 and
RISC-V versions. As evidenced by the surrounding definitions, all aspects
of the register size encoding are generic, i.e. RISC-V should have moved
arm64's definition to common code instead of copy+pasting.

Acked-by: Anup Patel <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Reviewed-by: Muhammad Usama Anjum <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

show more ...


Revision tags: v6.12
# e785dfac 13-Nov-2024 Xianglai Li <[email protected]>

LoongArch: KVM: Add PCHPIC device support

Add device model for PCHPIC interrupt controller, implemente basic
create & destroy interface, and register device model to kvm device
table.

Signed-off-by

LoongArch: KVM: Add PCHPIC device support

Add device model for PCHPIC interrupt controller, implemente basic
create & destroy interface, and register device model to kvm device
table.

Signed-off-by: Tianrui Zhao <[email protected]>
Signed-off-by: Xianglai Li <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>

show more ...


# 2e8b9df8 13-Nov-2024 Xianglai Li <[email protected]>

LoongArch: KVM: Add EIOINTC device support

Add device model for EIOINTC interrupt controller, implement basic
create & destroy interfaces, and register device model to kvm device
table.

Signed-off-

LoongArch: KVM: Add EIOINTC device support

Add device model for EIOINTC interrupt controller, implement basic
create & destroy interfaces, and register device model to kvm device
table.

Signed-off-by: Tianrui Zhao <[email protected]>
Signed-off-by: Xianglai Li <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>

show more ...


# c532de5a 13-Nov-2024 Xianglai Li <[email protected]>

LoongArch: KVM: Add IPI device support

Add device model for IPI interrupt controller, implement basic create &
destroy interfaces, and register device model to kvm device table.

Signed-off-by: Tian

LoongArch: KVM: Add IPI device support

Add device model for IPI interrupt controller, implement basic create &
destroy interfaces, and register device model to kvm device table.

Signed-off-by: Tianrui Zhao <[email protected]>
Signed-off-by: Xianglai Li <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>

show more ...


Revision tags: v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4
# bc1a5cd0 10-Apr-2024 Isaku Yamahata <[email protected]>

KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory

Add a new ioctl KVM_PRE_FAULT_MEMORY in the KVM common code. It iterates on the
memory range and calls the arch-specific functio

KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory

Add a new ioctl KVM_PRE_FAULT_MEMORY in the KVM common code. It iterates on the
memory range and calls the arch-specific function. The implementation is
optional and enabled by a Kconfig symbol.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Message-ID: <819322b8f25971f2b9933bfa4506e618508ad782.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>

show more ...


# 4b23e0c1 03-May-2024 David Matlack <[email protected]>

KVM: Ensure new code that references immediate_exit gets extra scrutiny

Ensure that any new KVM code that references immediate_exit gets extra
scrutiny by renaming it to immediate_exit__unsafe in ke

KVM: Ensure new code that references immediate_exit gets extra scrutiny

Ensure that any new KVM code that references immediate_exit gets extra
scrutiny by renaming it to immediate_exit__unsafe in kernel code.

All fields in struct kvm_run are subject to TOCTOU races since they are
mapped into userspace, which may be malicious or buggy. To protect KVM,
introduces a new macro that appends __unsafe to select field names in
struct kvm_run, hinting to developers and reviewers that accessing such
fields must be done carefully.

Apply the new macro to immediate_exit, since userspace can make
immediate_exit inconsistent with vcpu->wants_to_run, i.e. accessing
immediate_exit directly could lead to unexpected bugs in the future.

Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: massage changelog]
Signed-off-by: Sean Christopherson <[email protected]>

show more ...


# 85542adb 08-May-2024 Thomas Prescher <[email protected]>

KVM: x86: Add KVM_RUN_X86_GUEST_MODE kvm_run flag

When a vCPU is interrupted by a signal while running a nested guest,
KVM will exit to userspace with L2 state. However, userspace has no
way to know

KVM: x86: Add KVM_RUN_X86_GUEST_MODE kvm_run flag

When a vCPU is interrupted by a signal while running a nested guest,
KVM will exit to userspace with L2 state. However, userspace has no
way to know whether it sees L1 or L2 state (besides calling
KVM_GET_STATS_FD, which does not have a stable ABI).

This causes multiple problems:

The simplest one is L2 state corruption when userspace marks the sregs
as dirty. See this mailing list thread [1] for a complete discussion.

Another problem is that if userspace decides to continue by emulating
instructions, it will unknowingly emulate with L2 state as if L1
doesn't exist, which can be considered a weird guest escape.

Introduce a new flag, KVM_RUN_X86_GUEST_MODE, in the kvm_run data
structure, which is set when the vCPU exited while running a nested
guest. Also introduce a new capability, KVM_CAP_X86_GUEST_MODE, to
advertise the functionality to userspace.

[1] https://lore.kernel.org/kvm/[email protected]/T/#m280aadcb2e10ae02c191a7dc4ed4b711a74b1f55

Signed-off-by: Thomas Prescher <[email protected]>
Signed-off-by: Julian Stecklina <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

show more ...


# 6fef5185 25-Apr-2024 Isaku Yamahata <[email protected]>

KVM: x86: Add a capability to configure bus frequency for APIC timer

Add KVM_CAP_X86_APIC_BUS_CYCLES_NS capability to configure the APIC
bus clock frequency for APIC timer emulation.
Allow KVM_ENABL

KVM: x86: Add a capability to configure bus frequency for APIC timer

Add KVM_CAP_X86_APIC_BUS_CYCLES_NS capability to configure the APIC
bus clock frequency for APIC timer emulation.
Allow KVM_ENABLE_CAPABILITY(KVM_CAP_X86_APIC_BUS_CYCLES_NS) to set the
frequency in nanoseconds. When using this capability, the user space
VMM should configure CPUID leaf 0x15 to advertise the frequency.

Vishal reported that the TDX guest kernel expects a 25MHz APIC bus
frequency but ends up getting interrupts at a significantly higher rate.

The TDX architecture hard-codes the core crystal clock frequency to
25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture
does not allow the VMM to override the value.

In addition, per Intel SDM:
"The APIC timer frequency will be the processor’s bus clock or core
crystal clock frequency (when TSC/core crystal clock ratio is
enumerated in CPUID leaf 0x15) divided by the value specified in
the divide configuration register."

The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded
APIC bus frequency of 1GHz.

The KVM doesn't enumerate CPUID leaf 0x15 to the guest unless the user
space VMM sets it using KVM_SET_CPUID. If the CPUID leaf 0x15 is
enumerated, the guest kernel uses it as the APIC bus frequency. If not,
the guest kernel measures the frequency based on other known timers like
the ACPI timer or the legacy PIT. As reported by Vishal the TDX guest
kernel expects a 25MHz timer frequency but gets timer interrupt more
frequently due to the 1GHz frequency used by KVM.

To ensure that the guest doesn't have a conflicting view of the APIC bus
frequency, allow the userspace to tell KVM to use the same frequency that
TDX mandates instead of the default 1Ghz.

Reported-by: Vishal Annapurve <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Co-developed-by: Reinette Chatre <[email protected]>
Signed-off-by: Reinette Chatre <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Reviewed-by: Yuan Yao <[email protected]>
Link: https://lore.kernel.org/r/6748a4c12269e756f0c48680da8ccc5367c31ce7.1714081726.git.reinette.chatre@intel.com
Signed-off-by: Sean Christopherson <[email protected]>

show more ...


Revision tags: v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7
# 651d61bc 11-Apr-2023 Joel Stanley <[email protected]>

KVM: PPC: Fix documentation for ppc mmu caps

The documentation mentions KVM_CAP_PPC_RADIX_MMU, but the defines in the
kvm headers spell it KVM_CAP_PPC_MMU_RADIX. Similarly with
KVM_CAP_PPC_MMU_HASH_

KVM: PPC: Fix documentation for ppc mmu caps

The documentation mentions KVM_CAP_PPC_RADIX_MMU, but the defines in the
kvm headers spell it KVM_CAP_PPC_MMU_RADIX. Similarly with
KVM_CAP_PPC_MMU_HASH_V3.

Fixes: c92701322711 ("KVM: PPC: Book3S HV: Add userspace interfaces for POWER9 MMU")
Signed-off-by: Joel Stanley <[email protected]>
Acked-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]

show more ...


# db7d6fbc 11-Jan-2024 Paolo Bonzini <[email protected]>

KVM: remove unnecessary #ifdef

KVM_CAP_IRQ_ROUTING is always defined, so there is no need to check if it is.

Signed-off-by: Paolo Bonzini <[email protected]>


# 6bda055d 11-Jan-2024 Paolo Bonzini <[email protected]>

KVM: define __KVM_HAVE_GUEST_DEBUG unconditionally

Since all architectures (for historical reasons) have to define
struct kvm_guest_debug_arch, and since userspace has to check
KVM_CHECK_EXTENSION(K

KVM: define __KVM_HAVE_GUEST_DEBUG unconditionally

Since all architectures (for historical reasons) have to define
struct kvm_guest_debug_arch, and since userspace has to check
KVM_CHECK_EXTENSION(KVM_CAP_SET_GUEST_DEBUG) anyway, there is
no advantage in masking the capability #define itself. Remove
the #define __KVM_HAVE_GUEST_DEBUG from architecture-specific
headers.

Signed-off-by: Paolo Bonzini <[email protected]>

show more ...


# 5d9cb716 11-Jan-2024 Paolo Bonzini <[email protected]>

KVM: arm64: move ARM-specific defines to uapi/asm/kvm.h

While this in principle breaks userspace code that mentions KVM_ARM_DEV_*
on architectures other than aarch64, this seems unlikely to be
a pro

KVM: arm64: move ARM-specific defines to uapi/asm/kvm.h

While this in principle breaks userspace code that mentions KVM_ARM_DEV_*
on architectures other than aarch64, this seems unlikely to be
a problem considering that run->s.regs.device_irq_level is only
defined on that architecture.

Signed-off-by: Paolo Bonzini <[email protected]>

show more ...


# 71cd774a 11-Jan-2024 Paolo Bonzini <[email protected]>

KVM: s390: move s390-specific structs to uapi/asm/kvm.h

While this in principle breaks the appearance of KVM_S390_* ioctls on architectures
other than s390, this seems unlikely to be a problem consi

KVM: s390: move s390-specific structs to uapi/asm/kvm.h

While this in principle breaks the appearance of KVM_S390_* ioctls on architectures
other than s390, this seems unlikely to be a problem considering that there are
already many "struct kvm_s390_*" definitions in arch/s390/include/uapi.

Signed-off-by: Paolo Bonzini <[email protected]>

show more ...


# d750951c 11-Jan-2024 Paolo Bonzini <[email protected]>

KVM: powerpc: move powerpc-specific structs to uapi/asm/kvm.h

While this in principle breaks the appearance of KVM_PPC_* ioctls on architectures
other than powerpc, this seems unlikely to be a probl

KVM: powerpc: move powerpc-specific structs to uapi/asm/kvm.h

While this in principle breaks the appearance of KVM_PPC_* ioctls on architectures
other than powerpc, this seems unlikely to be a problem considering that there are
already many "struct kvm_ppc_*" definitions in arch/powerpc/include/uapi.

Signed-off-by: Paolo Bonzini <[email protected]>

show more ...


# bcac0477 11-Jan-2024 Paolo Bonzini <[email protected]>

KVM: x86: move x86-specific structs to uapi/asm/kvm.h

Several capabilities that exist only on x86 nevertheless have their
structs defined in include/uapi/linux/kvm.h. Move them to
arch/x86/include/

KVM: x86: move x86-specific structs to uapi/asm/kvm.h

Several capabilities that exist only on x86 nevertheless have their
structs defined in include/uapi/linux/kvm.h. Move them to
arch/x86/include/uapi/asm/kvm.h for cleanliness.

Signed-off-by: Paolo Bonzini <[email protected]>

show more ...


# c0a41190 31-Jan-2024 Paolo Bonzini <[email protected]>

KVM: remove more traces of device assignment UAPI

Signed-off-by: Paolo Bonzini <[email protected]>


# a5d3df8a 08-Nov-2023 Paolo Bonzini <[email protected]>

KVM: remove deprecated UAPIs

The deprecated interfaces were removed 15 years ago. KVM's
device assignment was deprecated in 4.2 and removed 6.5 years
ago; the only interest might be in compiling an

KVM: remove deprecated UAPIs

The deprecated interfaces were removed 15 years ago. KVM's
device assignment was deprecated in 4.2 and removed 6.5 years
ago; the only interest might be in compiling ancient versions
of QEMU, but QEMU has been using its own imported copy of the
kernel headers since June 2011. So again we go into archaeology
territory; just remove the cruft.

Signed-off-by: Paolo Bonzini <[email protected]>

show more ...


# 6d722835 02-Nov-2023 Paul Durrant <[email protected]>

KVM x86/xen: add an override for PVCLOCK_TSC_STABLE_BIT

Unless explicitly told to do so (by passing 'clocksource=tsc' and
'tsc=stable:socket', and then jumping through some hoops concerning
potentia

KVM x86/xen: add an override for PVCLOCK_TSC_STABLE_BIT

Unless explicitly told to do so (by passing 'clocksource=tsc' and
'tsc=stable:socket', and then jumping through some hoops concerning
potential CPU hotplug) Xen will never use TSC as its clocksource.
Hence, by default, a Xen guest will not see PVCLOCK_TSC_STABLE_BIT set
in either the primary or secondary pvclock memory areas. This has
led to bugs in some guest kernels which only become evident if
PVCLOCK_TSC_STABLE_BIT *is* set in the pvclocks. Hence, to support
such guests, give the VMM a new Xen HVM config flag to tell KVM to
forcibly clear the bit in the Xen pvclocks.

Signed-off-by: Paul Durrant <[email protected]>
Reviewed-by: David Woodhouse <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

show more ...


# 89ea60c2 27-Oct-2023 Sean Christopherson <[email protected]>

KVM: x86: Add support for "protected VMs" that can utilize private memory

Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, a

KVM: x86: Add support for "protected VMs" that can utilize private memory

Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.

The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex (understatement),
difficult to debug, don't support running as nested guests, and require
hardware that's isn't universally accessible. I.e. relying SEV-SNP or TDX
for maintaining guest private memory isn't a realistic option.

At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Fuad Tabba <[email protected]>
Tested-by: Fuad Tabba <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

show more ...


# 8dd2eee9 27-Oct-2023 Chao Peng <[email protected]>

KVM: x86/mmu: Handle page fault for private memory

Add support for resolving page faults on guest private memory for VMs
that differentiate between "shared" and "private" memory. For such VMs,
KVM_

KVM: x86/mmu: Handle page fault for private memory

Add support for resolving page faults on guest private memory for VMs
that differentiate between "shared" and "private" memory. For such VMs,
KVM_MEM_GUEST_MEMFD memslots can include both fd-based private memory and
hva-based shared memory, and KVM needs to map in the "correct" variant,
i.e. KVM needs to map the gfn shared/private as appropriate based on the
current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.

For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
shared vs. private via a bit in the guest page tables, i.e. what the guest
wants may conflict with the current memory attributes. To support such
"implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
to forward the request to userspace. Add a new flag for memory faults,
KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
map memory as shared vs. private.

Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
exit on missing mappings when handling guest page fault VM-Exits. In
that case, userspace will want to know RWX information in order to
correctly/precisely resolve the fault.

Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
always come from the host userspace page tables, and private mappings
always come from a guest_memfd instance.

Co-developed-by: Yu Zhang <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Fuad Tabba <[email protected]>
Tested-by: Fuad Tabba <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

show more ...


# a7800aa8 13-Nov-2023 Sean Christopherson <[email protected]>

KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual mac

KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.

A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.

Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.

Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.

A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).

More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.

Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.

Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.

Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.

Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.

Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.

Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/[email protected]
Cc: Fuad Tabba <[email protected]>
Cc: Vishal Annapurve <[email protected]>
Cc: Ackerley Tng <[email protected]>
Cc: Jarkko Sakkinen <[email protected]>
Cc: Maciej Szmigiero <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Michael Roth <[email protected]>
Cc: Wang <[email protected]>
Cc: Liam Merwick <[email protected]>
Cc: Isaku Yamahata <[email protected]>
Co-developed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Co-developed-by: Yu Zhang <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Co-developed-by: Chao Peng <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
Co-developed-by: Ackerley Tng <[email protected]>
Signed-off-by: Ackerley Tng <[email protected]>
Co-developed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Co-developed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Fuad Tabba <[email protected]>
Tested-by: Fuad Tabba <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

show more ...


# 5a475554 27-Oct-2023 Chao Peng <[email protected]>

KVM: Introduce per-page memory attributes

In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page

KVM: Introduce per-page memory attributes

In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.

Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.

Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.

Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.

To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.

It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.

Suggested-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Cc: Fuad Tabba <[email protected]>
Cc: Xu Yilun <[email protected]>
Cc: Mickaël Salaün <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

show more ...


# 16f95f3b 27-Oct-2023 Chao Peng <[email protected]>

KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to hand

KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).

KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory. With guest private memory,
there will be two kind of memory conversions:

- explicit conversion: happens when the guest explicitly calls into KVM
to map a range (as private or shared)

- implicit conversion: happens when the guest attempts to access a gfn
that is configured in the "wrong" state (private vs. shared)

On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesriable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.

KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.

Note! To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'! Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.

Report the gpa+size instead of a single gfn even though the initial usage
is expected to always report single pages. It's entirely possible, likely
even, that KVM will someday support sub-page granularity faults, e.g.
Intel's sub-page protection feature allows for additional protections at
128-byte granularity.

Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Cc: Anish Moorthy <[email protected]>
Cc: David Matlack <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Co-developed-by: Yu Zhang <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Fuad Tabba <[email protected]>
Tested-by: Fuad Tabba <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

show more ...


12345678910>>...13