|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5 |
|
| #
3adaee78 |
| 25-Feb-2025 |
Sebastian Ott <[email protected]> |
KVM: arm64: Allow userspace to change the implementation ID registers
KVM's treatment of the ID registers that describe the implementation (MIDR, REVIDR, and AIDR) is interesting, to say the least.
KVM: arm64: Allow userspace to change the implementation ID registers
KVM's treatment of the ID registers that describe the implementation (MIDR, REVIDR, and AIDR) is interesting, to say the least. On the userspace-facing end of it, KVM presents the values of the boot CPU on all vCPUs and treats them as invariant. On the guest side of things KVM presents the hardware values of the local CPU, which can change during CPU migration in a big-little system.
While one may call this fragile, there is at least some degree of predictability around it. For example, if a VMM wanted to present big-little to a guest, it could affine vCPUs accordingly to the correct clusters.
All of this makes a giant mess out of adding support for making these implementation ID registers writable. Avoid breaking the rather subtle ABI around the old way of doing things by requiring opt-in from userspace to make the registers writable.
When the cap is enabled, allow userspace to set MIDR, REVIDR, and AIDR to any non-reserved value and present those values consistently across all vCPUs.
Signed-off-by: Sebastian Ott <[email protected]> [oliver: changelog, capability] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Oliver Upton <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1 |
|
| #
af5366be |
| 28-Nov-2024 |
Sean Christopherson <[email protected]> |
KVM: x86: Drop the now unused KVM_X86_DISABLE_VALID_EXITS
Drop the KVM_X86_DISABLE_VALID_EXITS definition, as it is misleading, and unused in KVM *because* it is misleading. The set of exits that c
KVM: x86: Drop the now unused KVM_X86_DISABLE_VALID_EXITS
Drop the KVM_X86_DISABLE_VALID_EXITS definition, as it is misleading, and unused in KVM *because* it is misleading. The set of exits that can be disabled is dynamic, i.e. userspace (and KVM) must check KVM's actual capabilities.
Suggested-by: Xiaoyao Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
show more ...
|
| #
915d2f07 |
| 28-Nov-2024 |
Sean Christopherson <[email protected]> |
KVM: Move KVM_REG_SIZE() definition to common uAPI header
Define KVM_REG_SIZE() in the common kvm.h header, and delete the arm64 and RISC-V versions. As evidenced by the surrounding definitions, al
KVM: Move KVM_REG_SIZE() definition to common uAPI header
Define KVM_REG_SIZE() in the common kvm.h header, and delete the arm64 and RISC-V versions. As evidenced by the surrounding definitions, all aspects of the register size encoding are generic, i.e. RISC-V should have moved arm64's definition to common code instead of copy+pasting.
Acked-by: Anup Patel <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Reviewed-by: Muhammad Usama Anjum <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
show more ...
|
|
Revision tags: v6.12 |
|
| #
e785dfac |
| 13-Nov-2024 |
Xianglai Li <[email protected]> |
LoongArch: KVM: Add PCHPIC device support
Add device model for PCHPIC interrupt controller, implemente basic create & destroy interface, and register device model to kvm device table.
Signed-off-by
LoongArch: KVM: Add PCHPIC device support
Add device model for PCHPIC interrupt controller, implemente basic create & destroy interface, and register device model to kvm device table.
Signed-off-by: Tianrui Zhao <[email protected]> Signed-off-by: Xianglai Li <[email protected]> Signed-off-by: Huacai Chen <[email protected]>
show more ...
|
| #
2e8b9df8 |
| 13-Nov-2024 |
Xianglai Li <[email protected]> |
LoongArch: KVM: Add EIOINTC device support
Add device model for EIOINTC interrupt controller, implement basic create & destroy interfaces, and register device model to kvm device table.
Signed-off-
LoongArch: KVM: Add EIOINTC device support
Add device model for EIOINTC interrupt controller, implement basic create & destroy interfaces, and register device model to kvm device table.
Signed-off-by: Tianrui Zhao <[email protected]> Signed-off-by: Xianglai Li <[email protected]> Signed-off-by: Huacai Chen <[email protected]>
show more ...
|
| #
c532de5a |
| 13-Nov-2024 |
Xianglai Li <[email protected]> |
LoongArch: KVM: Add IPI device support
Add device model for IPI interrupt controller, implement basic create & destroy interfaces, and register device model to kvm device table.
Signed-off-by: Tian
LoongArch: KVM: Add IPI device support
Add device model for IPI interrupt controller, implement basic create & destroy interfaces, and register device model to kvm device table.
Signed-off-by: Tianrui Zhao <[email protected]> Signed-off-by: Xianglai Li <[email protected]> Signed-off-by: Huacai Chen <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4 |
|
| #
bc1a5cd0 |
| 10-Apr-2024 |
Isaku Yamahata <[email protected]> |
KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory
Add a new ioctl KVM_PRE_FAULT_MEMORY in the KVM common code. It iterates on the memory range and calls the arch-specific functio
KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory
Add a new ioctl KVM_PRE_FAULT_MEMORY in the KVM common code. It iterates on the memory range and calls the arch-specific function. The implementation is optional and enabled by a Kconfig symbol.
Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Isaku Yamahata <[email protected]> Reviewed-by: Rick Edgecombe <[email protected]> Message-ID: <819322b8f25971f2b9933bfa4506e618508ad782.1712785629.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <[email protected]>
show more ...
|
| #
4b23e0c1 |
| 03-May-2024 |
David Matlack <[email protected]> |
KVM: Ensure new code that references immediate_exit gets extra scrutiny
Ensure that any new KVM code that references immediate_exit gets extra scrutiny by renaming it to immediate_exit__unsafe in ke
KVM: Ensure new code that references immediate_exit gets extra scrutiny
Ensure that any new KVM code that references immediate_exit gets extra scrutiny by renaming it to immediate_exit__unsafe in kernel code.
All fields in struct kvm_run are subject to TOCTOU races since they are mapped into userspace, which may be malicious or buggy. To protect KVM, introduces a new macro that appends __unsafe to select field names in struct kvm_run, hinting to developers and reviewers that accessing such fields must be done carefully.
Apply the new macro to immediate_exit, since userspace can make immediate_exit inconsistent with vcpu->wants_to_run, i.e. accessing immediate_exit directly could lead to unexpected bugs in the future.
Signed-off-by: David Matlack <[email protected]> Link: https://lore.kernel.org/r/[email protected] [sean: massage changelog] Signed-off-by: Sean Christopherson <[email protected]>
show more ...
|
| #
85542adb |
| 08-May-2024 |
Thomas Prescher <[email protected]> |
KVM: x86: Add KVM_RUN_X86_GUEST_MODE kvm_run flag
When a vCPU is interrupted by a signal while running a nested guest, KVM will exit to userspace with L2 state. However, userspace has no way to know
KVM: x86: Add KVM_RUN_X86_GUEST_MODE kvm_run flag
When a vCPU is interrupted by a signal while running a nested guest, KVM will exit to userspace with L2 state. However, userspace has no way to know whether it sees L1 or L2 state (besides calling KVM_GET_STATS_FD, which does not have a stable ABI).
This causes multiple problems:
The simplest one is L2 state corruption when userspace marks the sregs as dirty. See this mailing list thread [1] for a complete discussion.
Another problem is that if userspace decides to continue by emulating instructions, it will unknowingly emulate with L2 state as if L1 doesn't exist, which can be considered a weird guest escape.
Introduce a new flag, KVM_RUN_X86_GUEST_MODE, in the kvm_run data structure, which is set when the vCPU exited while running a nested guest. Also introduce a new capability, KVM_CAP_X86_GUEST_MODE, to advertise the functionality to userspace.
[1] https://lore.kernel.org/kvm/[email protected]/T/#m280aadcb2e10ae02c191a7dc4ed4b711a74b1f55
Signed-off-by: Thomas Prescher <[email protected]> Signed-off-by: Julian Stecklina <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
show more ...
|
| #
6fef5185 |
| 25-Apr-2024 |
Isaku Yamahata <[email protected]> |
KVM: x86: Add a capability to configure bus frequency for APIC timer
Add KVM_CAP_X86_APIC_BUS_CYCLES_NS capability to configure the APIC bus clock frequency for APIC timer emulation. Allow KVM_ENABL
KVM: x86: Add a capability to configure bus frequency for APIC timer
Add KVM_CAP_X86_APIC_BUS_CYCLES_NS capability to configure the APIC bus clock frequency for APIC timer emulation. Allow KVM_ENABLE_CAPABILITY(KVM_CAP_X86_APIC_BUS_CYCLES_NS) to set the frequency in nanoseconds. When using this capability, the user space VMM should configure CPUID leaf 0x15 to advertise the frequency.
Vishal reported that the TDX guest kernel expects a 25MHz APIC bus frequency but ends up getting interrupts at a significantly higher rate.
The TDX architecture hard-codes the core crystal clock frequency to 25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture does not allow the VMM to override the value.
In addition, per Intel SDM: "The APIC timer frequency will be the processor’s bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register."
The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded APIC bus frequency of 1GHz.
The KVM doesn't enumerate CPUID leaf 0x15 to the guest unless the user space VMM sets it using KVM_SET_CPUID. If the CPUID leaf 0x15 is enumerated, the guest kernel uses it as the APIC bus frequency. If not, the guest kernel measures the frequency based on other known timers like the ACPI timer or the legacy PIT. As reported by Vishal the TDX guest kernel expects a 25MHz timer frequency but gets timer interrupt more frequently due to the 1GHz frequency used by KVM.
To ensure that the guest doesn't have a conflicting view of the APIC bus frequency, allow the userspace to tell KVM to use the same frequency that TDX mandates instead of the default 1Ghz.
Reported-by: Vishal Annapurve <[email protected]> Closes: https://lore.kernel.org/lkml/[email protected] Signed-off-by: Isaku Yamahata <[email protected]> Reviewed-by: Rick Edgecombe <[email protected]> Co-developed-by: Reinette Chatre <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Reviewed-by: Yuan Yao <[email protected]> Link: https://lore.kernel.org/r/6748a4c12269e756f0c48680da8ccc5367c31ce7.1714081726.git.reinette.chatre@intel.com Signed-off-by: Sean Christopherson <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7 |
|
| #
651d61bc |
| 11-Apr-2023 |
Joel Stanley <[email protected]> |
KVM: PPC: Fix documentation for ppc mmu caps
The documentation mentions KVM_CAP_PPC_RADIX_MMU, but the defines in the kvm headers spell it KVM_CAP_PPC_MMU_RADIX. Similarly with KVM_CAP_PPC_MMU_HASH_
KVM: PPC: Fix documentation for ppc mmu caps
The documentation mentions KVM_CAP_PPC_RADIX_MMU, but the defines in the kvm headers spell it KVM_CAP_PPC_MMU_RADIX. Similarly with KVM_CAP_PPC_MMU_HASH_V3.
Fixes: c92701322711 ("KVM: PPC: Book3S HV: Add userspace interfaces for POWER9 MMU") Signed-off-by: Joel Stanley <[email protected]> Acked-by: Paul Mackerras <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
show more ...
|
| #
db7d6fbc |
| 11-Jan-2024 |
Paolo Bonzini <[email protected]> |
KVM: remove unnecessary #ifdef
KVM_CAP_IRQ_ROUTING is always defined, so there is no need to check if it is.
Signed-off-by: Paolo Bonzini <[email protected]>
|
| #
6bda055d |
| 11-Jan-2024 |
Paolo Bonzini <[email protected]> |
KVM: define __KVM_HAVE_GUEST_DEBUG unconditionally
Since all architectures (for historical reasons) have to define struct kvm_guest_debug_arch, and since userspace has to check KVM_CHECK_EXTENSION(K
KVM: define __KVM_HAVE_GUEST_DEBUG unconditionally
Since all architectures (for historical reasons) have to define struct kvm_guest_debug_arch, and since userspace has to check KVM_CHECK_EXTENSION(KVM_CAP_SET_GUEST_DEBUG) anyway, there is no advantage in masking the capability #define itself. Remove the #define __KVM_HAVE_GUEST_DEBUG from architecture-specific headers.
Signed-off-by: Paolo Bonzini <[email protected]>
show more ...
|
| #
5d9cb716 |
| 11-Jan-2024 |
Paolo Bonzini <[email protected]> |
KVM: arm64: move ARM-specific defines to uapi/asm/kvm.h
While this in principle breaks userspace code that mentions KVM_ARM_DEV_* on architectures other than aarch64, this seems unlikely to be a pro
KVM: arm64: move ARM-specific defines to uapi/asm/kvm.h
While this in principle breaks userspace code that mentions KVM_ARM_DEV_* on architectures other than aarch64, this seems unlikely to be a problem considering that run->s.regs.device_irq_level is only defined on that architecture.
Signed-off-by: Paolo Bonzini <[email protected]>
show more ...
|
| #
71cd774a |
| 11-Jan-2024 |
Paolo Bonzini <[email protected]> |
KVM: s390: move s390-specific structs to uapi/asm/kvm.h
While this in principle breaks the appearance of KVM_S390_* ioctls on architectures other than s390, this seems unlikely to be a problem consi
KVM: s390: move s390-specific structs to uapi/asm/kvm.h
While this in principle breaks the appearance of KVM_S390_* ioctls on architectures other than s390, this seems unlikely to be a problem considering that there are already many "struct kvm_s390_*" definitions in arch/s390/include/uapi.
Signed-off-by: Paolo Bonzini <[email protected]>
show more ...
|
| #
d750951c |
| 11-Jan-2024 |
Paolo Bonzini <[email protected]> |
KVM: powerpc: move powerpc-specific structs to uapi/asm/kvm.h
While this in principle breaks the appearance of KVM_PPC_* ioctls on architectures other than powerpc, this seems unlikely to be a probl
KVM: powerpc: move powerpc-specific structs to uapi/asm/kvm.h
While this in principle breaks the appearance of KVM_PPC_* ioctls on architectures other than powerpc, this seems unlikely to be a problem considering that there are already many "struct kvm_ppc_*" definitions in arch/powerpc/include/uapi.
Signed-off-by: Paolo Bonzini <[email protected]>
show more ...
|
| #
bcac0477 |
| 11-Jan-2024 |
Paolo Bonzini <[email protected]> |
KVM: x86: move x86-specific structs to uapi/asm/kvm.h
Several capabilities that exist only on x86 nevertheless have their structs defined in include/uapi/linux/kvm.h. Move them to arch/x86/include/
KVM: x86: move x86-specific structs to uapi/asm/kvm.h
Several capabilities that exist only on x86 nevertheless have their structs defined in include/uapi/linux/kvm.h. Move them to arch/x86/include/uapi/asm/kvm.h for cleanliness.
Signed-off-by: Paolo Bonzini <[email protected]>
show more ...
|
| #
c0a41190 |
| 31-Jan-2024 |
Paolo Bonzini <[email protected]> |
KVM: remove more traces of device assignment UAPI
Signed-off-by: Paolo Bonzini <[email protected]>
|
| #
a5d3df8a |
| 08-Nov-2023 |
Paolo Bonzini <[email protected]> |
KVM: remove deprecated UAPIs
The deprecated interfaces were removed 15 years ago. KVM's device assignment was deprecated in 4.2 and removed 6.5 years ago; the only interest might be in compiling an
KVM: remove deprecated UAPIs
The deprecated interfaces were removed 15 years ago. KVM's device assignment was deprecated in 4.2 and removed 6.5 years ago; the only interest might be in compiling ancient versions of QEMU, but QEMU has been using its own imported copy of the kernel headers since June 2011. So again we go into archaeology territory; just remove the cruft.
Signed-off-by: Paolo Bonzini <[email protected]>
show more ...
|
| #
6d722835 |
| 02-Nov-2023 |
Paul Durrant <[email protected]> |
KVM x86/xen: add an override for PVCLOCK_TSC_STABLE_BIT
Unless explicitly told to do so (by passing 'clocksource=tsc' and 'tsc=stable:socket', and then jumping through some hoops concerning potentia
KVM x86/xen: add an override for PVCLOCK_TSC_STABLE_BIT
Unless explicitly told to do so (by passing 'clocksource=tsc' and 'tsc=stable:socket', and then jumping through some hoops concerning potential CPU hotplug) Xen will never use TSC as its clocksource. Hence, by default, a Xen guest will not see PVCLOCK_TSC_STABLE_BIT set in either the primary or secondary pvclock memory areas. This has led to bugs in some guest kernels which only become evident if PVCLOCK_TSC_STABLE_BIT *is* set in the pvclocks. Hence, to support such guests, give the VMM a new Xen HVM config flag to tell KVM to forcibly clear the bit in the Xen pvclocks.
Signed-off-by: Paul Durrant <[email protected]> Reviewed-by: David Woodhouse <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
show more ...
|
| #
89ea60c2 |
| 27-Oct-2023 |
Sean Christopherson <[email protected]> |
KVM: x86: Add support for "protected VMs" that can utilize private memory
Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development and testing vehicle for Confidential (CoCo) VMs, a
KVM: x86: Add support for "protected VMs" that can utilize private memory
Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development and testing vehicle for Confidential (CoCo) VMs, and potentially to even become a "real" product in the distant future, e.g. a la pKVM.
The private memory support in KVM x86 is aimed at AMD's SEV-SNP and Intel's TDX, but those technologies are extremely complex (understatement), difficult to debug, don't support running as nested guests, and require hardware that's isn't universally accessible. I.e. relying SEV-SNP or TDX for maintaining guest private memory isn't a realistic option.
At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of selftests for guest_memfd and private memory support without requiring unique hardware.
Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Reviewed-by: Fuad Tabba <[email protected]> Tested-by: Fuad Tabba <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
show more ...
|
| #
8dd2eee9 |
| 27-Oct-2023 |
Chao Peng <[email protected]> |
KVM: x86/mmu: Handle page fault for private memory
Add support for resolving page faults on guest private memory for VMs that differentiate between "shared" and "private" memory. For such VMs, KVM_
KVM: x86/mmu: Handle page fault for private memory
Add support for resolving page faults on guest private memory for VMs that differentiate between "shared" and "private" memory. For such VMs, KVM_MEM_GUEST_MEMFD memslots can include both fd-based private memory and hva-based shared memory, and KVM needs to map in the "correct" variant, i.e. KVM needs to map the gfn shared/private as appropriate based on the current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.
For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request shared vs. private via a bit in the guest page tables, i.e. what the guest wants may conflict with the current memory attributes. To support such "implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT to forward the request to userspace. Add a new flag for memory faults, KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to map memory as shared vs. private.
Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to exit on missing mappings when handling guest page fault VM-Exits. In that case, userspace will want to know RWX information in order to correctly/precisely resolve the fault.
Note, private memory *must* be backed by guest_memfd, i.e. shared mappings always come from the host userspace page tables, and private mappings always come from a guest_memfd instance.
Co-developed-by: Yu Zhang <[email protected]> Signed-off-by: Yu Zhang <[email protected]> Signed-off-by: Chao Peng <[email protected]> Co-developed-by: Sean Christopherson <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Fuad Tabba <[email protected]> Tested-by: Fuad Tabba <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
show more ...
|
| #
a7800aa8 |
| 13-Nov-2023 |
Sean Christopherson <[email protected]> |
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based memory that is tied to a specific KVM virtual mac
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based memory that is tied to a specific KVM virtual machine and whose primary purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements that are kludgy or outright infeasible to implement/support in a generic memory subsystem. With guest_memfd, guest protections and mapping sizes are fully decoupled from host userspace mappings. E.g. KVM currently doesn't support mapping memory as writable in the guest without it also being writable in host userspace, as KVM's ABI uses VMA protections to define the allow guest protection. Userspace can fudge this by establishing two mappings, a writable mapping for the guest and readable one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size, e.g. KVM doesn’t support creating a 1GiB guest mapping unless userspace also has a 1GiB guest mapping. Decoupling the mappings sizes would allow userspace to precisely map only what is needed without impacting guest performance, e.g. to harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner alternative to high-granularity mappings for HugeTLB, which has reached a bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping said memory into the host is critical for Confidential VMs (CoCo VMs), the initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent untrusted software from reading guest private data by encrypting guest memory with a key that isn't usable by the untrusted host, projects such as Protected KVM (pKVM) provide confidentiality and integrity *without* relying on memory encryption. And with SEV-SNP and TDX, accessing guest private memory can be fatal to the host, i.e. KVM must be prevent host userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as being mappable only by KVM (or a similarly enlightened kernel subsystem). That approach was abandoned largely due to it needing to play games with PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping guest private memory into userspace, but that approach failed to meet several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel wouldn't easily be able to enforce a 1:1 page:guest association, let alone a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory that isn't backed by 'struct page', e.g. if devices gain support for exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide dedicated file-based guest memory. That approach made it as far as v10 before feedback from Hugh Dickins and Christian Brauner (and others) led to it demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use case as KVM didn't actually *want* the features provided by shmem. I.e. KVM was using memfd() and shmem to avoid having to manage memory directly, not because memfd() and shmem were the optimal solution, e.g. things like read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping only _some_ of shmem), e.g. poking at inode_operations or super_operations would show shmem stuff, but address_space_operations and file_operations would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/[email protected] Link: https://lore.kernel.org/all/[email protected] Link: https://lore.kernel.org/all/[email protected] Link: https://lore.kernel.org/all/[email protected] Link: https://lore.kernel.org/all/[email protected] Link: https://lore.kernel.org/all/[email protected] Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner Link: https://lore.kernel.org/all/[email protected] Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey Link: https://lore.kernel.org/linux-mm/[email protected] Cc: Fuad Tabba <[email protected]> Cc: Vishal Annapurve <[email protected]> Cc: Ackerley Tng <[email protected]> Cc: Jarkko Sakkinen <[email protected]> Cc: Maciej Szmigiero <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Quentin Perret <[email protected]> Cc: Michael Roth <[email protected]> Cc: Wang <[email protected]> Cc: Liam Merwick <[email protected]> Cc: Isaku Yamahata <[email protected]> Co-developed-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Co-developed-by: Yu Zhang <[email protected]> Signed-off-by: Yu Zhang <[email protected]> Co-developed-by: Chao Peng <[email protected]> Signed-off-by: Chao Peng <[email protected]> Co-developed-by: Ackerley Tng <[email protected]> Signed-off-by: Ackerley Tng <[email protected]> Co-developed-by: Isaku Yamahata <[email protected]> Signed-off-by: Isaku Yamahata <[email protected]> Co-developed-by: Paolo Bonzini <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Co-developed-by: Michael Roth <[email protected]> Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Reviewed-by: Fuad Tabba <[email protected]> Tested-by: Fuad Tabba <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
show more ...
|
| #
5a475554 |
| 27-Oct-2023 |
Chao Peng <[email protected]> |
KVM: Introduce per-page memory attributes
In confidential computing usages, whether a page is private or shared is necessary information for KVM to perform operations like page fault handling, page
KVM: Introduce per-page memory attributes
In confidential computing usages, whether a page is private or shared is necessary information for KVM to perform operations like page fault handling, page zapping etc. There are other potential use cases for per-page memory attributes, e.g. to make memory read-only (or no-exec, or exec-only, etc.) without having to modify memslots.
Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory attributes to a guest memory range.
Use an xarray to store the per-page attributes internally, with a naive, not fully optimized implementation, i.e. prioritize correctness over performance for the initial implementation.
Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX attributes/protections in the future, e.g. to give userspace fine-grained control over read, write, and execute protections for guest memory.
Provide arch hooks for handling attribute changes before and after common code sets the new attributes, e.g. x86 will use the "pre" hook to zap all relevant mappings, and the "post" hook to track whether or not hugepages can be used to map the range.
To simplify the implementation wrap the entire sequence with kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly guaranteed to be an invalidation. For the initial use case, x86 *will* always invalidate memory, and preventing arch code from creating new mappings while the attributes are in flux makes it much easier to reason about the correctness of consuming attributes.
It's possible that future usages may not require an invalidation, e.g. if KVM ends up supporting RWX protections and userspace grants _more_ protections, but again opt for simplicity and punt optimizations to if/when they are needed.
Suggested-by: Sean Christopherson <[email protected]> Link: https://lore.kernel.org/all/[email protected] Cc: Fuad Tabba <[email protected]> Cc: Xu Yilun <[email protected]> Cc: Mickaël Salaün <[email protected]> Signed-off-by: Chao Peng <[email protected]> Co-developed-by: Sean Christopherson <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
show more ...
|
| #
16f95f3b |
| 27-Oct-2023 |
Chao Peng <[email protected]> |
KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
Add a new KVM exit type to allow userspace to handle memory faults that KVM cannot resolve, but that userspace *may* be able to hand
KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
Add a new KVM exit type to allow userspace to handle memory faults that KVM cannot resolve, but that userspace *may* be able to handle (without terminating the guest).
KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit conversions between private and shared memory. With guest private memory, there will be two kind of memory conversions:
- explicit conversion: happens when the guest explicitly calls into KVM to map a range (as private or shared)
- implicit conversion: happens when the guest attempts to access a gfn that is configured in the "wrong" state (private vs. shared)
On x86 (first architecture to support guest private memory), explicit conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE, but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesriable as there is (obviously) no hypercall, and there is no guarantee that the guest actually intends to convert between private and shared, i.e. what KVM thinks is an implicit conversion "request" could actually be the result of a guest code bug.
KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to be implicit conversions.
Note! To allow for future possibilities where KVM reports KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's perspective), not '0'! Due to historical baggage within KVM, exiting to userspace with '0' from deep callstacks, e.g. in emulation paths, is infeasible as doing so would require a near-complete overhaul of KVM, whereas KVM already propagates -errno return codes to userspace even when the -errno originated in a low level helper.
Report the gpa+size instead of a single gfn even though the initial usage is expected to always report single pages. It's entirely possible, likely even, that KVM will someday support sub-page granularity faults, e.g. Intel's sub-page protection feature allows for additional protections at 128-byte granularity.
Link: https://lore.kernel.org/all/[email protected] Link: https://lore.kernel.org/all/[email protected] Cc: Anish Moorthy <[email protected]> Cc: David Matlack <[email protected]> Suggested-by: Sean Christopherson <[email protected]> Co-developed-by: Yu Zhang <[email protected]> Signed-off-by: Yu Zhang <[email protected]> Signed-off-by: Chao Peng <[email protected]> Co-developed-by: Sean Christopherson <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Reviewed-by: Fuad Tabba <[email protected]> Tested-by: Fuad Tabba <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
show more ...
|