|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7 |
|
| #
ec2d0c04 |
| 11-Mar-2025 |
Thomas Gleixner <[email protected]> |
posix-timers: Provide a mechanism to allocate a given timer ID
Checkpoint/Restore in Userspace (CRIU) requires to reconstruct posix timers with the same timer ID on restore. It uses sys_timer_create
posix-timers: Provide a mechanism to allocate a given timer ID
Checkpoint/Restore in Userspace (CRIU) requires to reconstruct posix timers with the same timer ID on restore. It uses sys_timer_create() and relies on the monotonic increasing timer ID provided by this syscall. It creates and deletes timers until the desired ID is reached. This is can loop for a long time, when the checkpointed process had a very sparse timer ID range.
It has been debated to implement a new syscall to allow the creation of timers with a given timer ID, but that's tideous due to the 32/64bit compat issues of sigevent_t and of dubious value.
The restore mechanism of CRIU creates the timers in a state where all threads of the restored process are held on a barrier and cannot issue syscalls. That means the restorer task has exclusive control.
This allows to address this issue with a prctl() so that the restorer thread can do:
if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON)) goto linear_mode; create_timers_with_explicit_ids(); prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF); This is backwards compatible because the prctl() fails on older kernels and CRIU can fall back to the linear timer ID mechanism. CRIU versions which do not know about the prctl() just work as before.
Implement the prctl() and modify timer_create() so that it copies the requested timer ID from userspace by utilizing the existing timer_t pointer, which is used to copy out the allocated timer ID on success.
If the prctl() is disabled, which it is by default, timer_create() works as before and does not try to read from the userspace pointer.
There is no problem when a broken or rogue user space application enables the prctl(). If the user space pointer does not contain a valid ID, then timer_create() fails. If the data is not initialized, but constains a random valid ID, timer_create() will create that random timer ID or fail if the ID is already given out. As CRIU must use the raw syscall to avoid manipulating the internal state of the restored process, this has no library dependencies and can be adopted by CRIU right away.
Recreating two timers with IDs 1000000 and 2000000 takes 1.5 seconds with the create/delete method. With the prctl() it takes 3 microseconds.
Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Reviewed-by: Cyrill Gorcunov <[email protected]> Tested-by: Cyrill Gorcunov <[email protected]> Link: https://lore.kernel.org/all/87jz8vz0en.ffs@tglx
show more ...
|
|
Revision tags: v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4 |
|
| #
09d6775f |
| 16-Oct-2024 |
Samuel Holland <[email protected]> |
riscv: Add support for userspace pointer masking
RISC-V supports pointer masking with a variable number of tag bits (which is called "PMLEN" in the specification) and which is configured at the next
riscv: Add support for userspace pointer masking
RISC-V supports pointer masking with a variable number of tag bits (which is called "PMLEN" in the specification) and which is configured at the next higher privilege level.
Wire up the PR_SET_TAGGED_ADDR_CTRL and PR_GET_TAGGED_ADDR_CTRL prctls so userspace can request a lower bound on the number of tag bits and determine the actual number of tag bits. As with arm64's PR_TAGGED_ADDR_ENABLE, the pointer masking configuration is thread-scoped, inherited on clone() and fork() and cleared on execve().
Reviewed-by: Charlie Jenkins <[email protected]> Tested-by: Charlie Jenkins <[email protected]> Signed-off-by: Samuel Holland <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc3, v6.12-rc2 |
|
| #
91e102e7 |
| 01-Oct-2024 |
Mark Brown <[email protected]> |
prctl: arch-agnostic prctl for shadow stack
Three architectures (x86, aarch64, riscv) have announced support for shadow stacks with fairly similar functionality. While x86 is using arch_prctl() to
prctl: arch-agnostic prctl for shadow stack
Three architectures (x86, aarch64, riscv) have announced support for shadow stacks with fairly similar functionality. While x86 is using arch_prctl() to control the functionality neither arm64 nor riscv uses that interface so this patch adds arch-agnostic prctl() support to get and set status of shadow stacks and lock the current configuation to prevent further changes, with support for turning on and off individual subfeatures so applications can limit their exposure to features that they do not need. The features are:
- PR_SHADOW_STACK_ENABLE: Tracking and enforcement of shadow stacks, including allocation of a shadow stack if one is not already allocated. - PR_SHADOW_STACK_WRITE: Writes to specific addresses in the shadow stack. - PR_SHADOW_STACK_PUSH: Push additional values onto the shadow stack.
These features are expected to be inherited by new threads and cleared on exec(), unknown features should be rejected for enable but accepted for locking (in order to allow for future proofing).
This is based on a patch originally written by Deepak Gupta but modified fairly heavily, support for indirect landing pads is removed, additional modes added and the locking interface reworked. The set status prctl() is also reworked to just set flags, if setting/reading the shadow stack pointer is required this could be a separate prctl.
Reviewed-by: Thiago Jung Bauermann <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Acked-by: Yury Khrustalev <[email protected]> Signed-off-by: Mark Brown <[email protected]> Reviewed-by: Deepak Gupta <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5 |
|
| #
628d701f |
| 17-Apr-2024 |
Benjamin Gray <[email protected]> |
powerpc/dexcr: Add DEXCR prctl interface
Now that we track a DEXCR on a per-task basis, individual tasks are free to configure it as they like.
The interface is a pair of getter/setter prctl's that
powerpc/dexcr: Add DEXCR prctl interface
Now that we track a DEXCR on a per-task basis, individual tasks are free to configure it as they like.
The interface is a pair of getter/setter prctl's that work on a single aspect at a time (multiple aspects at once is more difficult if there are different rules applied for each aspect, now or in future). The getter shows the current state of the process config, and the setter allows setting/clearing the aspect.
Signed-off-by: Benjamin Gray <[email protected]> [mpe: Account for PR_RISCV_SET_ICACHE_FLUSH_CTX, shrink some longs lines] Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
show more ...
|
|
Revision tags: v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1 |
|
| #
6b9391b5 |
| 12-Mar-2024 |
Charlie Jenkins <[email protected]> |
riscv: Include riscv_set_icache_flush_ctx prctl
Support new prctl with key PR_RISCV_SET_ICACHE_FLUSH_CTX to enable optimization of cross modifying code. This prctl enables userspace code to use icac
riscv: Include riscv_set_icache_flush_ctx prctl
Support new prctl with key PR_RISCV_SET_ICACHE_FLUSH_CTX to enable optimization of cross modifying code. This prctl enables userspace code to use icache flushing instructions such as fence.i with the guarantee that the icache will continue to be clean after thread migration.
Signed-off-by: Charlie Jenkins <[email protected]> Reviewed-by: Atish Patra <[email protected]> Reviewed-by: Alexandre Ghiti <[email protected]> Reviewed-by: Samuel Holland <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
show more ...
|
|
Revision tags: v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1 |
|
| #
24e41bf8 |
| 28-Aug-2023 |
Florent Revest <[email protected]> |
mm: add a NO_INHERIT flag to the PR_SET_MDWE prctl
This extends the current PR_SET_MDWE prctl arg with a bit to indicate that the process doesn't want MDWE protection to propagate to children.
To i
mm: add a NO_INHERIT flag to the PR_SET_MDWE prctl
This extends the current PR_SET_MDWE prctl arg with a bit to indicate that the process doesn't want MDWE protection to propagate to children.
To implement this no-inherit mode, the tag in current->mm->flags must be absent from MMF_INIT_MASK. This means that the encoding for "MDWE but without inherit" is different in the prctl than in the mm flags. This leads to a bit of bit-mangling in the prctl implementation.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Florent Revest <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Cc: Alexey Izbyshev <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Ayush Jain <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Joey Gouly <[email protected]> Cc: KP Singh <[email protected]> Cc: Mark Brown <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Szabolcs Nagy <[email protected]> Cc: Topi Miettinen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
0da66833 |
| 28-Aug-2023 |
Florent Revest <[email protected]> |
mm: make PR_MDWE_REFUSE_EXEC_GAIN an unsigned long
Defining a prctl flag as an int is a footgun because on a 64 bit machine and with a variadic implementation of prctl (like in musl and glibc), when
mm: make PR_MDWE_REFUSE_EXEC_GAIN an unsigned long
Defining a prctl flag as an int is a footgun because on a 64 bit machine and with a variadic implementation of prctl (like in musl and glibc), when used directly as a prctl argument, it can get casted to long with garbage upper bits which would result in unexpected behaviors.
This patch changes the constant to an unsigned long to eliminate that possibilities. This does not break UAPI.
I think that a stable backport would be "nice to have": to reduce the chances that users build binaries that could end up with garbage bits in their MDWE prctl arguments. We are not aware of anyone having yet encountered this corner case with MDWE prctls but a backport would reduce the likelihood it happens, since this sort of issues has happened with other prctls. But If this is perceived as a backporting burden, I suppose we could also live without a stable backport.
Link: https://lkml.kernel.org/r/[email protected] Fixes: b507808ebce2 ("mm: implement memory-deny-write-execute as a prctl") Signed-off-by: Florent Revest <[email protected]> Suggested-by: Alexey Izbyshev <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Kees Cook <[email protected]> Acked-by: Catalin Marinas <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Ayush Jain <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Joey Gouly <[email protected]> Cc: KP Singh <[email protected]> Cc: Mark Brown <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Szabolcs Nagy <[email protected]> Cc: Topi Miettinen <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6 |
|
| #
1fd96a3e |
| 05-Jun-2023 |
Andy Chiu <[email protected]> |
riscv: Add prctl controls for userspace vector management
This patch add two riscv-specific prctls, to allow usespace control the use of vector unit:
* PR_RISCV_V_SET_CONTROL: control the permissi
riscv: Add prctl controls for userspace vector management
This patch add two riscv-specific prctls, to allow usespace control the use of vector unit:
* PR_RISCV_V_SET_CONTROL: control the permission to use Vector at next, or all following execve for a thread. Turning off a thread's Vector live is not possible since libraries may have registered ifunc that may execute Vector instructions. * PR_RISCV_V_GET_CONTROL: get the same permission setting for the current thread, and the setting for following execve(s).
Signed-off-by: Andy Chiu <[email protected]> Reviewed-by: Greentime Hu <[email protected]> Reviewed-by: Vincent Chen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
show more ...
|
|
Revision tags: v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3 |
|
| #
d7597f59 |
| 18-Apr-2023 |
Stefan Roesch <[email protected]> |
mm: add new api to enable ksm per process
Patch series "mm: process/cgroup ksm support", v9.
So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more wo
mm: add new api to enable ksm per process
Patch series "mm: process/cgroup ksm support", v9.
So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level.
Use case 1: The madvise call is not available in the programming language. An example for this are programs with forked workloads using a garbage collected language without pointers. In such a language madvise cannot be made available.
In addition the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these type of workloads.
Use case 2: The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload by workload basis.
Use case 3: With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higler level entity like a job scheduler or container can know for certain if its running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls.
Security concerns:
In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, its the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable.
Performance:
Experiments with using UKSM have shown a capacity increase of around 20%.
Here are the metrics from an instagram workload (taken from a machine with 64GB main memory):
full_scans: 445 general_profit: 20158298048 max_page_sharing: 256 merge_across_nodes: 1 pages_shared: 129547 pages_sharing: 5119146 pages_to_scan: 4000 pages_unshared: 1760924 pages_volatile: 10761341 run: 1 sleep_millisecs: 20 stable_node_chains: 167 stable_node_chains_prune_millisecs: 2000 stable_node_dups: 2751 use_zero_pages: 0 zero_pages_sharing: 0
After the service is running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM.
Detailed changes:
1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be inherited by the new child process.
3. Add general_profit metric The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm.
4. Add more metrics to ksm_stat This adds the process profit metric to /proc/<pid>/ksm_stat.
5. Add more tests to ksm_tests and ksm_functional_tests This adds an option to specify the merge type to the ksm_tests. This allows to test madvise and prctl KSM.
It also adds a two new tests to ksm_functional_tests: one to test the new prctl options and the other one is a fork test to verify that the KSM process setting is inherited by client processes.
This patch (of 3):
So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level.
1. New options for prctl system command
This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be inherited by the new child process.
1) Introduce new MMF_VM_MERGE_ANY flag
This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all vma's of a process.
2) Setting VM_MERGEABLE on VMA creation
When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA.
3) support disabling of ksm for a process
This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl.
4) add new prctl option to get and set ksm for a process
This adds two new options to the prctl system call - enable ksm for all vmas of a process (if the vmas support it). - query if ksm has been enabled for a process.
3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
In the s390 architecture when storage keys are used, the MMF_VM_MERGE_ANY will be disabled.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Bagas Sanjaya <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc7, v6.3-rc6 |
|
| #
ddc65971 |
| 04-Apr-2023 |
Josh Triplett <[email protected]> |
prctl: add PR_GET_AUXV to copy auxv to userspace
If a library wants to get information from auxv (for instance, AT_HWCAP/AT_HWCAP2), it has a few options, none of them perfectly reliable or ideal:
prctl: add PR_GET_AUXV to copy auxv to userspace
If a library wants to get information from auxv (for instance, AT_HWCAP/AT_HWCAP2), it has a few options, none of them perfectly reliable or ideal:
- Be main or the pre-main startup code, and grub through the stack above main. Doesn't work for a library. - Call libc getauxval. Not ideal for libraries that are trying to be libc-independent and/or don't otherwise require anything from other libraries. - Open and read /proc/self/auxv. Doesn't work for libraries that may run in arbitrarily constrained environments that may not have /proc mounted (e.g. libraries that might be used by an init program or a container setup tool). - Assume you're on the main thread and still on the original stack, and try to walk the stack upwards, hoping to find auxv. Extremely bad idea. - Ask the caller to pass auxv in for you. Not ideal for a user-friendly library, and then your caller may have the same problem.
Add a prctl that copies current->mm->saved_auxv to a userspace buffer.
Link: https://lkml.kernel.org/r/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org Signed-off-by: Josh Triplett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5 |
|
| #
b507808e |
| 19-Jan-2023 |
Joey Gouly <[email protected]> |
mm: implement memory-deny-write-execute as a prctl
Patch series "mm: In-kernel support for memory-deny-write-execute (MDWE)", v2.
The background to this is that systemd has a configuration option c
mm: implement memory-deny-write-execute as a prctl
Patch series "mm: In-kernel support for memory-deny-write-execute (MDWE)", v2.
The background to this is that systemd has a configuration option called MemoryDenyWriteExecute [2], implemented as a SECCOMP BPF filter. Its aim is to prevent a user task from inadvertently creating an executable mapping that is (or was) writeable. Since such BPF filter is stateless, it cannot detect mappings that were previously writeable but subsequently changed to read-only. Therefore the filter simply rejects any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI support (Branch Target Identification), the dynamic loader cannot change an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). For libraries, it can resort to unmapping and re-mapping but for the main executable it does not have a file descriptor. The original bug report in the Red Hat bugzilla - [3] - and subsequent glibc workaround for libraries - [4].
This series adds in-kernel support for this feature as a prctl PR_SET_MDWE, that is inherited on fork(). The prctl denies PROT_WRITE | PROT_EXEC mappings. Like the systemd BPF filter it also denies adding PROT_EXEC to mappings. However unlike the BPF filter it only denies it if the mapping didn't previous have PROT_EXEC. This allows to PROT_EXEC -> PROT_EXEC | PROT_BTI with mprotect(), which is a problem with the BPF filter.
This patch (of 2):
The aim of such policy is to prevent a user task from creating an executable mapping that is also writeable.
An example of mmap() returning -EACCESS if the policy is enabled:
mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);
Similarly, mprotect() would return -EACCESS below:
addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
The BPF filter that systemd MDWE uses is stateless, and disallows mprotect() with PROT_EXEC completely. This new prctl allows PROT_EXEC to be enabled if it was already PROT_EXEC, which allows the following case:
addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);
where PROT_BTI enables branch tracking identification on arm64.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Joey Gouly <[email protected]> Co-developed-by: Catalin Marinas <[email protected]> Signed-off-by: Catalin Marinas <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Jeremy Linton <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lennart Poettering <[email protected]> Cc: Mark Brown <[email protected]> Cc: nd <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Szabolcs Nagy <[email protected]> Cc: Topi Miettinen <[email protected]> Cc: Zbigniew Jędrzejewski-Szmek <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4 |
|
| #
9e4ab6c8 |
| 19-Apr-2022 |
Mark Brown <[email protected]> |
arm64/sme: Implement vector length configuration prctl()s
As for SVE provide a prctl() interface which allows processes to configure their SME vector length.
Signed-off-by: Mark Brown <broonie@kern
arm64/sme: Implement vector length configuration prctl()s
As for SVE provide a prctl() interface which allows processes to configure their SME vector length.
Signed-off-by: Mark Brown <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]>
show more ...
|
|
Revision tags: v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8 |
|
| #
cf220ad6 |
| 09-Mar-2022 |
Mark Brown <[email protected]> |
arm64/mte: Remove asymmetric mode from the prctl() interface
As pointed out by Evgenii Stepanov one potential issue with the new ABI for enabling asymmetric is that if there are multiple places wher
arm64/mte: Remove asymmetric mode from the prctl() interface
As pointed out by Evgenii Stepanov one potential issue with the new ABI for enabling asymmetric is that if there are multiple places where MTE is configured in a process, some of which were compiled with the old prctl.h and some of which were compiled with the new prctl.h, there may be problems keeping track of which MTE modes are requested. For example some code may disable only sync and async modes leaving asymmetric mode enabled when it intended to fully disable MTE.
In order to avoid such mishaps remove asymmetric mode from the prctl(), instead implicitly allowing it if both sync and async modes are requested. This should not disrupt userspace since a process requesting both may already see a mix of sync and async modes due to differing defaults between CPUs or changes in default while the process is running but it does mean that userspace is unable to explicitly request asymmetric mode without changing the system default for CPUs.
Reported-by: Evgenii Stepanov <[email protected]> Signed-off-by: Mark Brown <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Reviewed-by: Evgenii Stepanov <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Joey Gouly <[email protected]> Cc: Branislav Rankov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
show more ...
|
|
Revision tags: v5.17-rc7, v5.17-rc6, v5.17-rc5 |
|
| #
766121ba |
| 16-Feb-2022 |
Mark Brown <[email protected]> |
arm64/mte: Add userspace interface for enabling asymmetric mode
The architecture provides an asymmetric mode for MTE where tag mismatches are checked asynchronously for stores but synchronously for
arm64/mte: Add userspace interface for enabling asymmetric mode
The architecture provides an asymmetric mode for MTE where tag mismatches are checked asynchronously for stores but synchronously for loads. Allow userspace processes to select this and make it available as a default mode via the existing per-CPU sysfs interface.
Since there PR_MTE_TCF_ values are a bitmask (allowing the kernel to choose between the multiple modes) and there are no free bits adjacent to the existing PR_MTE_TCF_ bits the set of bits used to specify the mode becomes disjoint. Programs using the new interface should be aware of this and programs that do not use it will not see any change in behaviour.
When userspace requests two possible modes but the system default for the CPU is the third mode (eg, default is synchronous but userspace requests either asynchronous or asymmetric) the preference order is:
ASYMM > ASYNC > SYNC
This situation is not currently possible since there are only two modes and it is mandatory to have a system default so there could be no ambiguity and there is no ABI change. The chosen order is basically arbitrary as we do not have a clear metric for what is better here.
If userspace requests specifically asymmetric mode via the prctl() and the system does not support it then we will return an error, this mirrors how we handle the case where userspace enables MTE on a system that does not support MTE at all and the behaviour that will be seen if running on an older kernel that does not support userspace use of asymmetric mode.
Attempts to set asymmetric mode as the default mode will result in an error if the system does not support it.
Signed-off-by: Mark Brown <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Reviewed-by: Vincenzo Frascino <[email protected]> Tested-by: Branislav Rankov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
show more ...
|
|
Revision tags: v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1 |
|
| #
9a10064f |
| 14-Jan-2022 |
Colin Cross <[email protected]> |
mm: add a field to store names for private anonymous memory
In many userspace applications, and especially in VM based applications like Android uses heavily, there are multiple different allocators
mm: add a field to store names for private anonymous memory
In many userspace applications, and especially in VM based applications like Android uses heavily, there are multiple different allocators in use. At a minimum there is libc malloc and the stack, and in many cases there are libc malloc, the stack, direct syscalls to mmap anonymous memory, and multiple VM heaps (one for small objects, one for big objects, etc.). Each of these layers usually has its own tools to inspect its usage; malloc by compiling a debug version, the VM through heap inspection tools, and for direct syscalls there is usually no way to track them.
On Android we heavily use a set of tools that use an extended version of the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped in userspace and slice their usage by process, shared (COW) vs. unique mappings, backing, etc. This can account for real physical memory usage even in cases like fork without exec (which Android uses heavily to share as many private COW pages as possible between processes), Kernel SamePage Merging, and clean zero pages. It produces a measurement of the pages that only exist in that process (USS, for unique), and a measurement of the physical memory usage of that process with the cost of shared pages being evenly split between processes that share them (PSS).
If all anonymous memory is indistinguishable then figuring out the real physical memory usage (PSS) of each heap requires either a pagemap walking tool that can understand the heap debugging of every layer, or for every layer's heap debugging tools to implement the pagemap walking logic, in which case it is hard to get a consistent view of memory across the whole system.
Tracking the information in userspace leads to all sorts of problems. It either needs to be stored inside the process, which means every process has to have an API to export its current heap information upon request, or it has to be stored externally in a filesystem that somebody needs to clean up on crashes. It needs to be readable while the process is still running, so it has to have some sort of synchronization with every layer of userspace. Efficiently tracking the ranges requires reimplementing something like the kernel vma trees, and linking to it from every layer of userspace. It requires more memory, more syscalls, more runtime cost, and more complexity to separately track regions that the kernel is already tracking.
This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a userspace-provided name for anonymous vmas. The names of named anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
Setting the name to NULL clears it. The name length limit is 80 bytes including NUL-terminator and is checked to contain only printable ascii characters (including space), except '[',']','\','$' and '`'.
Ascii strings are being used to have a descriptive identifiers for vmas, which can be understood by the users reading /proc/pid/maps or /proc/pid/smaps. Names can be standardized for a given system and they can include some variable parts such as the name of the allocator or a library, tid of the thread using it, etc.
The name is stored in a pointer in the shared union in vm_area_struct that points to a null terminated string. Anonymous vmas with the same name (equivalent strings) and are otherwise mergeable will be merged. The name pointers are not shared between vmas even if they contain the same name. The name pointer is stored in a union with fields that are only used on file-backed mappings, so it does not increase memory usage.
CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this feature. It keeps the feature disabled by default to prevent any additional memory overhead and to avoid confusing procfs parsers on systems which are not ready to support named anonymous vmas.
The patch is based on the original patch developed by Colin Cross, more specifically on its latest version [1] posted upstream by Sumit Semwal. It used a userspace pointer to store vma names. In that design, name pointers could be shared between vmas. However during the last upstreaming attempt, Kees Cook raised concerns [2] about this approach and suggested to copy the name into kernel memory space, perform validity checks [3] and store as a string referenced from vm_area_struct.
One big concern is about fork() performance which would need to strdup anonymous vma names. Dave Hansen suggested experimenting with worst-case scenario of forking a process with 64k vmas having longest possible names [4]. I ran this experiment on an ARM64 Android device and recorded a worst-case regression of almost 40% when forking such a process.
This regression is addressed in the followup patch which replaces the pointer to a name with a refcounted structure that allows sharing the name pointer between vmas of the same name. Instead of duplicating the string during fork() or when splitting a vma it increments the refcount.
[1] https://lore.kernel.org/linux-mm/[email protected]/ [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/ [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/ [4] https://lore.kernel.org/linux-mm/[email protected]/
Changes for prctl(2) manual page (in the options section):
PR_SET_VMA Sets an attribute specified in arg2 for virtual memory areas starting from the address specified in arg3 and spanning the size specified in arg4. arg5 specifies the value of the attribute to be set. Note that assigning an attribute to a virtual memory area might prevent it from being merged with adjacent virtual memory areas due to the difference in that attribute's value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME Set a name for anonymous virtual memory areas. arg5 should be a pointer to a null-terminated string containing the name. The name length including null byte cannot exceed 80 bytes. If arg5 is NULL, the name of the appropriate anonymous virtual memory areas will be reset. The name can contain only printable ascii characters (including space), except '[',']','\','$' and '`'.
This feature is available only if the kernel is built with the CONFIG_ANON_VMA_NAME option enabled.
[[email protected]: docs: proc.rst: /proc/PID/maps: fix malformed table] Link: https://lkml.kernel.org/r/[email protected] [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy, added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the work here was done by Colin Cross, therefore, with his permission, keeping him as the author]
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Colin Cross <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Al Viro <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jan Glauber <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Stultz <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rob Landley <[email protected]> Cc: "Serge E. Hallyn" <[email protected]> Cc: Shaohua Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1 |
|
| #
aedad3e1 |
| 05-Nov-2021 |
Peter Collingbourne <[email protected]> |
arm64: mte: change PR_MTE_TCF_NONE back into an unsigned long
This constant was previously an unsigned long, but was changed into an int in commit 433c38f40f6a ("arm64: mte: change ASYNC and SYNC TC
arm64: mte: change PR_MTE_TCF_NONE back into an unsigned long
This constant was previously an unsigned long, but was changed into an int in commit 433c38f40f6a ("arm64: mte: change ASYNC and SYNC TCF settings into bitfields"). This ended up causing spurious unsigned-signed comparison warnings in expressions such as:
(x & PR_MTE_TCF_MASK) != PR_MTE_TCF_NONE
Therefore, change it back into an unsigned long to silence these warnings.
Link: https://linux-review.googlesource.com/id/I07a72310db30227a5b7d789d0b817d78b657c639 Signed-off-by: Peter Collingbourne <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
show more ...
|
|
Revision tags: v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14 |
|
| #
61bc346c |
| 25-Aug-2021 |
Eugene Syromiatnikov <[email protected]> |
uapi/linux/prctl: provide macro definitions for the PR_SCHED_CORE type argument
Commit 7ac592aa35a684ff ("sched: prctl() core-scheduling interface") made use of enum pid_type in prctl's arg4; this t
uapi/linux/prctl: provide macro definitions for the PR_SCHED_CORE type argument
Commit 7ac592aa35a684ff ("sched: prctl() core-scheduling interface") made use of enum pid_type in prctl's arg4; this type and the associated enumeration definitions are not exposed to userspace. Christian has suggested to provide additional macro definitions that convey the meaning of the type argument more in alignment with its actual usage, and this patch does exactly that.
Link: https://lore.kernel.org/r/[email protected] Suggested-by: Christian Brauner <[email protected]> Acked-by: Christian Brauner <[email protected]> Signed-off-by: Eugene Syromiatnikov <[email protected]> Complements: 7ac592aa35a684ff ("sched: prctl() core-scheduling interface") Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4 |
|
| #
433c38f4 |
| 27-Jul-2021 |
Peter Collingbourne <[email protected]> |
arm64: mte: change ASYNC and SYNC TCF settings into bitfields
Allow the user program to specify both ASYNC and SYNC TCF modes by repurposing the existing constants as bitfields. This will allow the
arm64: mte: change ASYNC and SYNC TCF settings into bitfields
Allow the user program to specify both ASYNC and SYNC TCF modes by repurposing the existing constants as bitfields. This will allow the kernel to select one of the modes on behalf of the user program. With this patch the kernel will always select async mode, but a subsequent patch will make this configurable.
Link: https://linux-review.googlesource.com/id/Icc5923c85a8ea284588cc399ae74fd19ec291230 Signed-off-by: Peter Collingbourne <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Link: https://lore.kernel.org/r/[email protected] Acked-by: Will Deacon <[email protected]> Signed-off-by: Catalin Marinas <[email protected]>
show more ...
|
|
Revision tags: v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3 |
|
| #
e893bb1b |
| 08-Jan-2021 |
Balbir Singh <[email protected]> |
x86, prctl: Hook L1D flushing in via prctl
Use the existing PR_GET/SET_SPECULATION_CTRL API to expose the L1D flush capability. For L1D flushing PR_SPEC_FORCE_DISABLE and PR_SPEC_DISABLE_NOEXEC are
x86, prctl: Hook L1D flushing in via prctl
Use the existing PR_GET/SET_SPECULATION_CTRL API to expose the L1D flush capability. For L1D flushing PR_SPEC_FORCE_DISABLE and PR_SPEC_DISABLE_NOEXEC are not supported.
Enabling L1D flush does not check if the task is running on an SMT enabled core, rather a check is done at runtime (at the time of flush), if the task runs on a SMT sibling then the task is sent a SIGBUS which is executed before the task returns to user space or to a guest.
This is better than the other alternatives of:
a. Ensuring strict affinity of the task (hard to enforce without further changes in the scheduler)
b. Silently skipping flush for tasks that move to SMT enabled cores.
Hook up the core prctl and implement the x86 specific parts which in turn makes it functional.
Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Balbir Singh <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
7ac592aa |
| 24-Mar-2021 |
Chris Hyser <[email protected]> |
sched: prctl() core-scheduling interface
This patch provides support for setting and copying core scheduling 'task cookies' between threads (PID), processes (TGID), and process groups (PGID).
The v
sched: prctl() core-scheduling interface
This patch provides support for setting and copying core scheduling 'task cookies' between threads (PID), processes (TGID), and process groups (PGID).
The value of core scheduling isn't that tasks don't share a core, 'nosmt' can do that. The value lies in exploiting all the sharing opportunities that exist to recover possible lost performance and that requires a degree of flexibility in the API.
From a security perspective (and there are others), the thread, process and process group distinction is an existent hierarchal categorization of tasks that reflects many of the security concerns about 'data sharing'. For example, protecting against cache-snooping by a thread that can just read the memory directly isn't all that useful.
With this in mind, subcommands to CREATE/SHARE (TO/FROM) provide a mechanism to create and share cookies. CREATE/SHARE_TO specify a target pid with enum pidtype used to specify the scope of the targeted tasks. For example, PIDTYPE_TGID will share the cookie with the process and all of it's threads as typically desired in a security scenario.
API:
prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie) prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL) prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL) prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL)
where 'tgtpid/srcpid == 0' implies the current process and pidtype is kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.
For return values, EINVAL, ENOMEM are what they say. ESRCH means the tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission access to tgtpid/srcpid. ENODEV indicates your machines lacks SMT.
[peterz: complete rewrite] Signed-off-by: Chris Hyser <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Don Hiatt <[email protected]> Tested-by: Hongyu Ning <[email protected]> Tested-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
show more ...
|
| #
20169862 |
| 19-Mar-2021 |
Peter Collingbourne <[email protected]> |
arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS)
This change introduces a prctl that allows the user program to control which PAC keys are enabled in a particular task. The main reason why this
arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS)
This change introduces a prctl that allows the user program to control which PAC keys are enabled in a particular task. The main reason why this is useful is to enable a userspace ABI that uses PAC to sign and authenticate function pointers and other pointers exposed outside of the function, while still allowing binaries conforming to the ABI to interoperate with legacy binaries that do not sign or authenticate pointers.
The idea is that a dynamic loader or early startup code would issue this prctl very early after establishing that a process may load legacy binaries, but before executing any PAC instructions.
This change adds a small amount of overhead to kernel entry and exit due to additional required instruction sequences.
On a DragonBoard 845c (Cortex-A75) with the powersave governor, the overhead of similar instruction sequences was measured as 4.9ns when simulating the common case where IA is left enabled, or 43.7ns when simulating the uncommon case where IA is disabled. These numbers can be seen as the worst case scenario, since in more realistic scenarios a better performing governor would be used and a newer chip would be used that would support PAC unlike Cortex-A75 and would be expected to be faster than Cortex-A75.
On an Apple M1 under a hypervisor, the overhead of the entry/exit instruction sequences introduced by this patch was measured as 0.3ns in the case where IA is left enabled, and 33.0ns in the case where IA is disabled.
Signed-off-by: Peter Collingbourne <[email protected]> Reviewed-by: Dave Martin <[email protected]> Link: https://linux-review.googlesource.com/id/Ibc41a5e6a76b275efbaa126b31119dc197b927a5 Link: https://lore.kernel.org/r/d6609065f8f40397a4124654eb68c9f490b4d477.1616123271.git.pcc@google.com Signed-off-by: Catalin Marinas <[email protected]>
show more ...
|
| #
36a6c843 |
| 05-Feb-2021 |
Gabriel Krisman Bertazi <[email protected]> |
entry: Use different define for selector variable in SUD
Michael Kerrisk suggested that, from an API perspective, it is a bad idea to share the PR_SYS_DISPATCH_ defines between the prctl operation a
entry: Use different define for selector variable in SUD
Michael Kerrisk suggested that, from an API perspective, it is a bad idea to share the PR_SYS_DISPATCH_ defines between the prctl operation and the selector variable.
Therefore, define two new constants to be used by SUD's selector variable and update the corresponding documentation and test cases.
While this changes the API syscall user dispatch has never been part of a Linux release, it will show up for the first time in 5.11.
Suggested-by: Michael Kerrisk (man-pages) <[email protected]> Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6 |
|
| #
1446e1df |
| 27-Nov-2020 |
Gabriel Krisman Bertazi <[email protected]> |
kernel: Implement selective syscall userspace redirection
Introduce a mechanism to quickly disable/enable syscall handling for a specific process and redirect to userspace via SIGSYS. This is usefu
kernel: Implement selective syscall userspace redirection
Introduce a mechanism to quickly disable/enable syscall handling for a specific process and redirect to userspace via SIGSYS. This is useful for processes with parts that require syscall redirection and parts that don't, but who need to perform this boundary crossing really fast, without paying the cost of a system call to reconfigure syscall handling on each boundary transition. This is particularly important for Windows games running over Wine.
The proposed interface looks like this:
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
The range [<offset>,<offset>+<length>) is a part of the process memory map that is allowed to by-pass the redirection code and dispatch syscalls directly, such that in fast paths a process doesn't need to disable the trap nor the kernel has to check the selector. This is essential to return from SIGSYS to a blocked area without triggering another SIGSYS from rt_sigreturn.
selector is an optional pointer to a char-sized userspace memory region that has a key switch for the mechanism. This key switch is set to either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the redirection without calling the kernel.
The feature is meant to be set per-thread and it is disabled on fork/clone/execv.
Internally, this doesn't add overhead to the syscall hot path, and it requires very little per-architecture support. I avoided using seccomp, even though it duplicates some functionality, due to previous feedback that maybe it shouldn't mix with seccomp since it is not a security mechanism. And obviously, this should never be considered a security mechanism, since any part of the program can by-pass it by using the syscall dispatcher.
For the sysinfo benchmark, which measures the overhead added to executing a native syscall that doesn't require interception, the overhead using only the direct dispatcher region to issue syscalls is pretty much irrelevant. The overhead of using the selector goes around 40ns for a native (unredirected) syscall in my system, and it is (as expected) dominated by the supervisor-mode user-address access. In fact, with SMAP off, the overhead is consistently less than 5ns on my test box.
Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Andy Lutomirski <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1, v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1, v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5, v5.8-rc4, v5.8-rc3, v5.8-rc2, v5.8-rc1, v5.7, v5.7-rc7, v5.7-rc6, v5.7-rc5, v5.7-rc4, v5.7-rc3, v5.7-rc2, v5.7-rc1, v5.6, v5.6-rc7, v5.6-rc6, v5.6-rc5, v5.6-rc4, v5.6-rc3, v5.6-rc2, v5.6-rc1, v5.5, v5.5-rc7, v5.5-rc6, v5.5-rc5, v5.5-rc4, v5.5-rc3, v5.5-rc2 |
|
| #
af5ce952 |
| 10-Dec-2019 |
Catalin Marinas <[email protected]> |
arm64: mte: Allow user control of the generated random tags via prctl()
The IRG, ADDG and SUBG instructions insert a random tag in the resulting address. Certain tags can be excluded via the GCR_EL1
arm64: mte: Allow user control of the generated random tags via prctl()
The IRG, ADDG and SUBG instructions insert a random tag in the resulting address. Certain tags can be excluded via the GCR_EL1.Exclude bitmap when, for example, the user wants a certain colour for freed buffers. Since the GCR_EL1 register is not accessible at EL0, extend the prctl(PR_SET_TAGGED_ADDR_CTRL) interface to include a 16-bit field in the first argument for controlling which tags can be generated by the above instruction (an include rather than exclude mask). Note that by default all non-zero tags are excluded. This setting is per-thread.
Signed-off-by: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]>
show more ...
|
|
Revision tags: v5.5-rc1 |
|
| #
1c101da8 |
| 27-Nov-2019 |
Catalin Marinas <[email protected]> |
arm64: mte: Allow user control of the tag check mode via prctl()
By default, even if PROT_MTE is set on a memory range, there is no tag check fault reporting (SIGSEGV). Introduce a set of option to
arm64: mte: Allow user control of the tag check mode via prctl()
By default, even if PROT_MTE is set on a memory range, there is no tag check fault reporting (SIGSEGV). Introduce a set of option to the exiting prctl(PR_SET_TAGGED_ADDR_CTRL) to allow user control of the tag check fault mode:
PR_MTE_TCF_NONE - no reporting (default) PR_MTE_TCF_SYNC - synchronous tag check fault reporting PR_MTE_TCF_ASYNC - asynchronous tag check fault reporting
These options translate into the corresponding SCTLR_EL1.TCF0 bitfield, context-switched by the kernel. Note that the kernel accesses to the user address space (e.g. read() system call) are not checked if the user thread tag checking mode is PR_MTE_TCF_NONE or PR_MTE_TCF_ASYNC. If the tag checking mode is PR_MTE_TCF_SYNC, the kernel makes a best effort to check its user address accesses, however it cannot always guarantee it.
Signed-off-by: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]>
show more ...
|