|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7 |
|
| #
ec2d0c04 |
| 11-Mar-2025 |
Thomas Gleixner <[email protected]> |
posix-timers: Provide a mechanism to allocate a given timer ID
Checkpoint/Restore in Userspace (CRIU) requires to reconstruct posix timers with the same timer ID on restore. It uses sys_timer_create
posix-timers: Provide a mechanism to allocate a given timer ID
Checkpoint/Restore in Userspace (CRIU) requires to reconstruct posix timers with the same timer ID on restore. It uses sys_timer_create() and relies on the monotonic increasing timer ID provided by this syscall. It creates and deletes timers until the desired ID is reached. This is can loop for a long time, when the checkpointed process had a very sparse timer ID range.
It has been debated to implement a new syscall to allow the creation of timers with a given timer ID, but that's tideous due to the 32/64bit compat issues of sigevent_t and of dubious value.
The restore mechanism of CRIU creates the timers in a state where all threads of the restored process are held on a barrier and cannot issue syscalls. That means the restorer task has exclusive control.
This allows to address this issue with a prctl() so that the restorer thread can do:
if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON)) goto linear_mode; create_timers_with_explicit_ids(); prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF); This is backwards compatible because the prctl() fails on older kernels and CRIU can fall back to the linear timer ID mechanism. CRIU versions which do not know about the prctl() just work as before.
Implement the prctl() and modify timer_create() so that it copies the requested timer ID from userspace by utilizing the existing timer_t pointer, which is used to copy out the allocated timer ID on success.
If the prctl() is disabled, which it is by default, timer_create() works as before and does not try to read from the userspace pointer.
There is no problem when a broken or rogue user space application enables the prctl(). If the user space pointer does not contain a valid ID, then timer_create() fails. If the data is not initialized, but constains a random valid ID, timer_create() will create that random timer ID or fail if the ID is already given out. As CRIU must use the raw syscall to avoid manipulating the internal state of the restored process, this has no library dependencies and can be adopted by CRIU right away.
Recreating two timers with IDs 1000000 and 2000000 takes 1.5 seconds with the create/delete method. With the prctl() it takes 3 microseconds.
Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Reviewed-by: Cyrill Gorcunov <[email protected]> Tested-by: Cyrill Gorcunov <[email protected]> Link: https://lore.kernel.org/all/87jz8vz0en.ffs@tglx
show more ...
|
|
Revision tags: v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2 |
|
| #
7903f907 |
| 06-Feb-2025 |
Mateusz Guzik <[email protected]> |
pid: perform free_pid() calls outside of tasklist_lock
As the clone side already executes pid allocation with only pidmap_lock held, issuing free_pid() while still holding tasklist_lock exacerbates
pid: perform free_pid() calls outside of tasklist_lock
As the clone side already executes pid allocation with only pidmap_lock held, issuing free_pid() while still holding tasklist_lock exacerbates total hold time of the latter.
More things may show up later which require initial clean up with the lock held and allow finishing without it. For that reason a struct to collect such work is added instead of merely passing the pid array.
Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Acked-by: "Liam R. Howlett" <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7 |
|
| #
c38904eb |
| 08-Nov-2024 |
Marco Elver <[email protected]> |
tracing: Add task_prctl_unknown tracepoint
prctl() is a complex syscall which multiplexes its functionality based on a large set of PR_* options. Currently we count 64 such options. The return value
tracing: Add task_prctl_unknown tracepoint
prctl() is a complex syscall which multiplexes its functionality based on a large set of PR_* options. Currently we count 64 such options. The return value of unknown options is -EINVAL, and doesn't distinguish from known options that were passed invalid args that also return -EINVAL.
To understand if programs are attempting to use prctl() options not yet available on the running kernel, provide the task_prctl_unknown tracepoint.
Note, this tracepoint is in an unlikely cold path, and would therefore be suitable for continuous monitoring (e.g. via perf_event_open).
While the above is likely the simplest usecase, additionally this tracepoint can help unlock some testing scenarios (where probing sys_enter or sys_exit causes undesirable performance overheads):
a. unprivileged triggering of a test module: test modules may register a probe to be called back on task_prctl_unknown, and pick a very large unknown prctl() option upon which they perform a test function for an unprivileged user;
b. unprivileged triggering of an eBPF program function: similar as idea (a).
Example trace_pipe output:
test-380 [001] ..... 78.142904: task_prctl_unknown: option=1234 arg2=101 arg3=102 arg4=103 arg5=104
Signed-off-by: Marco Elver <[email protected]> Reviewed-by: Alexander Potapenko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1 |
|
| #
6348be02 |
| 20-Jul-2024 |
Al Viro <[email protected]> |
fdget(), trivial conversions
fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope.
Reviewed-by: Christian Brauner <[email protected]> Signed
fdget(), trivial conversions
fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope.
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
| #
91e102e7 |
| 01-Oct-2024 |
Mark Brown <[email protected]> |
prctl: arch-agnostic prctl for shadow stack
Three architectures (x86, aarch64, riscv) have announced support for shadow stacks with fairly similar functionality. While x86 is using arch_prctl() to
prctl: arch-agnostic prctl for shadow stack
Three architectures (x86, aarch64, riscv) have announced support for shadow stacks with fairly similar functionality. While x86 is using arch_prctl() to control the functionality neither arm64 nor riscv uses that interface so this patch adds arch-agnostic prctl() support to get and set status of shadow stacks and lock the current configuation to prevent further changes, with support for turning on and off individual subfeatures so applications can limit their exposure to features that they do not need. The features are:
- PR_SHADOW_STACK_ENABLE: Tracking and enforcement of shadow stacks, including allocation of a shadow stack if one is not already allocated. - PR_SHADOW_STACK_WRITE: Writes to specific addresses in the shadow stack. - PR_SHADOW_STACK_PUSH: Push additional values onto the shadow stack.
These features are expected to be inherited by new threads and cleared on exec(), unknown features should be rejected for enable but accepted for locking (in order to allow for future proofing).
This is based on a patch originally written by Deepak Gupta but modified fairly heavily, support for indirect landing pads is removed, additional modes added and the locking interface reworked. The set status prctl() is also reworked to just set flags, if setting/reading the shadow stack pointer is required this could be a separate prctl.
Reviewed-by: Thiago Jung Bauermann <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Acked-by: Yury Khrustalev <[email protected]> Signed-off-by: Mark Brown <[email protected]> Reviewed-by: Deepak Gupta <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]>
show more ...
|
| #
ed4fb6d7 |
| 14-Aug-2024 |
Felix Moessbauer <[email protected]> |
hrtimer: Use and report correct timerslack values for realtime tasks
The timerslack_ns setting is used to specify how much the hardware timers should be delayed, to potentially dispatch multiple tim
hrtimer: Use and report correct timerslack values for realtime tasks
The timerslack_ns setting is used to specify how much the hardware timers should be delayed, to potentially dispatch multiple timers in a single interrupt. This is a performance optimization. Timers of realtime tasks (having a realtime scheduling policy) should not be delayed.
This logic was inconsitently applied to the hrtimers, leading to delays of realtime tasks which used timed waits for events (e.g. condition variables). Due to the downstream override of the slack for rt tasks, the procfs reported incorrect (non-zero) timerslack_ns values.
This is changed by setting the timer_slack_ns task attribute to 0 for all tasks with a rt policy. By that, downstream users do not need to specially handle rt tasks (w.r.t. the slack), and the procfs entry shows the correct value of "0". Setting non-zero slack values (either via procfs or PR_SET_TIMERSLACK) on tasks with a rt policy is ignored, as stated in "man 2 PR_SET_TIMERSLACK":
Timer slack is not applied to threads that are scheduled under a real-time scheduling policy (see sched_setscheduler(2)).
The special handling of timerslack on rt tasks in downstream users is removed as well.
Signed-off-by: Felix Moessbauer <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
show more ...
|
|
Revision tags: v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2 |
|
| #
1da91ea8 |
| 31-May-2024 |
Al Viro <[email protected]> |
introduce fd_file(), convert all accessors to it.
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are ve
introduce fd_file(), convert all accessors to it.
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict]
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5 |
|
| #
628d701f |
| 17-Apr-2024 |
Benjamin Gray <[email protected]> |
powerpc/dexcr: Add DEXCR prctl interface
Now that we track a DEXCR on a per-task basis, individual tasks are free to configure it as they like.
The interface is a pair of getter/setter prctl's that
powerpc/dexcr: Add DEXCR prctl interface
Now that we track a DEXCR on a per-task basis, individual tasks are free to configure it as they like.
The interface is a pair of getter/setter prctl's that work on a single aspect at a time (multiple aspects at once is more difficult if there are different rules applied for each aspect, now or in future). The getter shows the current state of the process config, and the setter allows setting/clearing the aspect.
Signed-off-by: Benjamin Gray <[email protected]> [mpe: Account for PR_RISCV_SET_ICACHE_FLUSH_CTX, shrink some longs lines] Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
show more ...
|
|
Revision tags: v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1 |
|
| #
6b9391b5 |
| 12-Mar-2024 |
Charlie Jenkins <[email protected]> |
riscv: Include riscv_set_icache_flush_ctx prctl
Support new prctl with key PR_RISCV_SET_ICACHE_FLUSH_CTX to enable optimization of cross modifying code. This prctl enables userspace code to use icac
riscv: Include riscv_set_icache_flush_ctx prctl
Support new prctl with key PR_RISCV_SET_ICACHE_FLUSH_CTX to enable optimization of cross modifying code. This prctl enables userspace code to use icache flushing instructions such as fence.i with the guarantee that the icache will continue to be clean after thread migration.
Signed-off-by: Charlie Jenkins <[email protected]> Reviewed-by: Atish Patra <[email protected]> Reviewed-by: Alexandre Ghiti <[email protected]> Reviewed-by: Samuel Holland <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
show more ...
|
|
Revision tags: v6.8, v6.8-rc7 |
|
| #
d5aad4c2 |
| 27-Feb-2024 |
Zev Weiss <[email protected]> |
prctl: generalize PR_SET_MDWE support check to be per-arch
Patch series "ARM: prctl: Reject PR_SET_MDWE where not supported".
I noticed after a recent kernel update that my ARM926 system started se
prctl: generalize PR_SET_MDWE support check to be per-arch
Patch series "ARM: prctl: Reject PR_SET_MDWE where not supported".
I noticed after a recent kernel update that my ARM926 system started segfaulting on any execve() after calling prctl(PR_SET_MDWE). After some investigation it appears that ARMv5 is incapable of providing the appropriate protections for MDWE, since any readable memory is also implicitly executable.
The prctl_set_mdwe() function already had some special-case logic added disabling it on PARISC (commit 793838138c15, "prctl: Disable prctl(PR_SET_MDWE) on parisc"); this patch series (1) generalizes that check to use an arch_*() function, and (2) adds a corresponding override for ARM to disable MDWE on pre-ARMv6 CPUs.
With the series applied, prctl(PR_SET_MDWE) is rejected on ARMv5 and subsequent execve() calls (as well as mmap(PROT_READ|PROT_WRITE)) can succeed instead of unconditionally failing; on ARMv6 the prctl works as it did previously.
[0] https://lore.kernel.org/all/2023112456-linked-nape-bf19@gregkh/
This patch (of 2):
There exist systems other than PARISC where MDWE may not be feasible to support; rather than cluttering up the generic code with additional arch-specific logic let's add a generic function for checking MDWE support and allow each arch to override it as needed.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zev Weiss <[email protected]> Acked-by: Helge Deller <[email protected]> [parisc] Cc: Borislav Petkov <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Florent Revest <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Kees Cook <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Ondrej Mosnacek <[email protected]> Cc: Rick Edgecombe <[email protected]> Cc: Russell King (Oracle) <[email protected]> Cc: Sam James <[email protected]> Cc: Stefan Roesch <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yin Fengwei <[email protected]> Cc: <[email protected]> [6.3+] Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2 |
|
| #
f7ec1cd5 |
| 22-Jan-2024 |
Oleg Nesterov <[email protected]> |
getrusage: use sig->stats_lock rather than lock_task_sighand()
lock_task_sighand() can trigger a hard lockup. If NR_CPUS threads call getrusage() at the same time and the process has NR_THREADS, spi
getrusage: use sig->stats_lock rather than lock_task_sighand()
lock_task_sighand() can trigger a hard lockup. If NR_CPUS threads call getrusage() at the same time and the process has NR_THREADS, spin_lock_irq will spin with irqs disabled O(NR_CPUS * NR_THREADS) time.
Change getrusage() to use sig->stats_lock, it was specifically designed for this type of use. This way it runs lockless in the likely case.
TODO: - Change do_task_stat() to use sig->stats_lock too, then we can remove spin_lock_irq(siglock) in wait_task_zombie().
- Turn sig->stats_lock into seqcount_rwlock_t, this way the readers in the slow mode won't exclude each other. See https://lore.kernel.org/all/[email protected]/
- stats_lock has to disable irqs because ->siglock can be taken in irq context, it would be very nice to change __exit_signal() to avoid the siglock->stats_lock dependency.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Reported-by: Dylan Hatch <[email protected]> Tested-by: Dylan Hatch <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
daa694e4 |
| 22-Jan-2024 |
Oleg Nesterov <[email protected]> |
getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand()
Patch series "getrusage: use sig->stats_lock", v2.
This patch (of 2):
thread_group_cputime() does its own locking, w
getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand()
Patch series "getrusage: use sig->stats_lock", v2.
This patch (of 2):
thread_group_cputime() does its own locking, we can safely shift thread_group_cputime_adjusted() which does another for_each_thread loop outside of ->siglock protected section.
This is also preparation for the next patch which changes getrusage() to use stats_lock instead of siglock, thread_group_cputime() takes the same lock. With the current implementation recursive read_seqbegin_or_lock() is fine, thread_group_cputime() can't enter the slow mode if the caller holds stats_lock, yet this looks more safe and better performance-wise.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Reported-by: Dylan Hatch <[email protected]> Tested-by: Dylan Hatch <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2 |
|
| #
79383813 |
| 18-Nov-2023 |
Helge Deller <[email protected]> |
prctl: Disable prctl(PR_SET_MDWE) on parisc
systemd-254 tries to use prctl(PR_SET_MDWE) for it's MemoryDenyWriteExecute functionality, but fails on parisc which still needs executable stacks in cert
prctl: Disable prctl(PR_SET_MDWE) on parisc
systemd-254 tries to use prctl(PR_SET_MDWE) for it's MemoryDenyWriteExecute functionality, but fails on parisc which still needs executable stacks in certain combinations of gcc/glibc/kernel.
Disable prctl(PR_SET_MDWE) by returning -EINVAL for now on parisc, until userspace has catched up.
Signed-off-by: Helge Deller <[email protected]> Co-developed-by: Linus Torvalds <[email protected]> Reported-by: Sam James <[email protected]> Closes: https://github.com/systemd/systemd/issues/29775 Tested-by: Sam James <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Cc: <[email protected]> # v6.3+
show more ...
|
|
Revision tags: v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1 |
|
| #
24e41bf8 |
| 28-Aug-2023 |
Florent Revest <[email protected]> |
mm: add a NO_INHERIT flag to the PR_SET_MDWE prctl
This extends the current PR_SET_MDWE prctl arg with a bit to indicate that the process doesn't want MDWE protection to propagate to children.
To i
mm: add a NO_INHERIT flag to the PR_SET_MDWE prctl
This extends the current PR_SET_MDWE prctl arg with a bit to indicate that the process doesn't want MDWE protection to propagate to children.
To implement this no-inherit mode, the tag in current->mm->flags must be absent from MMF_INIT_MASK. This means that the encoding for "MDWE but without inherit" is different in the prctl than in the mm flags. This leads to a bit of bit-mangling in the prctl implementation.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Florent Revest <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Cc: Alexey Izbyshev <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Ayush Jain <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Joey Gouly <[email protected]> Cc: KP Singh <[email protected]> Cc: Mark Brown <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Szabolcs Nagy <[email protected]> Cc: Topi Miettinen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
13b7bc60 |
| 09-Sep-2023 |
Oleg Nesterov <[email protected]> |
getrusage: use __for_each_thread()
do/while_each_thread should be avoided when possible.
Plus this change allows to avoid lock_task_sighand(), we can use rcu and/or sig->stats_lock instead.
Link:
getrusage: use __for_each_thread()
do/while_each_thread should be avoided when possible.
Plus this change allows to avoid lock_task_sighand(), we can use rcu and/or sig->stats_lock instead.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Cc: Eric W. Biederman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
c7ac8231 |
| 09-Sep-2023 |
Oleg Nesterov <[email protected]> |
getrusage: add the "signal_struct *sig" local variable
No functional changes, cleanup/preparation.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <ol
getrusage: add the "signal_struct *sig" local variable
No functional changes, cleanup/preparation.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Cc: Eric W. Biederman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1 |
|
| #
636e3483 |
| 08-Jul-2023 |
Miguel Ojeda <[email protected]> |
prctl: move PR_GET_AUXV out of PR_MCE_KILL
Somehow PR_GET_AUXV got added into PR_MCE_KILL's switch when the patch was applied [1].
Thus move it out of the switch, to the place the patch added it.
prctl: move PR_GET_AUXV out of PR_MCE_KILL
Somehow PR_GET_AUXV got added into PR_MCE_KILL's switch when the patch was applied [1].
Thus move it out of the switch, to the place the patch added it.
In the recently released v6.4 kernel some user could, in principle, be already using this feature by mapping the right page and passing the PR_GET_AUXV constant as a pointer:
prctl(PR_MCE_KILL, PR_GET_AUXV, ...)
So this does change the behavior for users. We could keep the bug since the other subcases in PR_MCE_KILL (PR_MCE_KILL_CLEAR and PR_MCE_KILL_SET) do not overlap.
However, v6.4 may be recent enough (2 weeks old) that moving the lines (rather than just adding a new case) does not break anybody? Moreover, the documentation in man-pages was just committed today [2].
Link: https://lkml.kernel.org/r/[email protected] Fixes: ddc65971bb67 ("prctl: add PR_GET_AUXV to copy auxv to userspace") Link: https://lore.kernel.org/all/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org/ [1] Link: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=8cf0c06bfd3c2b219b044d4151c96f0da50af9ad [2] Signed-off-by: Miguel Ojeda <[email protected]> Cc: Josh Triplett <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.4, v6.4-rc7, v6.4-rc6 |
|
| #
1fd96a3e |
| 05-Jun-2023 |
Andy Chiu <[email protected]> |
riscv: Add prctl controls for userspace vector management
This patch add two riscv-specific prctls, to allow usespace control the use of vector unit:
* PR_RISCV_V_SET_CONTROL: control the permissi
riscv: Add prctl controls for userspace vector management
This patch add two riscv-specific prctls, to allow usespace control the use of vector unit:
* PR_RISCV_V_SET_CONTROL: control the permission to use Vector at next, or all following execve for a thread. Turning off a thread's Vector live is not possible since libraries may have registered ifunc that may execute Vector instructions. * PR_RISCV_V_GET_CONTROL: get the same permission setting for the current thread, and the setting for following execve(s).
Signed-off-by: Andy Chiu <[email protected]> Reviewed-by: Greentime Hu <[email protected]> Reviewed-by: Vincent Chen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
show more ...
|
|
Revision tags: v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3 |
|
| #
24139c07 |
| 22-Apr-2023 |
David Hildenbrand <[email protected]> |
mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0
Patch series "mm/ksm: improve PR_SET_MEMORY_MERGE=0 handling and cleanup disabling KSM", v2.
(1) Make PR_SET_MEMORY_MERGE=0
mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0
Patch series "mm/ksm: improve PR_SET_MEMORY_MERGE=0 handling and cleanup disabling KSM", v2.
(1) Make PR_SET_MEMORY_MERGE=0 unmerge pages like setting MADV_UNMERGEABLE does, (2) add a selftest for it and (3) factor out disabling of KSM from s390/gmap code.
This patch (of 3):
Let's unmerge any KSM pages when setting PR_SET_MEMORY_MERGE=0, and clear the VM_MERGEABLE flag from all VMAs -- just like KSM would. Of course, only do that if we previously set PR_SET_MEMORY_MERGE=1.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Stefan Roesch <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
d7597f59 |
| 18-Apr-2023 |
Stefan Roesch <[email protected]> |
mm: add new api to enable ksm per process
Patch series "mm: process/cgroup ksm support", v9.
So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more wo
mm: add new api to enable ksm per process
Patch series "mm: process/cgroup ksm support", v9.
So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level.
Use case 1: The madvise call is not available in the programming language. An example for this are programs with forked workloads using a garbage collected language without pointers. In such a language madvise cannot be made available.
In addition the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these type of workloads.
Use case 2: The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload by workload basis.
Use case 3: With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higler level entity like a job scheduler or container can know for certain if its running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls.
Security concerns:
In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, its the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable.
Performance:
Experiments with using UKSM have shown a capacity increase of around 20%.
Here are the metrics from an instagram workload (taken from a machine with 64GB main memory):
full_scans: 445 general_profit: 20158298048 max_page_sharing: 256 merge_across_nodes: 1 pages_shared: 129547 pages_sharing: 5119146 pages_to_scan: 4000 pages_unshared: 1760924 pages_volatile: 10761341 run: 1 sleep_millisecs: 20 stable_node_chains: 167 stable_node_chains_prune_millisecs: 2000 stable_node_dups: 2751 use_zero_pages: 0 zero_pages_sharing: 0
After the service is running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM.
Detailed changes:
1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be inherited by the new child process.
3. Add general_profit metric The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm.
4. Add more metrics to ksm_stat This adds the process profit metric to /proc/<pid>/ksm_stat.
5. Add more tests to ksm_tests and ksm_functional_tests This adds an option to specify the merge type to the ksm_tests. This allows to test madvise and prctl KSM.
It also adds a two new tests to ksm_functional_tests: one to test the new prctl options and the other one is a fork test to verify that the KSM process setting is inherited by client processes.
This patch (of 3):
So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level.
1. New options for prctl system command
This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be inherited by the new child process.
1) Introduce new MMF_VM_MERGE_ANY flag
This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all vma's of a process.
2) Setting VM_MERGEABLE on VMA creation
When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA.
3) support disabling of ksm for a process
This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl.
4) add new prctl option to get and set ksm for a process
This adds two new options to the prctl system call - enable ksm for all vmas of a process (if the vmas support it). - query if ksm has been enabled for a process.
3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
In the s390 architecture when storage keys are used, the MMF_VM_MERGE_ANY will be disabled.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Bagas Sanjaya <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc7, v6.3-rc6 |
|
| #
ddc65971 |
| 04-Apr-2023 |
Josh Triplett <[email protected]> |
prctl: add PR_GET_AUXV to copy auxv to userspace
If a library wants to get information from auxv (for instance, AT_HWCAP/AT_HWCAP2), it has a few options, none of them perfectly reliable or ideal:
prctl: add PR_GET_AUXV to copy auxv to userspace
If a library wants to get information from auxv (for instance, AT_HWCAP/AT_HWCAP2), it has a few options, none of them perfectly reliable or ideal:
- Be main or the pre-main startup code, and grub through the stack above main. Doesn't work for a library. - Call libc getauxval. Not ideal for libraries that are trying to be libc-independent and/or don't otherwise require anything from other libraries. - Open and read /proc/self/auxv. Doesn't work for libraries that may run in arbitrarily constrained environments that may not have /proc mounted (e.g. libraries that might be used by an init program or a container setup tool). - Assume you're on the main thread and still on the original stack, and try to walk the stack upwards, hoping to find auxv. Extremely bad idea. - Ask the caller to pass auxv in for you. Not ideal for a user-friendly library, and then your caller may have the same problem.
Add a prctl that copies current->mm->saved_auxv to a userspace buffer.
Link: https://lkml.kernel.org/r/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org Signed-off-by: Josh Triplett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2 |
|
| #
659c0ce1 |
| 17-Feb-2023 |
Ondrej Mosnacek <[email protected]> |
kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
Linux Security Modules (LSMs) that implement the "capable" hook will usually emit an access denial message to the audit log wheneve
kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
Linux Security Modules (LSMs) that implement the "capable" hook will usually emit an access denial message to the audit log whenever they "block" the current task from using the given capability based on their security policy.
The occurrence of a denial is used as an indication that the given task has attempted an operation that requires the given access permission, so the callers of functions that perform LSM permission checks must take care to avoid calling them too early (before it is decided if the permission is actually needed to perform the requested operation).
The __sys_setres[ug]id() functions violate this convention by first calling ns_capable_setid() and only then checking if the operation requires the capability or not. It means that any caller that has the capability granted by DAC (task's capability set) but not by MAC (LSMs) will generate a "denied" audit record, even if is doing an operation for which the capability is not required.
Fix this by reordering the checks such that ns_capable_setid() is checked last and -EPERM is returned immediately if it returns false.
While there, also do two small optimizations: * move the capability check before prepare_creds() and * bail out early in case of a no-op.
Link: https://lkml.kernel.org/r/[email protected] Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Ondrej Mosnacek <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5 |
|
| #
b507808e |
| 19-Jan-2023 |
Joey Gouly <[email protected]> |
mm: implement memory-deny-write-execute as a prctl
Patch series "mm: In-kernel support for memory-deny-write-execute (MDWE)", v2.
The background to this is that systemd has a configuration option c
mm: implement memory-deny-write-execute as a prctl
Patch series "mm: In-kernel support for memory-deny-write-execute (MDWE)", v2.
The background to this is that systemd has a configuration option called MemoryDenyWriteExecute [2], implemented as a SECCOMP BPF filter. Its aim is to prevent a user task from inadvertently creating an executable mapping that is (or was) writeable. Since such BPF filter is stateless, it cannot detect mappings that were previously writeable but subsequently changed to read-only. Therefore the filter simply rejects any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI support (Branch Target Identification), the dynamic loader cannot change an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). For libraries, it can resort to unmapping and re-mapping but for the main executable it does not have a file descriptor. The original bug report in the Red Hat bugzilla - [3] - and subsequent glibc workaround for libraries - [4].
This series adds in-kernel support for this feature as a prctl PR_SET_MDWE, that is inherited on fork(). The prctl denies PROT_WRITE | PROT_EXEC mappings. Like the systemd BPF filter it also denies adding PROT_EXEC to mappings. However unlike the BPF filter it only denies it if the mapping didn't previous have PROT_EXEC. This allows to PROT_EXEC -> PROT_EXEC | PROT_BTI with mprotect(), which is a problem with the BPF filter.
This patch (of 2):
The aim of such policy is to prevent a user task from creating an executable mapping that is also writeable.
An example of mmap() returning -EACCESS if the policy is enabled:
mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);
Similarly, mprotect() would return -EACCESS below:
addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
The BPF filter that systemd MDWE uses is stateless, and disallows mprotect() with PROT_EXEC completely. This new prctl allows PROT_EXEC to be enabled if it was already PROT_EXEC, which allows the following case:
addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);
where PROT_BTI enables branch tracking identification on arm64.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Joey Gouly <[email protected]> Co-developed-by: Catalin Marinas <[email protected]> Signed-off-by: Catalin Marinas <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Jeremy Linton <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lennart Poettering <[email protected]> Cc: Mark Brown <[email protected]> Cc: nd <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Szabolcs Nagy <[email protected]> Cc: Topi Miettinen <[email protected]> Cc: Zbigniew Jędrzejewski-Szmek <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
73979060 |
| 20-Jan-2023 |
Greg Kroah-Hartman <[email protected]> |
prlimit: do_prlimit needs to have a speculation check
do_prlimit() adds the user-controlled resource value to a pointer that will subsequently be dereferenced. In order to help prevent this codepat
prlimit: do_prlimit needs to have a speculation check
do_prlimit() adds the user-controlled resource value to a pointer that will subsequently be dereferenced. In order to help prevent this codepath from being used as a spectre "gadget" a barrier needs to be added after checking the range.
Reported-by: Jordy Zomer <[email protected]> Tested-by: Jordy Zomer <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
show more ...
|
|
Revision tags: v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0 |
|
| #
37608ba3 |
| 27-Sep-2022 |
Jason A. Donenfeld <[email protected]> |
utsname: contribute changes to RNG
On some small machines with little entropy, a quasi-unique hostname is sometimes a relevant factor. I've seen, for example, 8 character alpha-numeric serial number
utsname: contribute changes to RNG
On some small machines with little entropy, a quasi-unique hostname is sometimes a relevant factor. I've seen, for example, 8 character alpha-numeric serial numbers. In addition, the time at which the hostname is set is usually a decent measurement of how long early boot took. So, call add_device_randomness() on new hostnames, which feeds its arguments to the RNG in addition to a fresh cycle counter.
Low cost hooks like this never hurt and can only ever help, and since this costs basically nothing for an operation that is never a fast path, this is an overall easy win.
Cc: Andrew Morton <[email protected]> Cc: Dominik Brodowski <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
show more ...
|