|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1 |
|
| #
169eae77 |
| 27-Mar-2025 |
Mathieu Desnoyers <[email protected]> |
rseq: Eliminate useless task_work on execve
Eliminate a useless task_work on execve by moving the call to rseq_set_notify_resume() from sched_mm_cid_after_execve() to the error path of bprm_execve()
rseq: Eliminate useless task_work on execve
Eliminate a useless task_work on execve by moving the call to rseq_set_notify_resume() from sched_mm_cid_after_execve() to the error path of bprm_execve().
The call to rseq_set_notify_resume() from sched_mm_cid_after_execve() is pointless in the success case, because rseq_execve() will clear the rseq pointer before returning to userspace.
sched_mm_cid_after_execve() is called from both the success and error paths of bprm_execve(). The call to rseq_set_notify_resume() is needed on error because the mm_cid may have changed.
Also move the rseq_execve() to right after sched_mm_cid_after_execve() in bprm_execve().
[ mingo: Merged to a recent upstream kernel, extended the changelog. ]
Signed-off-by: Mathieu Desnoyers <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
af7bb0d2 |
| 24-Mar-2025 |
Oleg Nesterov <[email protected]> |
exec: fix the racy usage of fs_struct->in_exec
check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve() paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it f
exec: fix the racy usage of fs_struct->in_exec
check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve() paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it fails we have the following race:
T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex
T2 sets fs->in_exec = 1
T1 clears fs->in_exec
T2 continues with fs->in_exec == 0
Change fs/exec.c to clear fs->in_exec with cred_guard_mutex held.
Reported-by: [email protected] Closes: https://lore.kernel.org/all/[email protected]/ Cc: [email protected] Signed-off-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5 |
|
| #
cc9554e6 |
| 23-Feb-2025 |
Yonatan Goldschmidt <[email protected]> |
binfmt: Remove loader from linux_binprm struct
Commit 987f20a9dcce ("a.out: Remove the a.out implementation") removed the last in-tree user of the loader field, and as far as I can tell, it was the
binfmt: Remove loader from linux_binprm struct
Commit 987f20a9dcce ("a.out: Remove the a.out implementation") removed the last in-tree user of the loader field, and as far as I can tell, it was the only one historically.
Signed-off-by: Yonatan Goldschmidt <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1 |
|
| #
1751f872 |
| 28-Jan-2025 |
Joel Granados <[email protected]> |
treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysc
treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/inifiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers.
Created this by running an spatch followed by a sed command: Spatch: virtual patch
@ depends on !(file in "net") disable optional_qualifier @
identifier table_name != { watchdog_hardlockup_sysctl, iwcm_ctl_table, ucma_ctl_table, memory_allocation_profiling_sysctls, loadpin_sysctl_table }; @@
+ const struct ctl_table table_name [] = { ... };
sed: sed --in-place \ -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \ kernel/utsname_sysctl.c
Reviewed-by: Song Liu <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]> # for kernel/trace/ Reviewed-by: Martin K. Petersen <[email protected]> # SCSI Reviewed-by: Darrick J. Wong <[email protected]> # xfs Acked-by: Jani Nikula <[email protected]> Acked-by: Corey Minyard <[email protected]> Acked-by: Wei Liu <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Reviewed-by: Bill O'Donnell <[email protected]> Acked-by: Baoquan He <[email protected]> Acked-by: Ashutosh Dixit <[email protected]> Acked-by: Anna Schumaker <[email protected]> Signed-off-by: Joel Granados <[email protected]>
show more ...
|
|
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2 |
|
| #
7a571499 |
| 03-Dec-2024 |
Lorenzo Stoakes <[email protected]> |
mm: abstract get_arg_page() stack expansion and mmap read lock
Right now fs/exec.c invokes expand_downwards(), an otherwise internal implementation detail of the VMA logic in order to ensure that an
mm: abstract get_arg_page() stack expansion and mmap read lock
Right now fs/exec.c invokes expand_downwards(), an otherwise internal implementation detail of the VMA logic in order to ensure that an arg page can be obtained by get_user_pages_remote().
In order to be able to move the stack expansion logic into mm/vma.c to make it available to userland testing we need to find an alternative approach here.
We do so by providing the mmap_read_lock_maybe_expand() function which also helpfully documents what get_arg_page() is doing here and adds an additional check against VM_GROWSDOWN to make explicit that the stack expansion logic is only invoked when the VMA is indeed a downward-growing stack.
This allows expand_downwards() to become a static function.
Importantly, the VMA referenced by mmap_read_maybe_expand() must NOT be currently user-visible in any way, that is place within an rmap or VMA tree. It must be a newly allocated VMA.
This is the case when exec invokes this function.
Link: https://lkml.kernel.org/r/5295d1c70c58e6aa63d14be68d4e1de9fa1c8e6d.1733248985.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kees Cook <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
a5874fde |
| 12-Dec-2024 |
Mickaël Salaün <[email protected]> |
exec: Add a new AT_EXECVE_CHECK flag to execveat(2)
Add a new AT_EXECVE_CHECK flag to execveat(2) to check if a file would be allowed for execution. The main use case is for script interpreters and
exec: Add a new AT_EXECVE_CHECK flag to execveat(2)
Add a new AT_EXECVE_CHECK flag to execveat(2) to check if a file would be allowed for execution. The main use case is for script interpreters and dynamic linkers to check execution permission according to the kernel's security policy. Another use case is to add context to access logs e.g., which script (instead of interpreter) accessed a file. As any executable code, scripts could also use this check [1].
This is different from faccessat(2) + X_OK which only checks a subset of access rights (i.e. inode permission and mount options for regular files), but not the full context (e.g. all LSM access checks). The main use case for access(2) is for SUID processes to (partially) check access on behalf of their caller. The main use case for execveat(2) + AT_EXECVE_CHECK is to check if a script execution would be allowed, according to all the different restrictions in place. Because the use of AT_EXECVE_CHECK follows the exact kernel semantic as for a real execution, user space gets the same error codes.
An interesting point of using execveat(2) instead of openat2(2) is that it decouples the check from the enforcement. Indeed, the security check can be logged (e.g. with audit) without blocking an execution environment not yet ready to enforce a strict security policy.
LSMs can control or log execution requests with security_bprm_creds_for_exec(). However, to enforce a consistent and complete access control (e.g. on binary's dependencies) LSMs should restrict file executability, or measure executed files, with security_file_open() by checking file->f_flags & __FMODE_EXEC.
Because AT_EXECVE_CHECK is dedicated to user space interpreters, it doesn't make sense for the kernel to parse the checked files, look for interpreters known to the kernel (e.g. ELF, shebang), and return ENOEXEC if the format is unknown. Because of that, security_bprm_check() is never called when AT_EXECVE_CHECK is used.
It should be noted that script interpreters cannot directly use execveat(2) (without this new AT_EXECVE_CHECK flag) because this could lead to unexpected behaviors e.g., `python script.sh` could lead to Bash being executed to interpret the script. Unlike the kernel, script interpreters may just interpret the shebang as a simple comment, which should not change for backward compatibility reasons.
Because scripts or libraries files might not currently have the executable permission set, or because we might want specific users to be allowed to run arbitrary scripts, the following patch provides a dynamic configuration mechanism with the SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE securebits.
This is a redesign of the CLIP OS 4's O_MAYEXEC: https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d6b684752e403b4e41b39f7004d88e561/1901_open_mayexec.patch This patch has been used for more than a decade with customized script interpreters. Some examples can be found here: https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC
Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Kees Cook <[email protected]> Acked-by: Paul Moore <[email protected]> Reviewed-by: Serge Hallyn <[email protected]> Reviewed-by: Jeff Xu <[email protected]> Tested-by: Jeff Xu <[email protected]> Link: https://docs.python.org/3/library/io.html#io.open_code [1] Signed-off-by: Mickaël Salaün <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc1 |
|
| #
543841d1 |
| 21-Nov-2024 |
Kees Cook <[email protected]> |
exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case
Zbigniew mentioned at Linux Plumber's that systemd is interested in switching to execveat() for service execution, but can't, because
exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case
Zbigniew mentioned at Linux Plumber's that systemd is interested in switching to execveat() for service execution, but can't, because the contents of /proc/pid/comm are the file descriptor which was used, instead of the path to the binary[1]. This makes the output of tools like top and ps useless, especially in a world where most fds are opened CLOEXEC so the number is truly meaningless.
When the filename passed in is empty (e.g. with AT_EMPTY_PATH), use the dentry's filename for "comm" instead of using the useless numeral from the synthetic fdpath construction. This way the actual exec machinery is unchanged, but cosmetically the comm looks reasonable to admins investigating things.
Instead of adding TASK_COMM_LEN more bytes to bprm, use one of the unused flag bits to indicate that we need to set "comm" from the dentry.
Suggested-by: Zbigniew Jędrzejewski-Szmek <[email protected]> Suggested-by: Tycho Andersen <[email protected]> Suggested-by: Al Viro <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Link: https://github.com/uapi-group/kernel-features#set-comm-field-before-exec [1] Reviewed-by: Aleksa Sarai <[email protected]> Tested-by: Zbigniew Jędrzejewski-Szmek <[email protected]> Signed-off-by: Kees Cook <[email protected]>
show more ...
|
| #
3a3f61ce |
| 30-Nov-2024 |
Kees Cook <[email protected]> |
exec: Make sure task->comm is always NUL-terminated
Using strscpy() meant that the final character in task->comm may be non-NUL for a moment before the "string too long" truncation happens.
Instead
exec: Make sure task->comm is always NUL-terminated
Using strscpy() meant that the final character in task->comm may be non-NUL for a moment before the "string too long" truncation happens.
Instead of adding a new use of the ambiguous strncpy(), we'd want to use memtostr_pad() which enforces being able to check at compile time that sizes are sensible, but this requires being able to see string buffer lengths. Instead of trying to inline __set_task_comm() (which needs to call trace and perf functions), just open-code it. But to make sure we're always safe, add compile-time checking like we already do for get_task_comm().
Suggested-by: Linus Torvalds <[email protected]> Suggested-by: "Eric W. Biederman" <[email protected]> Signed-off-by: Kees Cook <[email protected]>
show more ...
|
| #
0357ef03 |
| 28-Nov-2024 |
Amir Goldstein <[email protected]> |
fs: don't block write during exec on pre-content watched files
Commit 2a010c412853 ("fs: don't block i_writecount during exec") removed the legacy behavior of getting ETXTBSY on attempt to open and
fs: don't block write during exec on pre-content watched files
Commit 2a010c412853 ("fs: don't block i_writecount during exec") removed the legacy behavior of getting ETXTBSY on attempt to open and executable file for write while it is being executed.
This commit was reverted because an application that depends on this legacy behavior was broken by the change.
We need to allow HSM writing into executable files while executed to fill their content on-the-fly.
To that end, disable the ETXTBSY legacy behavior for files that are watched by pre-content events.
This change is not expected to cause regressions with existing systems which do not have any pre-content event listeners.
Signed-off-by: Amir Goldstein <[email protected]> Acked-by: Christian Brauner <[email protected]> Signed-off-by: Jan Kara <[email protected]> Link: https://patch.msgid.link/[email protected]
show more ...
|
|
Revision tags: v6.12 |
|
| #
fa1bdca9 |
| 16-Nov-2024 |
Nir Lichtman <[email protected]> |
exec: remove legacy custom binfmt modules autoloading
Problem: The search binary handler logic contains legacy code to handle automatically loading kernel modules of unsupported binary formats. This
exec: remove legacy custom binfmt modules autoloading
Problem: The search binary handler logic contains legacy code to handle automatically loading kernel modules of unsupported binary formats. This logic is a leftover from a.out-to-ELF transition. After removal of a.out support, this code has no use anymore.
Solution: Clean up this code from the search binary handler, also remove the line initialising retval to -ENOENT and instead just return -ENOEXEC if the flow has reached the end of the func.
Note: Anyone who might find future uses for this legacy code would be better off using binfmt_misc to trigger whatever module loading they might need - would be more flexible that way.
Suggested-by: Alexander Viro <[email protected]> Signed-off-by: Nir Lichtman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc7, v6.12-rc6 |
|
| #
4188fc31 |
| 02-Nov-2024 |
[email protected] <[email protected]> |
exec: move warning of null argv to be next to the relevant code
Problem: The warning is currently printed where it is detected that the arg count is zero but the action is only taken place later in
exec: move warning of null argv to be next to the relevant code
Problem: The warning is currently printed where it is detected that the arg count is zero but the action is only taken place later in the flow even though the warning is written as if the action is taken place in the time of print
This could be problematic since there could be a failure between the print and the code that takes action which would deem this warning misleading
Solution: Move the warning print after the action of adding an empty string as the first argument is successful
Signed-off-by: Nir Lichtman <[email protected]> Link: https://lore.kernel.org/r/ZyYUgiPc8A8i_3FH@nirs-laptop. Signed-off-by: Kees Cook <[email protected]>
show more ...
|
| #
3b832035 |
| 27-Nov-2024 |
Christian Brauner <[email protected]> |
Revert "fs: don't block i_writecount during exec"
This reverts commit 2a010c41285345da60cece35575b4e0af7e7bf44.
Rui Ueyama <[email protected]> writes:
> I'm the creator and the maintainer of the mo
Revert "fs: don't block i_writecount during exec"
This reverts commit 2a010c41285345da60cece35575b4e0af7e7bf44.
Rui Ueyama <[email protected]> writes:
> I'm the creator and the maintainer of the mold linker > (https://github.com/rui314/mold). Recently, we discovered that mold > started causing process crashes in certain situations due to a change > in the Linux kernel. Here are the details: > > - In general, overwriting an existing file is much faster than > creating an empty file and writing to it on Linux, so mold attempts to > reuse an existing executable file if it exists. > > - If a program is running, opening the executable file for writing > previously failed with ETXTBSY. If that happens, mold falls back to > creating a new file. > > - However, the Linux kernel recently changed the behavior so that > writing to an executable file is now always permitted > (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a010c412853). > > That caused mold to write to an executable file even if there's a > process running that file. Since changes to mmap'ed files are > immediately visible to other processes, any processes running that > file would almost certainly crash in a very mysterious way. > Identifying the cause of these random crashes took us a few days. > > Rejecting writes to an executable file that is currently running is a > well-known behavior, and Linux had operated that way for a very long > time. So, I don’t believe relying on this behavior was our mistake; > rather, I see this as a regression in the Linux kernel.
Quoting myself from commit 2a010c412853 ("fs: don't block i_writecount during exec")
> Yes, someone in userspace could potentially be relying on this. It's not > completely out of the realm of possibility but let's find out if that's > actually the case and not guess.
It seems we found out that someone is relying on this obscure behavior. So revert the change.
Link: https://github.com/rui314/mold/issues/1361 Link: https://lore.kernel.org/r/[email protected] Cc: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc5, v6.12-rc4, v6.12-rc3 |
|
| #
4cc0473d |
| 07-Oct-2024 |
Yafang Shao <[email protected]> |
get rid of __get_task_comm()
Patch series "Improve the copy of task comm", v8.
Using {memcpy,strncpy,strcpy,kstrdup} to copy the task comm relies on the length of task comm. Changes in the task co
get rid of __get_task_comm()
Patch series "Improve the copy of task comm", v8.
Using {memcpy,strncpy,strcpy,kstrdup} to copy the task comm relies on the length of task comm. Changes in the task comm could result in a destination string that is overflow. Therefore, we should explicitly ensure the destination string is always NUL-terminated, regardless of the task comm. This approach will facilitate future extensions to the task comm.
As suggested by Linus [0], we can identify all relevant code with the following git grep command:
git grep 'memcpy.*->comm\>' git grep 'kstrdup.*->comm\>' git grep 'strncpy.*->comm\>' git grep 'strcpy.*->comm\>'
PATCH #2~#4: memcpy PATCH #5~#6: kstrdup PATCH #7: strcpy
Please note that strncpy() is not included in this series as it is being tracked by another effort. [1]
This patch (of 7):
We want to eliminate the use of __get_task_comm() for the following reasons:
- The task_lock() is unnecessary Quoted from Linus [0]: : Since user space can randomly change their names anyway, using locking : was always wrong for readers (for writers it probably does make sense : to have some lock - although practically speaking nobody cares there : either, but at least for a writer some kind of race could have : long-term mixed results
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/CAHk-=wivfrF0_zvf+oj6==Sh=-npJooP8chLPEfaFV0oNYTTBA@mail.gmail.com [0] Link: https://lore.kernel.org/all/CAHk-=whWtUC-AjmGJveAETKOMeMFSTwKwu99v7+b6AyHMmaDFA@mail.gmail.com/ Link: https://lore.kernel.org/all/CAHk-=wjAmmHUg6vho1KjzQi2=psR30+CogFd4aXrThr2gsiS4g@mail.gmail.com/ [0] Link: https://github.com/KSPP/linux/issues/90 [1] Signed-off-by: Yafang Shao <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jan Kara <[email protected]> Cc: Eric Biederman <[email protected]> Cc: Kees Cook <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Matus Jokay <[email protected]> Cc: Alejandro Colomar <[email protected]> Cc: "Serge E. Hallyn" <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Justin Stitt <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: Eric Paris <[email protected]> Cc: James Morris <[email protected]> Cc: Maarten Lankhorst <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Maxime Ripard <[email protected]> Cc: Ondrej Mosnacek <[email protected]> Cc: Paul Moore <[email protected]> Cc: Quentin Monnet <[email protected]> Cc: Simon Horman <[email protected]> Cc: Stephen Smalley <[email protected]> Cc: Thomas Zimmermann <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
7e019dcc |
| 09-Oct-2024 |
Mathieu Desnoyers <[email protected]> |
sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid") introduced a per-mm/cpu current concu
sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid") introduced a per-mm/cpu current concurrency id (mm_cid), which keeps a reference to the concurrency id allocated for each CPU. This reference expires shortly after a 100ms delay.
These per-CPU references keep the per-mm-cid data cache-local in situations where threads are running at least once on each CPU within each 100ms window, thus keeping the per-cpu reference alive.
However, intermittent workloads behaving in bursts spaced by more than 100ms on each CPU exhibit bad cache locality and degraded performance compared to purely per-cpu data indexing, because concurrency IDs are allocated over various CPUs and cores, therefore losing cache locality of the associated data.
Introduce the following changes to improve per-mm-cid cache locality:
- Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep track of which mm_cid value was last used, and use it as a hint to attempt re-allocating the same concurrency ID the next time this mm/cpu needs to allocate a concurrency ID,
- Add a per-mm CPUs allowed mask, which keeps track of the union of CPUs allowed for all threads belonging to this mm. This cpumask is only set during the lifetime of the mm, never cleared, so it represents the union of all the CPUs allowed since the beginning of the mm lifetime (note that the mm_cpumask() is really arch-specific and tailored to the TLB flush needs, and is thus _not_ a viable approach for this),
- Add a per-mm nr_cpus_allowed to keep track of the weight of the per-mm CPUs allowed mask (for fast access),
- Add a per-mm max_nr_cid to keep track of the highest number of concurrency IDs allocated for the mm. This is used for expanding the concurrency ID allocation within the upper bound defined by:
min(mm->nr_cpus_allowed, mm->mm_users)
When the next unused CID value reaches this threshold, stop trying to expand the cid allocation and use the first available cid value instead.
Spreading allocation to use all the cid values within the range
[ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]
improves cache locality while preserving mm_cid compactness within the expected user limits,
- In __mm_cid_try_get, only return cid values within the range [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This prevents allocating cids above the number of allowed cpus in rare scenarios where cid allocation races with a concurrent remote-clear of the per-mm/cpu cid. This improvement is made possible by the addition of the per-mm CPUs allowed mask,
- In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than t->nr_cpus_allowed. This criterion was really meant to compare the number of mm->mm_users to the number of CPUs allowed for the entire mm. Therefore, the prior comparison worked fine when all threads shared the same CPUs allowed mask, but not so much in scenarios where those threads have different masks (e.g. each thread pinned to a single CPU). This improvement is made possible by the addition of the per-mm CPUs allowed mask.
* Benchmarks
Each thread increments 16kB worth of 8-bit integers in bursts, with a configurable delay between each thread's execution. Each thread run one after the other (no threads run concurrently). The order of thread execution in the sequence is random. The thread execution sequence begins again after all threads have executed. The 16kB areas are allocated with rseq_mempool and indexed by either cpu_id, mm_cid (not cache-local), or cache-local mm_cid. Each thread is pinned to its own core.
Testing configurations:
8-core/1-L3: Use 8 cores within a single L3 24-core/24-L3: Use 24 cores, 1 core per L3 192-core/24-L3: Use 192 cores (all cores in the system) 384-thread/24-L3: Use 384 HW threads (all HW threads in the system)
Intermittent workload delays between threads: 200ms, 10ms.
Hardware:
CPU(s): 384 On-line CPU(s) list: 0-383 Vendor ID: AuthenticAMD Model name: AMD EPYC 9654 96-Core Processor Thread(s) per core: 2 Core(s) per socket: 96 Socket(s): 2 Caches (sum of all): L1d: 6 MiB (192 instances) L1i: 6 MiB (192 instances) L2: 192 MiB (192 instances) L3: 768 MiB (24 instances)
Each result is an average of 5 test runs. The cache-local speedup is calculated as: (cache-local mm_cid) / (mm_cid).
Intermittent workload delay: 200ms
per-cpu mm_cid cache-local mm_cid cache-local speedup (ns) (ns) (ns) 8-core/1-L3 1374 19289 1336 14.4x 24-core/24-L3 2423 26721 1594 16.7x 192-core/24-L3 2291 15826 2153 7.3x 384-thread/24-L3 1874 13234 1907 6.9x
Intermittent workload delay: 10ms
per-cpu mm_cid cache-local mm_cid cache-local speedup (ns) (ns) (ns) 8-core/1-L3 662 756 686 1.1x 24-core/24-L3 1378 3648 1035 3.5x 192-core/24-L3 1439 10833 1482 7.3x 384-thread/24-L3 1503 10570 1556 6.8x
[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs" patch series with a simpler and more general approach. ]
[ This patch applies on top of v6.12-rc1. ]
Signed-off-by: Mathieu Desnoyers <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Marco Elver <[email protected]> Link: https://lore.kernel.org/lkml/[email protected]/
show more ...
|
|
Revision tags: v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7 |
|
| #
f31b2569 |
| 07-Sep-2024 |
Helge Deller <[email protected]> |
parisc: Fix stack start for ADDR_NO_RANDOMIZE personality
Fix the stack start address calculation for the parisc architecture in setup_arg_pages() when address randomization is disabled. When the AD
parisc: Fix stack start for ADDR_NO_RANDOMIZE personality
Fix the stack start address calculation for the parisc architecture in setup_arg_pages() when address randomization is disabled. When the ADDR_NO_RANDOMIZE process personality is disabled there is no need to add additional space for the stack. Note that this patch touches code inside an #ifdef CONFIG_STACK_GROWSUP hunk, which is why only the parisc architecture is affected since it's the only Linux architecture where the stack grows upwards.
Without this patch you will find the stack in the middle of some mapped libaries and suddenly limited to 6MB instead of 8MB:
root@parisc:~# setarch -R /bin/bash -c "cat /proc/self/maps" 00010000-00019000 r-xp 00000000 08:05 1182034 /usr/bin/cat 00019000-0001a000 rwxp 00009000 08:05 1182034 /usr/bin/cat 0001a000-0003b000 rwxp 00000000 00:00 0 [heap] f90c4000-f9283000 r-xp 00000000 08:05 1573004 /usr/lib/hppa-linux-gnu/libc.so.6 f9283000-f9285000 r--p 001bf000 08:05 1573004 /usr/lib/hppa-linux-gnu/libc.so.6 f9285000-f928a000 rwxp 001c1000 08:05 1573004 /usr/lib/hppa-linux-gnu/libc.so.6 f928a000-f9294000 rwxp 00000000 00:00 0 f9301000-f9323000 rwxp 00000000 00:00 0 [stack] f98b4000-f98e4000 r-xp 00000000 08:05 1572869 /usr/lib/hppa-linux-gnu/ld.so.1 f98e4000-f98e5000 r--p 00030000 08:05 1572869 /usr/lib/hppa-linux-gnu/ld.so.1 f98e5000-f98e9000 rwxp 00031000 08:05 1572869 /usr/lib/hppa-linux-gnu/ld.so.1 f9ad8000-f9b00000 rw-p 00000000 00:00 0 f9b00000-f9b01000 r-xp 00000000 00:00 0 [vdso]
With the patch the stack gets correctly mapped at the end of the process memory map:
root@panama:~# setarch -R /bin/bash -c "cat /proc/self/maps" 00010000-00019000 r-xp 00000000 08:13 16385582 /usr/bin/cat 00019000-0001a000 rwxp 00009000 08:13 16385582 /usr/bin/cat 0001a000-0003b000 rwxp 00000000 00:00 0 [heap] fef29000-ff0eb000 r-xp 00000000 08:13 16122400 /usr/lib/hppa-linux-gnu/libc.so.6 ff0eb000-ff0ed000 r--p 001c2000 08:13 16122400 /usr/lib/hppa-linux-gnu/libc.so.6 ff0ed000-ff0f2000 rwxp 001c4000 08:13 16122400 /usr/lib/hppa-linux-gnu/libc.so.6 ff0f2000-ff0fc000 rwxp 00000000 00:00 0 ff4b4000-ff4e4000 r-xp 00000000 08:13 16121913 /usr/lib/hppa-linux-gnu/ld.so.1 ff4e4000-ff4e6000 r--p 00030000 08:13 16121913 /usr/lib/hppa-linux-gnu/ld.so.1 ff4e6000-ff4ea000 rwxp 00032000 08:13 16121913 /usr/lib/hppa-linux-gnu/ld.so.1 ff6d7000-ff6ff000 rw-p 00000000 00:00 0 ff6ff000-ff700000 r-xp 00000000 00:00 0 [vdso] ff700000-ff722000 rwxp 00000000 00:00 0 [stack]
Reported-by: Camm Maguire <[email protected]> Signed-off-by: Helge Deller <[email protected]> Fixes: d045c77c1a69 ("parisc,metag: Fix crashes due to stack randomization on stack-grows-upwards architectures") Fixes: 17d9822d4b4c ("parisc: Consider stack randomization for mmap base only when necessary") Cc: [email protected] # v5.2+
show more ...
|
|
Revision tags: v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2 |
|
| #
d61f0d59 |
| 29-Jul-2024 |
Lorenzo Stoakes <[email protected]> |
mm: move vma_shrink(), vma_expand() to internal header
The vma_shrink() and vma_expand() functions are internal VMA manipulation functions which we ought to abstract for use outside of memory manage
mm: move vma_shrink(), vma_expand() to internal header
The vma_shrink() and vma_expand() functions are internal VMA manipulation functions which we ought to abstract for use outside of memory management code.
To achieve this, we replace shift_arg_pages() in fs/exec.c with an invocation of a new relocate_vma_down() function implemented in mm/mmap.c, which enables us to also move move_page_tables() and vma_iter_prev_range() to internal.h.
The purpose of doing this is to isolate key VMA manipulation functions in order that we can both abstract them and later render them easily testable.
Link: https://lkml.kernel.org/r/3cfcd9ec433e032a85f636fdc0d7d98fafbd19c5.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Gow <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Rae Moar <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Pengfei Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
0d196e75 |
| 05-Aug-2024 |
Mateusz Guzik <[email protected]> |
exec: don't WARN for racy path_noexec check
Both i_mode and noexec checks wrapped in WARN_ON stem from an artifact of the previous implementation. They used to legitimately check for the condition,
exec: don't WARN for racy path_noexec check
Both i_mode and noexec checks wrapped in WARN_ON stem from an artifact of the previous implementation. They used to legitimately check for the condition, but that got moved up in two commits: 633fb6ac3980 ("exec: move S_ISREG() check earlier") 0fd338b2d2cd ("exec: move path_noexec() check earlier")
Instead of being removed said checks are WARN_ON'ed instead, which has some debug value.
However, the spurious path_noexec check is racy, resulting in unwarranted warnings should someone race with setting the noexec flag.
One can note there is more to perm-checking whether execve is allowed and none of the conditions are guaranteed to still hold after they were tested for.
Additionally this does not validate whether the code path did any perm checking to begin with -- it will pass if the inode happens to be regular.
Keep the redundant path_noexec() check even though it's mindless nonsense checking for guarantee that isn't given so drop the WARN.
Reword the commentary and do small tidy ups while here.
Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] [brauner: keep redundant path_noexec() check] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
f50733b4 |
| 08-Aug-2024 |
Kees Cook <[email protected]> |
exec: Fix ToCToU between perm check and set-uid/gid usage
When opening a file for exec via do_filp_open(), permission checking is done against the file's metadata at that moment, and on success, a f
exec: Fix ToCToU between perm check and set-uid/gid usage
When opening a file for exec via do_filp_open(), permission checking is done against the file's metadata at that moment, and on success, a file pointer is passed back. Much later in the execve() code path, the file metadata (specifically mode, uid, and gid) is used to determine if/how to set the uid and gid. However, those values may have changed since the permissions check, meaning the execution may gain unintended privileges.
For example, if a file could change permissions from executable and not set-id:
---------x 1 root root 16048 Aug 7 13:16 target
to set-id and non-executable:
---S------ 1 root root 16048 Aug 7 13:16 target
it is possible to gain root privileges when execution should have been disallowed.
While this race condition is rare in real-world scenarios, it has been observed (and proven exploitable) when package managers are updating the setuid bits of installed programs. Such files start with being world-executable but then are adjusted to be group-exec with a set-uid bit. For example, "chmod o-x,u+s target" makes "target" executable only by uid "root" and gid "cdrom", while also becoming setuid-root:
-rwxr-xr-x 1 root cdrom 16048 Aug 7 13:16 target
becomes:
-rwsr-xr-- 1 root cdrom 16048 Aug 7 13:16 target
But racing the chmod means users without group "cdrom" membership can get the permission to execute "target" just before the chmod, and when the chmod finishes, the exec reaches brpm_fill_uid(), and performs the setuid to root, violating the expressed authorization of "only cdrom group members can setuid to root".
Re-check that we still have execute permissions in case the metadata has changed. It would be better to keep a copy from the perm-check time, but until we can do that refactoring, the least-bad option is to do a full inode_permission() call (under inode lock). It is understood that this is safe against dead-locks, but hardly optimal.
Reported-by: Marco Vanotti <[email protected]> Tested-by: Marco Vanotti <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Cc: [email protected] Cc: Eric Biederman <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Signed-off-by: Kees Cook <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc1 |
|
| #
78eb4ea2 |
| 24-Jul-2024 |
Joel Granados <[email protected]> |
sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ct
sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified.
This patch has been generated by the following coccinelle script:
``` virtual patch
@r1@ identifier ctl, write, buffer, lenp, ppos; identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)"; @@
int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos);
@r2@ identifier func, ctl, write, buffer, lenp, ppos; @@
int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos) { ... }
@r3@ identifier func; @@
int func( - struct ctl_table * + const struct ctl_table * ,int , void *, size_t *, loff_t *);
@r4@ identifier func, ctl; @@
int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int , void *, size_t *, loff_t *);
@r5@ identifier func, write, buffer, lenp, ppos; @@
int func( - struct ctl_table * + const struct ctl_table * ,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration.
Co-developed-by: Thomas Weißschuh <[email protected]> Signed-off-by: Thomas Weißschuh <[email protected]> Co-developed-by: Joel Granados <[email protected]> Signed-off-by: Joel Granados <[email protected]>
show more ...
|
| #
b6f5ee4d |
| 20-Jul-2024 |
Kees Cook <[email protected]> |
execve: Move KUnit tests to tests/ subdirectory
Move the exec KUnit tests into a separate directory to avoid polluting the local directory namespace. Additionally update MAINTAINERS for the new file
execve: Move KUnit tests to tests/ subdirectory
Move the exec KUnit tests into a separate directory to avoid polluting the local directory namespace. Additionally update MAINTAINERS for the new files.
Reviewed-by: David Gow <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Acked-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
show more ...
|
|
Revision tags: v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5 |
|
| #
21f93108 |
| 21-Jun-2024 |
Kees Cook <[email protected]> |
exec: Avoid pathological argc, envc, and bprm->p values
Make sure nothing goes wrong with the string counters or the bprm's belief about the stack pointer. Add checks and matching self-tests.
Take
exec: Avoid pathological argc, envc, and bprm->p values
Make sure nothing goes wrong with the string counters or the bprm's belief about the stack pointer. Add checks and matching self-tests.
Take special care for !CONFIG_MMU, since argmin is not exposed there.
For 32-bit validation, 32-bit UML was used: $ tools/testing/kunit/kunit.py run \ --make_options CROSS_COMPILE=i686-linux-gnu- \ --make_options SUBARCH=i386 \ exec
For !MMU validation, m68k was used: $ tools/testing/kunit/kunit.py run \ --arch m68k --make_option CROSS_COMPILE=m68k-linux-gnu- \ exec
Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
show more ...
|
| #
084ebf7c |
| 21-Jun-2024 |
Kees Cook <[email protected]> |
execve: Keep bprm->argmin behind CONFIG_MMU
When argmin was added in commit 655c16a8ce9c ("exec: separate MM_ANONPAGES and RLIMIT_STACK accounting"), it was intended only for validating stack limits
execve: Keep bprm->argmin behind CONFIG_MMU
When argmin was added in commit 655c16a8ce9c ("exec: separate MM_ANONPAGES and RLIMIT_STACK accounting"), it was intended only for validating stack limits on CONFIG_MMU[1]. All checking for reaching the limit (argmin) is wrapped in CONFIG_MMU ifdef checks, though setting argmin was not. That argmin is only supposed to be used under CONFIG_MMU was rediscovered recently[2], and I don't want to trip over this again.
Move argmin's declaration into the existing CONFIG_MMU area, and add helpers functions so the MMU tests can be consolidated.
Link: https://lore.kernel.org/all/[email protected] [1] Link: https://lore.kernel.org/all/202406211253.7037F69@keescook/ [2] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1 |
|
| #
60371f43 |
| 20-May-2024 |
Kees Cook <[email protected]> |
exec: Add KUnit test for bprm_stack_limits()
Since bprm_stack_limits() operates with very limited side-effects, add it as the first exec.c KUnit test. Add to Kconfig and adjust MAINTAINERS file to i
exec: Add KUnit test for bprm_stack_limits()
Since bprm_stack_limits() operates with very limited side-effects, add it as the first exec.c KUnit test. Add to Kconfig and adjust MAINTAINERS file to include it.
Tested on 64-bit UML: $ tools/testing/kunit/kunit.py run exec
Link: https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Kees Cook <[email protected]>
show more ...
|
| #
2a010c41 |
| 31-May-2024 |
Christian Brauner <[email protected]> |
fs: don't block i_writecount during exec
Back in 2021 we already discussed removing deny_write_access() for executables. Back then I was hesistant because I thought that this might cause issues in u
fs: don't block i_writecount during exec
Back in 2021 we already discussed removing deny_write_access() for executables. Back then I was hesistant because I thought that this might cause issues in userspace. But even back then I had started taking some notes on what could potentially depend on this and I didn't come up with a lot so I've changed my mind and I would like to try this.
Here are some of the notes that I took:
(1) The deny_write_access() mechanism is causing really pointless issues such as [1]. If a thread in a thread-group opens a file writable, then writes some stuff, then closing the file descriptor and then calling execve() they can fail the execve() with ETXTBUSY because another thread in the thread-group could have concurrently called fork(). Multi-threaded libraries such as go suffer from this.
(2) There are userspace attacks that rely on overwriting the binary of a running process. These attacks are _mitigated_ but _not at all prevented_ from ocurring by the deny_write_access() mechanism.
I'll go over some details. The clearest example of such attacks was the attack against runC in CVE-2019-5736 (cf. [3]).
An attack could compromise the runC host binary from inside a _privileged_ runC container. The malicious binary could then be used to take over the host.
(It is crucial to note that this attack is _not_ possible with unprivileged containers. IOW, the setup here is already insecure.)
The attack can be made when attaching to a running container or when starting a container running a specially crafted image. For example, when runC attaches to a container the attacker can trick it into executing itself.
This could be done by replacing the target binary inside the container with a custom binary pointing back at the runC binary itself. As an example, if the target binary was /bin/bash, this could be replaced with an executable script specifying the interpreter path #!/proc/self/exe.
As such when /bin/bash is executed inside the container, instead the target of /proc/self/exe will be executed. That magic link will point to the runc binary on the host. The attacker can then proceed to write to the target of /proc/self/exe to try and overwrite the runC binary on the host.
However, this will not succeed because of deny_write_access(). Now, one might think that this would prevent the attack but it doesn't.
To overcome this, the attacker has multiple ways: * Open a file descriptor to /proc/self/exe using the O_PATH flag and then proceed to reopen the binary as O_WRONLY through /proc/self/fd/<nr> and try to write to it in a busy loop from a separate process. Ultimately it will succeed when the runC binary exits. After this the runC binary is compromised and can be used to attack other containers or the host itself. * Use a malicious shared library annotating a function in there with the constructor attribute making the malicious function run as an initializor. The malicious library will then open /proc/self/exe for creating a new entry under /proc/self/fd/<nr>. It'll then call exec to a) force runC to exit and b) hand the file descriptor off to a program that then reopens /proc/self/fd/<nr> for writing (which is now possible because runC has exited) and overwriting that binary.
To sum up: the deny_write_access() mechanism doesn't prevent such attacks in insecure setups. It just makes them minimally harder. That's all.
The only way back then to prevent this is to create a temporary copy of the calling binary itself when it starts or attaches to containers. So what I did back then for LXC (and Aleksa for runC) was to create an anonymous, in-memory file using the memfd_create() system call and to copy itself into the temporary in-memory file, which is then sealed to prevent further modifications. This sealed, in-memory file copy is then executed instead of the original on-disk binary.
Any compromising write operations from a privileged container to the host binary will then write to the temporary in-memory binary and not to the host binary on-disk, preserving the integrity of the host binary. Also as the temporary, in-memory binary is sealed, writes to this will also fail.
The point is that deny_write_access() is uselss to prevent these attacks.
(3) Denying write access to an inode because it's currently used in an exec path could easily be done on an LSM level. It might need an additional hook but that should be about it.
(4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time ago so while we do protect the main executable the bigger portion of the things you'd think need protecting such as the shared libraries aren't. IOW, we let anyone happily overwrite shared libraries.
(5) We removed all remaining uses of VM_DENYWRITE in [2]. That means: (5.1) We removed the legacy uselib() protection for preventing overwriting of shared libraries. Nobody cared in 3 years. (5.2) We allow write access to the elf interpreter after exec completed treating it on a par with shared libraries.
Yes, someone in userspace could potentially be relying on this. It's not completely out of the realm of possibility but let's find out if that's actually the case and not guess.
Link: https://github.com/golang/go/issues/22315 [1] Link: 49624efa65ac ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2] Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3] Link: https://lwn.net/Articles/866493 Link: https://github.com/golang/go/issues/22220 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61 Link: https://github.com/buildkite/agent/pull/2736 Link: https://github.com/rust-lang/rust/issues/114554 Link: https://bugs.openjdk.org/browse/JDK-8068370 Link: https://github.com/dotnet/runtime/issues/58964 Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2 |
|
| #
3a9e567c |
| 28-Mar-2024 |
Jinjiang Tu <[email protected]> |
mm/ksm: fix ksm exec support for prctl
Patch series "mm/ksm: fix ksm exec support for prctl", v4.
commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits MMF_VM_MERGE_ANY flag when a t
mm/ksm: fix ksm exec support for prctl
Patch series "mm/ksm: fix ksm exec support for prctl", v4.
commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits MMF_VM_MERGE_ANY flag when a task calls execve(). However, it doesn't create the mm_slot, so ksmd will not try to scan this task. The first patch fixes the issue.
The second patch refactors to prepare for the third patch. The third patch extends the selftests of ksm to verfity the deduplication really happens after fork/exec inherits ths KSM setting.
This patch (of 3):
commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits MMF_VM_MERGE_ANY flag when a task calls execve(). Howerver, it doesn't create the mm_slot, so ksmd will not try to scan this task.
To fix it, allocate and add the mm_slot to ksm_mm_head in __bprm_mm_init() when the mm has MMF_VM_MERGE_ANY flag.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") Signed-off-by: Jinjiang Tu <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Nanyong Sun <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Stefan Roesch <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|