History log of /linux-6.15/fs/exec.c (Results 1 – 25 of 634)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1
# 169eae77 27-Mar-2025 Mathieu Desnoyers <[email protected]>

rseq: Eliminate useless task_work on execve

Eliminate a useless task_work on execve by moving the call to
rseq_set_notify_resume() from sched_mm_cid_after_execve() to the error
path of bprm_execve()

rseq: Eliminate useless task_work on execve

Eliminate a useless task_work on execve by moving the call to
rseq_set_notify_resume() from sched_mm_cid_after_execve() to the error
path of bprm_execve().

The call to rseq_set_notify_resume() from sched_mm_cid_after_execve() is
pointless in the success case, because rseq_execve() will clear the rseq
pointer before returning to userspace.

sched_mm_cid_after_execve() is called from both the success and error
paths of bprm_execve(). The call to rseq_set_notify_resume() is needed
on error because the mm_cid may have changed.

Also move the rseq_execve() to right after sched_mm_cid_after_execve()
in bprm_execve().

[ mingo: Merged to a recent upstream kernel, extended the changelog. ]

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# af7bb0d2 24-Mar-2025 Oleg Nesterov <[email protected]>

exec: fix the racy usage of fs_struct->in_exec

check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve()
paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it
f

exec: fix the racy usage of fs_struct->in_exec

check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve()
paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it
fails we have the following race:

T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex

T2 sets fs->in_exec = 1

T1 clears fs->in_exec

T2 continues with fs->in_exec == 0

Change fs/exec.c to clear fs->in_exec with cred_guard_mutex held.

Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]/
Cc: [email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5
# cc9554e6 23-Feb-2025 Yonatan Goldschmidt <[email protected]>

binfmt: Remove loader from linux_binprm struct

Commit 987f20a9dcce ("a.out: Remove the a.out implementation") removed
the last in-tree user of the loader field, and as far as I can tell, it
was the

binfmt: Remove loader from linux_binprm struct

Commit 987f20a9dcce ("a.out: Remove the a.out implementation") removed
the last in-tree user of the loader field, and as far as I can tell, it
was the only one historically.

Signed-off-by: Yonatan Goldschmidt <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>

show more ...


Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1
# 1751f872 28-Jan-2025 Joel Granados <[email protected]>

treewide: const qualify ctl_tables where applicable

Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysc

treewide: const qualify ctl_tables where applicable

Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.

Created this by running an spatch followed by a sed command:
Spatch:
virtual patch

@
depends on !(file in "net")
disable optional_qualifier
@

identifier table_name != {
watchdog_hardlockup_sysctl,
iwcm_ctl_table,
ucma_ctl_table,
memory_allocation_profiling_sysctls,
loadpin_sysctl_table
};
@@

+ const
struct ctl_table table_name [] = { ... };

sed:
sed --in-place \
-e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
kernel/utsname_sysctl.c

Reviewed-by: Song Liu <[email protected]>
Acked-by: Steven Rostedt (Google) <[email protected]> # for kernel/trace/
Reviewed-by: Martin K. Petersen <[email protected]> # SCSI
Reviewed-by: Darrick J. Wong <[email protected]> # xfs
Acked-by: Jani Nikula <[email protected]>
Acked-by: Corey Minyard <[email protected]>
Acked-by: Wei Liu <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Reviewed-by: Bill O'Donnell <[email protected]>
Acked-by: Baoquan He <[email protected]>
Acked-by: Ashutosh Dixit <[email protected]>
Acked-by: Anna Schumaker <[email protected]>
Signed-off-by: Joel Granados <[email protected]>

show more ...


Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2
# 7a571499 03-Dec-2024 Lorenzo Stoakes <[email protected]>

mm: abstract get_arg_page() stack expansion and mmap read lock

Right now fs/exec.c invokes expand_downwards(), an otherwise internal
implementation detail of the VMA logic in order to ensure that an

mm: abstract get_arg_page() stack expansion and mmap read lock

Right now fs/exec.c invokes expand_downwards(), an otherwise internal
implementation detail of the VMA logic in order to ensure that an arg page
can be obtained by get_user_pages_remote().

In order to be able to move the stack expansion logic into mm/vma.c to
make it available to userland testing we need to find an alternative
approach here.

We do so by providing the mmap_read_lock_maybe_expand() function which
also helpfully documents what get_arg_page() is doing here and adds an
additional check against VM_GROWSDOWN to make explicit that the stack
expansion logic is only invoked when the VMA is indeed a downward-growing
stack.

This allows expand_downwards() to become a static function.

Importantly, the VMA referenced by mmap_read_maybe_expand() must NOT be
currently user-visible in any way, that is place within an rmap or VMA
tree. It must be a newly allocated VMA.

This is the case when exec invokes this function.

Link: https://lkml.kernel.org/r/5295d1c70c58e6aa63d14be68d4e1de9fa1c8e6d.1733248985.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# a5874fde 12-Dec-2024 Mickaël Salaün <[email protected]>

exec: Add a new AT_EXECVE_CHECK flag to execveat(2)

Add a new AT_EXECVE_CHECK flag to execveat(2) to check if a file would
be allowed for execution. The main use case is for script interpreters
and

exec: Add a new AT_EXECVE_CHECK flag to execveat(2)

Add a new AT_EXECVE_CHECK flag to execveat(2) to check if a file would
be allowed for execution. The main use case is for script interpreters
and dynamic linkers to check execution permission according to the
kernel's security policy. Another use case is to add context to access
logs e.g., which script (instead of interpreter) accessed a file. As
any executable code, scripts could also use this check [1].

This is different from faccessat(2) + X_OK which only checks a subset of
access rights (i.e. inode permission and mount options for regular
files), but not the full context (e.g. all LSM access checks). The main
use case for access(2) is for SUID processes to (partially) check access
on behalf of their caller. The main use case for execveat(2) +
AT_EXECVE_CHECK is to check if a script execution would be allowed,
according to all the different restrictions in place. Because the use
of AT_EXECVE_CHECK follows the exact kernel semantic as for a real
execution, user space gets the same error codes.

An interesting point of using execveat(2) instead of openat2(2) is that
it decouples the check from the enforcement. Indeed, the security check
can be logged (e.g. with audit) without blocking an execution
environment not yet ready to enforce a strict security policy.

LSMs can control or log execution requests with
security_bprm_creds_for_exec(). However, to enforce a consistent and
complete access control (e.g. on binary's dependencies) LSMs should
restrict file executability, or measure executed files, with
security_file_open() by checking file->f_flags & __FMODE_EXEC.

Because AT_EXECVE_CHECK is dedicated to user space interpreters, it
doesn't make sense for the kernel to parse the checked files, look for
interpreters known to the kernel (e.g. ELF, shebang), and return ENOEXEC
if the format is unknown. Because of that, security_bprm_check() is
never called when AT_EXECVE_CHECK is used.

It should be noted that script interpreters cannot directly use
execveat(2) (without this new AT_EXECVE_CHECK flag) because this could
lead to unexpected behaviors e.g., `python script.sh` could lead to Bash
being executed to interpret the script. Unlike the kernel, script
interpreters may just interpret the shebang as a simple comment, which
should not change for backward compatibility reasons.

Because scripts or libraries files might not currently have the
executable permission set, or because we might want specific users to be
allowed to run arbitrary scripts, the following patch provides a dynamic
configuration mechanism with the SECBIT_EXEC_RESTRICT_FILE and
SECBIT_EXEC_DENY_INTERACTIVE securebits.

This is a redesign of the CLIP OS 4's O_MAYEXEC:
https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d6b684752e403b4e41b39f7004d88e561/1901_open_mayexec.patch
This patch has been used for more than a decade with customized script
interpreters. Some examples can be found here:
https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC

Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Kees Cook <[email protected]>
Acked-by: Paul Moore <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Reviewed-by: Jeff Xu <[email protected]>
Tested-by: Jeff Xu <[email protected]>
Link: https://docs.python.org/3/library/io.html#io.open_code [1]
Signed-off-by: Mickaël Salaün <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>

show more ...


Revision tags: v6.13-rc1
# 543841d1 21-Nov-2024 Kees Cook <[email protected]>

exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case

Zbigniew mentioned at Linux Plumber's that systemd is interested in
switching to execveat() for service execution, but can't, because

exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case

Zbigniew mentioned at Linux Plumber's that systemd is interested in
switching to execveat() for service execution, but can't, because the
contents of /proc/pid/comm are the file descriptor which was used,
instead of the path to the binary[1]. This makes the output of tools like
top and ps useless, especially in a world where most fds are opened
CLOEXEC so the number is truly meaningless.

When the filename passed in is empty (e.g. with AT_EMPTY_PATH), use the
dentry's filename for "comm" instead of using the useless numeral from
the synthetic fdpath construction. This way the actual exec machinery
is unchanged, but cosmetically the comm looks reasonable to admins
investigating things.

Instead of adding TASK_COMM_LEN more bytes to bprm, use one of the unused
flag bits to indicate that we need to set "comm" from the dentry.

Suggested-by: Zbigniew Jędrzejewski-Szmek <[email protected]>
Suggested-by: Tycho Andersen <[email protected]>
Suggested-by: Al Viro <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Link: https://github.com/uapi-group/kernel-features#set-comm-field-before-exec [1]
Reviewed-by: Aleksa Sarai <[email protected]>
Tested-by: Zbigniew Jędrzejewski-Szmek <[email protected]>
Signed-off-by: Kees Cook <[email protected]>

show more ...


# 3a3f61ce 30-Nov-2024 Kees Cook <[email protected]>

exec: Make sure task->comm is always NUL-terminated

Using strscpy() meant that the final character in task->comm may be
non-NUL for a moment before the "string too long" truncation happens.

Instead

exec: Make sure task->comm is always NUL-terminated

Using strscpy() meant that the final character in task->comm may be
non-NUL for a moment before the "string too long" truncation happens.

Instead of adding a new use of the ambiguous strncpy(), we'd want to
use memtostr_pad() which enforces being able to check at compile time
that sizes are sensible, but this requires being able to see string
buffer lengths. Instead of trying to inline __set_task_comm() (which
needs to call trace and perf functions), just open-code it. But to
make sure we're always safe, add compile-time checking like we already
do for get_task_comm().

Suggested-by: Linus Torvalds <[email protected]>
Suggested-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Kees Cook <[email protected]>

show more ...


# 0357ef03 28-Nov-2024 Amir Goldstein <[email protected]>

fs: don't block write during exec on pre-content watched files

Commit 2a010c412853 ("fs: don't block i_writecount during exec") removed
the legacy behavior of getting ETXTBSY on attempt to open and

fs: don't block write during exec on pre-content watched files

Commit 2a010c412853 ("fs: don't block i_writecount during exec") removed
the legacy behavior of getting ETXTBSY on attempt to open and executable
file for write while it is being executed.

This commit was reverted because an application that depends on this
legacy behavior was broken by the change.

We need to allow HSM writing into executable files while executed to
fill their content on-the-fly.

To that end, disable the ETXTBSY legacy behavior for files that are
watched by pre-content events.

This change is not expected to cause regressions with existing systems
which do not have any pre-content event listeners.

Signed-off-by: Amir Goldstein <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Link: https://patch.msgid.link/[email protected]

show more ...


Revision tags: v6.12
# fa1bdca9 16-Nov-2024 Nir Lichtman <[email protected]>

exec: remove legacy custom binfmt modules autoloading

Problem: The search binary handler logic contains legacy code
to handle automatically loading kernel modules of unsupported
binary formats.
This

exec: remove legacy custom binfmt modules autoloading

Problem: The search binary handler logic contains legacy code
to handle automatically loading kernel modules of unsupported
binary formats.
This logic is a leftover from a.out-to-ELF transition.
After removal of a.out support, this code has no use anymore.

Solution: Clean up this code from the search binary handler,
also remove the line initialising retval to -ENOENT and instead
just return -ENOEXEC if the flow has reached the end of the func.

Note: Anyone who might find future uses for this legacy code
would be better off using binfmt_misc to trigger whatever
module loading they might need - would be more flexible that way.

Suggested-by: Alexander Viro <[email protected]>
Signed-off-by: Nir Lichtman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>

show more ...


Revision tags: v6.12-rc7, v6.12-rc6
# 4188fc31 02-Nov-2024 [email protected] <[email protected]>

exec: move warning of null argv to be next to the relevant code

Problem: The warning is currently printed where it is detected that the
arg count is zero but the action is only taken place later in

exec: move warning of null argv to be next to the relevant code

Problem: The warning is currently printed where it is detected that the
arg count is zero but the action is only taken place later in the flow
even though the warning is written as if the action is taken place in
the time of print

This could be problematic since there could be a failure between the
print and the code that takes action which would deem this warning
misleading

Solution: Move the warning print after the action of adding an empty
string as the first argument is successful

Signed-off-by: Nir Lichtman <[email protected]>
Link: https://lore.kernel.org/r/ZyYUgiPc8A8i_3FH@nirs-laptop.
Signed-off-by: Kees Cook <[email protected]>

show more ...


# 3b832035 27-Nov-2024 Christian Brauner <[email protected]>

Revert "fs: don't block i_writecount during exec"

This reverts commit 2a010c41285345da60cece35575b4e0af7e7bf44.

Rui Ueyama <[email protected]> writes:

> I'm the creator and the maintainer of the mo

Revert "fs: don't block i_writecount during exec"

This reverts commit 2a010c41285345da60cece35575b4e0af7e7bf44.

Rui Ueyama <[email protected]> writes:

> I'm the creator and the maintainer of the mold linker
> (https://github.com/rui314/mold). Recently, we discovered that mold
> started causing process crashes in certain situations due to a change
> in the Linux kernel. Here are the details:
>
> - In general, overwriting an existing file is much faster than
> creating an empty file and writing to it on Linux, so mold attempts to
> reuse an existing executable file if it exists.
>
> - If a program is running, opening the executable file for writing
> previously failed with ETXTBSY. If that happens, mold falls back to
> creating a new file.
>
> - However, the Linux kernel recently changed the behavior so that
> writing to an executable file is now always permitted
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a010c412853).
>
> That caused mold to write to an executable file even if there's a
> process running that file. Since changes to mmap'ed files are
> immediately visible to other processes, any processes running that
> file would almost certainly crash in a very mysterious way.
> Identifying the cause of these random crashes took us a few days.
>
> Rejecting writes to an executable file that is currently running is a
> well-known behavior, and Linux had operated that way for a very long
> time. So, I don’t believe relying on this behavior was our mistake;
> rather, I see this as a regression in the Linux kernel.

Quoting myself from commit 2a010c412853 ("fs: don't block i_writecount during exec")

> Yes, someone in userspace could potentially be relying on this. It's not
> completely out of the realm of possibility but let's find out if that's
> actually the case and not guess.

It seems we found out that someone is relying on this obscure behavior.
So revert the change.

Link: https://github.com/rui314/mold/issues/1361
Link: https://lore.kernel.org/r/[email protected]
Cc: <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.12-rc5, v6.12-rc4, v6.12-rc3
# 4cc0473d 07-Oct-2024 Yafang Shao <[email protected]>

get rid of __get_task_comm()

Patch series "Improve the copy of task comm", v8.

Using {memcpy,strncpy,strcpy,kstrdup} to copy the task comm relies on the
length of task comm. Changes in the task co

get rid of __get_task_comm()

Patch series "Improve the copy of task comm", v8.

Using {memcpy,strncpy,strcpy,kstrdup} to copy the task comm relies on the
length of task comm. Changes in the task comm could result in a
destination string that is overflow. Therefore, we should explicitly
ensure the destination string is always NUL-terminated, regardless of the
task comm. This approach will facilitate future extensions to the task
comm.

As suggested by Linus [0], we can identify all relevant code with the
following git grep command:

git grep 'memcpy.*->comm\>'
git grep 'kstrdup.*->comm\>'
git grep 'strncpy.*->comm\>'
git grep 'strcpy.*->comm\>'

PATCH #2~#4: memcpy
PATCH #5~#6: kstrdup
PATCH #7: strcpy

Please note that strncpy() is not included in this series as it is being
tracked by another effort. [1]


This patch (of 7):

We want to eliminate the use of __get_task_comm() for the following
reasons:

- The task_lock() is unnecessary
Quoted from Linus [0]:
: Since user space can randomly change their names anyway, using locking
: was always wrong for readers (for writers it probably does make sense
: to have some lock - although practically speaking nobody cares there
: either, but at least for a writer some kind of race could have
: long-term mixed results

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wivfrF0_zvf+oj6==Sh=-npJooP8chLPEfaFV0oNYTTBA@mail.gmail.com [0]
Link: https://lore.kernel.org/all/CAHk-=whWtUC-AjmGJveAETKOMeMFSTwKwu99v7+b6AyHMmaDFA@mail.gmail.com/
Link: https://lore.kernel.org/all/CAHk-=wjAmmHUg6vho1KjzQi2=psR30+CogFd4aXrThr2gsiS4g@mail.gmail.com/ [0]
Link: https://github.com/KSPP/linux/issues/90 [1]
Signed-off-by: Yafang Shao <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Matus Jokay <[email protected]>
Cc: Alejandro Colomar <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Justin Stitt <[email protected]>
Cc: Steven Rostedt (Google) <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Eric Paris <[email protected]>
Cc: James Morris <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Maxime Ripard <[email protected]>
Cc: Ondrej Mosnacek <[email protected]>
Cc: Paul Moore <[email protected]>
Cc: Quentin Monnet <[email protected]>
Cc: Simon Horman <[email protected]>
Cc: Stephen Smalley <[email protected]>
Cc: Thomas Zimmermann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 7e019dcc 09-Oct-2024 Mathieu Desnoyers <[email protected]>

sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads

commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
introduced a per-mm/cpu current concu

sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads

commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
a reference to the concurrency id allocated for each CPU. This reference
expires shortly after a 100ms delay.

These per-CPU references keep the per-mm-cid data cache-local in
situations where threads are running at least once on each CPU within
each 100ms window, thus keeping the per-cpu reference alive.

However, intermittent workloads behaving in bursts spaced by more than
100ms on each CPU exhibit bad cache locality and degraded performance
compared to purely per-cpu data indexing, because concurrency IDs are
allocated over various CPUs and cores, therefore losing cache locality
of the associated data.

Introduce the following changes to improve per-mm-cid cache locality:

- Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
track of which mm_cid value was last used, and use it as a hint to
attempt re-allocating the same concurrency ID the next time this
mm/cpu needs to allocate a concurrency ID,

- Add a per-mm CPUs allowed mask, which keeps track of the union of
CPUs allowed for all threads belonging to this mm. This cpumask is
only set during the lifetime of the mm, never cleared, so it
represents the union of all the CPUs allowed since the beginning of
the mm lifetime (note that the mm_cpumask() is really arch-specific
and tailored to the TLB flush needs, and is thus _not_ a viable
approach for this),

- Add a per-mm nr_cpus_allowed to keep track of the weight of the
per-mm CPUs allowed mask (for fast access),

- Add a per-mm max_nr_cid to keep track of the highest number of
concurrency IDs allocated for the mm. This is used for expanding the
concurrency ID allocation within the upper bound defined by:

min(mm->nr_cpus_allowed, mm->mm_users)

When the next unused CID value reaches this threshold, stop trying
to expand the cid allocation and use the first available cid value
instead.

Spreading allocation to use all the cid values within the range

[ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]

improves cache locality while preserving mm_cid compactness within the
expected user limits,

- In __mm_cid_try_get, only return cid values within the range
[ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
prevents allocating cids above the number of allowed cpus in
rare scenarios where cid allocation races with a concurrent
remote-clear of the per-mm/cpu cid. This improvement is made
possible by the addition of the per-mm CPUs allowed mask,

- In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
t->nr_cpus_allowed. This criterion was really meant to compare
the number of mm->mm_users to the number of CPUs allowed for the
entire mm. Therefore, the prior comparison worked fine when all
threads shared the same CPUs allowed mask, but not so much in
scenarios where those threads have different masks (e.g. each
thread pinned to a single CPU). This improvement is made
possible by the addition of the per-mm CPUs allowed mask.

* Benchmarks

Each thread increments 16kB worth of 8-bit integers in bursts, with
a configurable delay between each thread's execution. Each thread run
one after the other (no threads run concurrently). The order of
thread execution in the sequence is random. The thread execution
sequence begins again after all threads have executed. The 16kB areas
are allocated with rseq_mempool and indexed by either cpu_id, mm_cid
(not cache-local), or cache-local mm_cid. Each thread is pinned to its
own core.

Testing configurations:

8-core/1-L3: Use 8 cores within a single L3
24-core/24-L3: Use 24 cores, 1 core per L3
192-core/24-L3: Use 192 cores (all cores in the system)
384-thread/24-L3: Use 384 HW threads (all HW threads in the system)

Intermittent workload delays between threads: 200ms, 10ms.

Hardware:

CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Caches (sum of all):
L1d: 6 MiB (192 instances)
L1i: 6 MiB (192 instances)
L2: 192 MiB (192 instances)
L3: 768 MiB (24 instances)

Each result is an average of 5 test runs. The cache-local speedup
is calculated as: (cache-local mm_cid) / (mm_cid).

Intermittent workload delay: 200ms

per-cpu mm_cid cache-local mm_cid cache-local speedup
(ns) (ns) (ns)
8-core/1-L3 1374 19289 1336 14.4x
24-core/24-L3 2423 26721 1594 16.7x
192-core/24-L3 2291 15826 2153 7.3x
384-thread/24-L3 1874 13234 1907 6.9x

Intermittent workload delay: 10ms

per-cpu mm_cid cache-local mm_cid cache-local speedup
(ns) (ns) (ns)
8-core/1-L3 662 756 686 1.1x
24-core/24-L3 1378 3648 1035 3.5x
192-core/24-L3 1439 10833 1482 7.3x
384-thread/24-L3 1503 10570 1556 6.8x

[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
patch series with a simpler and more general approach. ]

[ This patch applies on top of v6.12-rc1. ]

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Marco Elver <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/

show more ...


Revision tags: v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7
# f31b2569 07-Sep-2024 Helge Deller <[email protected]>

parisc: Fix stack start for ADDR_NO_RANDOMIZE personality

Fix the stack start address calculation for the parisc architecture in
setup_arg_pages() when address randomization is disabled. When the
AD

parisc: Fix stack start for ADDR_NO_RANDOMIZE personality

Fix the stack start address calculation for the parisc architecture in
setup_arg_pages() when address randomization is disabled. When the
ADDR_NO_RANDOMIZE process personality is disabled there is no need to add
additional space for the stack.
Note that this patch touches code inside an #ifdef CONFIG_STACK_GROWSUP hunk,
which is why only the parisc architecture is affected since it's the
only Linux architecture where the stack grows upwards.

Without this patch you will find the stack in the middle of some
mapped libaries and suddenly limited to 6MB instead of 8MB:

root@parisc:~# setarch -R /bin/bash -c "cat /proc/self/maps"
00010000-00019000 r-xp 00000000 08:05 1182034 /usr/bin/cat
00019000-0001a000 rwxp 00009000 08:05 1182034 /usr/bin/cat
0001a000-0003b000 rwxp 00000000 00:00 0 [heap]
f90c4000-f9283000 r-xp 00000000 08:05 1573004 /usr/lib/hppa-linux-gnu/libc.so.6
f9283000-f9285000 r--p 001bf000 08:05 1573004 /usr/lib/hppa-linux-gnu/libc.so.6
f9285000-f928a000 rwxp 001c1000 08:05 1573004 /usr/lib/hppa-linux-gnu/libc.so.6
f928a000-f9294000 rwxp 00000000 00:00 0
f9301000-f9323000 rwxp 00000000 00:00 0 [stack]
f98b4000-f98e4000 r-xp 00000000 08:05 1572869 /usr/lib/hppa-linux-gnu/ld.so.1
f98e4000-f98e5000 r--p 00030000 08:05 1572869 /usr/lib/hppa-linux-gnu/ld.so.1
f98e5000-f98e9000 rwxp 00031000 08:05 1572869 /usr/lib/hppa-linux-gnu/ld.so.1
f9ad8000-f9b00000 rw-p 00000000 00:00 0
f9b00000-f9b01000 r-xp 00000000 00:00 0 [vdso]

With the patch the stack gets correctly mapped at the end
of the process memory map:

root@panama:~# setarch -R /bin/bash -c "cat /proc/self/maps"
00010000-00019000 r-xp 00000000 08:13 16385582 /usr/bin/cat
00019000-0001a000 rwxp 00009000 08:13 16385582 /usr/bin/cat
0001a000-0003b000 rwxp 00000000 00:00 0 [heap]
fef29000-ff0eb000 r-xp 00000000 08:13 16122400 /usr/lib/hppa-linux-gnu/libc.so.6
ff0eb000-ff0ed000 r--p 001c2000 08:13 16122400 /usr/lib/hppa-linux-gnu/libc.so.6
ff0ed000-ff0f2000 rwxp 001c4000 08:13 16122400 /usr/lib/hppa-linux-gnu/libc.so.6
ff0f2000-ff0fc000 rwxp 00000000 00:00 0
ff4b4000-ff4e4000 r-xp 00000000 08:13 16121913 /usr/lib/hppa-linux-gnu/ld.so.1
ff4e4000-ff4e6000 r--p 00030000 08:13 16121913 /usr/lib/hppa-linux-gnu/ld.so.1
ff4e6000-ff4ea000 rwxp 00032000 08:13 16121913 /usr/lib/hppa-linux-gnu/ld.so.1
ff6d7000-ff6ff000 rw-p 00000000 00:00 0
ff6ff000-ff700000 r-xp 00000000 00:00 0 [vdso]
ff700000-ff722000 rwxp 00000000 00:00 0 [stack]

Reported-by: Camm Maguire <[email protected]>
Signed-off-by: Helge Deller <[email protected]>
Fixes: d045c77c1a69 ("parisc,metag: Fix crashes due to stack randomization on stack-grows-upwards architectures")
Fixes: 17d9822d4b4c ("parisc: Consider stack randomization for mmap base only when necessary")
Cc: [email protected] # v5.2+

show more ...


Revision tags: v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2
# d61f0d59 29-Jul-2024 Lorenzo Stoakes <[email protected]>

mm: move vma_shrink(), vma_expand() to internal header

The vma_shrink() and vma_expand() functions are internal VMA manipulation
functions which we ought to abstract for use outside of memory manage

mm: move vma_shrink(), vma_expand() to internal header

The vma_shrink() and vma_expand() functions are internal VMA manipulation
functions which we ought to abstract for use outside of memory management
code.

To achieve this, we replace shift_arg_pages() in fs/exec.c with an
invocation of a new relocate_vma_down() function implemented in mm/mmap.c,
which enables us to also move move_page_tables() and vma_iter_prev_range()
to internal.h.

The purpose of doing this is to isolate key VMA manipulation functions in
order that we can both abstract them and later render them easily
testable.

Link: https://lkml.kernel.org/r/3cfcd9ec433e032a85f636fdc0d7d98fafbd19c5.1722251717.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Brendan Higgins <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Gow <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Rae Moar <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Pengfei Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 0d196e75 05-Aug-2024 Mateusz Guzik <[email protected]>

exec: don't WARN for racy path_noexec check

Both i_mode and noexec checks wrapped in WARN_ON stem from an artifact
of the previous implementation. They used to legitimately check for the
condition,

exec: don't WARN for racy path_noexec check

Both i_mode and noexec checks wrapped in WARN_ON stem from an artifact
of the previous implementation. They used to legitimately check for the
condition, but that got moved up in two commits:
633fb6ac3980 ("exec: move S_ISREG() check earlier")
0fd338b2d2cd ("exec: move path_noexec() check earlier")

Instead of being removed said checks are WARN_ON'ed instead, which
has some debug value.

However, the spurious path_noexec check is racy, resulting in
unwarranted warnings should someone race with setting the noexec flag.

One can note there is more to perm-checking whether execve is allowed
and none of the conditions are guaranteed to still hold after they were
tested for.

Additionally this does not validate whether the code path did any perm
checking to begin with -- it will pass if the inode happens to be
regular.

Keep the redundant path_noexec() check even though it's mindless
nonsense checking for guarantee that isn't given so drop the WARN.

Reword the commentary and do small tidy ups while here.

Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[brauner: keep redundant path_noexec() check]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


# f50733b4 08-Aug-2024 Kees Cook <[email protected]>

exec: Fix ToCToU between perm check and set-uid/gid usage

When opening a file for exec via do_filp_open(), permission checking is
done against the file's metadata at that moment, and on success, a f

exec: Fix ToCToU between perm check and set-uid/gid usage

When opening a file for exec via do_filp_open(), permission checking is
done against the file's metadata at that moment, and on success, a file
pointer is passed back. Much later in the execve() code path, the file
metadata (specifically mode, uid, and gid) is used to determine if/how
to set the uid and gid. However, those values may have changed since the
permissions check, meaning the execution may gain unintended privileges.

For example, if a file could change permissions from executable and not
set-id:

---------x 1 root root 16048 Aug 7 13:16 target

to set-id and non-executable:

---S------ 1 root root 16048 Aug 7 13:16 target

it is possible to gain root privileges when execution should have been
disallowed.

While this race condition is rare in real-world scenarios, it has been
observed (and proven exploitable) when package managers are updating
the setuid bits of installed programs. Such files start with being
world-executable but then are adjusted to be group-exec with a set-uid
bit. For example, "chmod o-x,u+s target" makes "target" executable only
by uid "root" and gid "cdrom", while also becoming setuid-root:

-rwxr-xr-x 1 root cdrom 16048 Aug 7 13:16 target

becomes:

-rwsr-xr-- 1 root cdrom 16048 Aug 7 13:16 target

But racing the chmod means users without group "cdrom" membership can
get the permission to execute "target" just before the chmod, and when
the chmod finishes, the exec reaches brpm_fill_uid(), and performs the
setuid to root, violating the expressed authorization of "only cdrom
group members can setuid to root".

Re-check that we still have execute permissions in case the metadata
has changed. It would be better to keep a copy from the perm-check time,
but until we can do that refactoring, the least-bad option is to do a
full inode_permission() call (under inode lock). It is understood that
this is safe against dead-locks, but hardly optimal.

Reported-by: Marco Vanotti <[email protected]>
Tested-by: Marco Vanotti <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Eric Biederman <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Kees Cook <[email protected]>

show more ...


Revision tags: v6.11-rc1
# 78eb4ea2 24-Jul-2024 Joel Granados <[email protected]>

sysctl: treewide: constify the ctl_table argument of proc_handlers

const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ct

sysctl: treewide: constify the ctl_table argument of proc_handlers

const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.

This patch has been generated by the following coccinelle script:

```
virtual patch

@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos);

@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }

@r3@
identifier func;
@@

int func(
- struct ctl_table *
+ const struct ctl_table *
,int , void *, size_t *, loff_t *);

@r4@
identifier func, ctl;
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int , void *, size_t *, loff_t *);

@r5@
identifier func, write, buffer, lenp, ppos;
@@

int func(
- struct ctl_table *
+ const struct ctl_table *
,int write, void *buffer, size_t *lenp, loff_t *ppos);

```

* Code formatting was adjusted in xfs_sysctl.c to comply with code
conventions. The xfs_stats_clear_proc_handler,
xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
adjusted.

* The ctl_table argument in proc_watchdog_common was const qualified.
This is called from a proc_handler itself and is calling back into
another proc_handler, making it necessary to change it as part of the
proc_handler migration.

Co-developed-by: Thomas Weißschuh <[email protected]>
Signed-off-by: Thomas Weißschuh <[email protected]>
Co-developed-by: Joel Granados <[email protected]>
Signed-off-by: Joel Granados <[email protected]>

show more ...


# b6f5ee4d 20-Jul-2024 Kees Cook <[email protected]>

execve: Move KUnit tests to tests/ subdirectory

Move the exec KUnit tests into a separate directory to avoid polluting
the local directory namespace. Additionally update MAINTAINERS for the
new file

execve: Move KUnit tests to tests/ subdirectory

Move the exec KUnit tests into a separate directory to avoid polluting
the local directory namespace. Additionally update MAINTAINERS for the
new files.

Reviewed-by: David Gow <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>

show more ...


Revision tags: v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5
# 21f93108 21-Jun-2024 Kees Cook <[email protected]>

exec: Avoid pathological argc, envc, and bprm->p values

Make sure nothing goes wrong with the string counters or the bprm's
belief about the stack pointer. Add checks and matching self-tests.

Take

exec: Avoid pathological argc, envc, and bprm->p values

Make sure nothing goes wrong with the string counters or the bprm's
belief about the stack pointer. Add checks and matching self-tests.

Take special care for !CONFIG_MMU, since argmin is not exposed there.

For 32-bit validation, 32-bit UML was used:
$ tools/testing/kunit/kunit.py run \
--make_options CROSS_COMPILE=i686-linux-gnu- \
--make_options SUBARCH=i386 \
exec

For !MMU validation, m68k was used:
$ tools/testing/kunit/kunit.py run \
--arch m68k --make_option CROSS_COMPILE=m68k-linux-gnu- \
exec

Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>

show more ...


# 084ebf7c 21-Jun-2024 Kees Cook <[email protected]>

execve: Keep bprm->argmin behind CONFIG_MMU

When argmin was added in commit 655c16a8ce9c ("exec: separate
MM_ANONPAGES and RLIMIT_STACK accounting"), it was intended only for
validating stack limits

execve: Keep bprm->argmin behind CONFIG_MMU

When argmin was added in commit 655c16a8ce9c ("exec: separate
MM_ANONPAGES and RLIMIT_STACK accounting"), it was intended only for
validating stack limits on CONFIG_MMU[1]. All checking for reaching the
limit (argmin) is wrapped in CONFIG_MMU ifdef checks, though setting
argmin was not. That argmin is only supposed to be used under CONFIG_MMU
was rediscovered recently[2], and I don't want to trip over this again.

Move argmin's declaration into the existing CONFIG_MMU area, and add
helpers functions so the MMU tests can be consolidated.

Link: https://lore.kernel.org/all/[email protected] [1]
Link: https://lore.kernel.org/all/202406211253.7037F69@keescook/ [2]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>

show more ...


Revision tags: v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1
# 60371f43 20-May-2024 Kees Cook <[email protected]>

exec: Add KUnit test for bprm_stack_limits()

Since bprm_stack_limits() operates with very limited side-effects, add
it as the first exec.c KUnit test. Add to Kconfig and adjust MAINTAINERS
file to i

exec: Add KUnit test for bprm_stack_limits()

Since bprm_stack_limits() operates with very limited side-effects, add
it as the first exec.c KUnit test. Add to Kconfig and adjust MAINTAINERS
file to include it.

Tested on 64-bit UML:
$ tools/testing/kunit/kunit.py run exec

Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Kees Cook <[email protected]>

show more ...


# 2a010c41 31-May-2024 Christian Brauner <[email protected]>

fs: don't block i_writecount during exec

Back in 2021 we already discussed removing deny_write_access() for
executables. Back then I was hesistant because I thought that this might
cause issues in u

fs: don't block i_writecount during exec

Back in 2021 we already discussed removing deny_write_access() for
executables. Back then I was hesistant because I thought that this might
cause issues in userspace. But even back then I had started taking some
notes on what could potentially depend on this and I didn't come up with
a lot so I've changed my mind and I would like to try this.

Here are some of the notes that I took:

(1) The deny_write_access() mechanism is causing really pointless issues
such as [1]. If a thread in a thread-group opens a file writable,
then writes some stuff, then closing the file descriptor and then
calling execve() they can fail the execve() with ETXTBUSY because
another thread in the thread-group could have concurrently called
fork(). Multi-threaded libraries such as go suffer from this.

(2) There are userspace attacks that rely on overwriting the binary of a
running process. These attacks are _mitigated_ but _not at all
prevented_ from ocurring by the deny_write_access() mechanism.

I'll go over some details. The clearest example of such attacks was
the attack against runC in CVE-2019-5736 (cf. [3]).

An attack could compromise the runC host binary from inside a
_privileged_ runC container. The malicious binary could then be used
to take over the host.

(It is crucial to note that this attack is _not_ possible with
unprivileged containers. IOW, the setup here is already insecure.)

The attack can be made when attaching to a running container or when
starting a container running a specially crafted image. For example,
when runC attaches to a container the attacker can trick it into
executing itself.

This could be done by replacing the target binary inside the
container with a custom binary pointing back at the runC binary
itself. As an example, if the target binary was /bin/bash, this
could be replaced with an executable script specifying the
interpreter path #!/proc/self/exe.

As such when /bin/bash is executed inside the container, instead the
target of /proc/self/exe will be executed. That magic link will
point to the runc binary on the host. The attacker can then proceed
to write to the target of /proc/self/exe to try and overwrite the
runC binary on the host.

However, this will not succeed because of deny_write_access(). Now,
one might think that this would prevent the attack but it doesn't.

To overcome this, the attacker has multiple ways:
* Open a file descriptor to /proc/self/exe using the O_PATH flag and
then proceed to reopen the binary as O_WRONLY through
/proc/self/fd/<nr> and try to write to it in a busy loop from a
separate process. Ultimately it will succeed when the runC binary
exits. After this the runC binary is compromised and can be used
to attack other containers or the host itself.
* Use a malicious shared library annotating a function in there with
the constructor attribute making the malicious function run as an
initializor. The malicious library will then open /proc/self/exe
for creating a new entry under /proc/self/fd/<nr>. It'll then call
exec to a) force runC to exit and b) hand the file descriptor off
to a program that then reopens /proc/self/fd/<nr> for writing
(which is now possible because runC has exited) and overwriting
that binary.

To sum up: the deny_write_access() mechanism doesn't prevent such
attacks in insecure setups. It just makes them minimally harder.
That's all.

The only way back then to prevent this is to create a temporary copy
of the calling binary itself when it starts or attaches to
containers. So what I did back then for LXC (and Aleksa for runC)
was to create an anonymous, in-memory file using the memfd_create()
system call and to copy itself into the temporary in-memory file,
which is then sealed to prevent further modifications. This sealed,
in-memory file copy is then executed instead of the original on-disk
binary.

Any compromising write operations from a privileged container to the
host binary will then write to the temporary in-memory binary and
not to the host binary on-disk, preserving the integrity of the host
binary. Also as the temporary, in-memory binary is sealed, writes to
this will also fail.

The point is that deny_write_access() is uselss to prevent these
attacks.

(3) Denying write access to an inode because it's currently used in an
exec path could easily be done on an LSM level. It might need an
additional hook but that should be about it.

(4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time
ago so while we do protect the main executable the bigger portion of
the things you'd think need protecting such as the shared libraries
aren't. IOW, we let anyone happily overwrite shared libraries.

(5) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
(5.1) We removed the legacy uselib() protection for preventing
overwriting of shared libraries. Nobody cared in 3 years.
(5.2) We allow write access to the elf interpreter after exec
completed treating it on a par with shared libraries.

Yes, someone in userspace could potentially be relying on this. It's not
completely out of the realm of possibility but let's find out if that's
actually the case and not guess.

Link: https://github.com/golang/go/issues/22315 [1]
Link: 49624efa65ac ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
Link: https://lwn.net/Articles/866493
Link: https://github.com/golang/go/issues/22220
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61
Link: https://github.com/buildkite/agent/pull/2736
Link: https://github.com/rust-lang/rust/issues/114554
Link: https://bugs.openjdk.org/browse/JDK-8068370
Link: https://github.com/dotnet/runtime/issues/58964
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2
# 3a9e567c 28-Mar-2024 Jinjiang Tu <[email protected]>

mm/ksm: fix ksm exec support for prctl

Patch series "mm/ksm: fix ksm exec support for prctl", v4.

commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits
MMF_VM_MERGE_ANY flag when a t

mm/ksm: fix ksm exec support for prctl

Patch series "mm/ksm: fix ksm exec support for prctl", v4.

commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits
MMF_VM_MERGE_ANY flag when a task calls execve(). However, it doesn't
create the mm_slot, so ksmd will not try to scan this task. The first
patch fixes the issue.

The second patch refactors to prepare for the third patch. The third
patch extends the selftests of ksm to verfity the deduplication really
happens after fork/exec inherits ths KSM setting.


This patch (of 3):

commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits
MMF_VM_MERGE_ANY flag when a task calls execve(). Howerver, it doesn't
create the mm_slot, so ksmd will not try to scan this task.

To fix it, allocate and add the mm_slot to ksm_mm_head in __bprm_mm_init()
when the mm has MMF_VM_MERGE_ANY flag.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 3c6f33b7273a ("mm/ksm: support fork/exec for prctl")
Signed-off-by: Jinjiang Tu <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stefan Roesch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


12345678910>>...26