History log of /linux-6.15/kernel/exit.c (Results 1 – 25 of 630)
Revision    Date    Author    Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1
# 9133607d 24-Mar-2025 Oleg Nesterov <[email protected]>

exit: fix the usage of delay_group_leader->exit_code in do_notify_parent() and pidfs_exit()

Consider a process with a group leader L and a sub-thread T.
L does sys_exit(1), then T does sys_exit_group(2).

In this case wait_task_zombie(L) will notice SIGNAL_GROUP_EXIT and use
L->signal->group_exit_code; this is correct.

But, before that, do_notify_parent(L) called by release_task(T) will use
L->exit_code != L->signal->group_exit_code, and this is not consistent.
We don't really care; I think that nobody relies on the info which comes
with SIGCHLD, since if nothing else SIGCHLD < SIGRTMIN can be queued only once.

But pidfs_exit() is more problematic; I think pidfs_exit_info->exit_code
should report ->group_exit_code in this case, just like wait_task_zombie().
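
A minimal userspace sketch of the L/T scenario above (not from the
commit; assumes Linux with POSIX threads, build with "cc -pthread"):

/* Leader L exits with code 1; subthread T then calls exit_group(2).
 * The parent's wait status should report 2, i.e. ->group_exit_code. */
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void *T(void *arg)
{
	(void)arg;
	sleep(1);			/* let the leader exit first */
	syscall(SYS_exit_group, 2);	/* T does sys_exit_group(2) */
	return NULL;
}

int main(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		pthread_t t;

		pthread_create(&t, NULL, T, NULL);
		syscall(SYS_exit, 1);	/* L does sys_exit(1), leader becomes a zombie */
	}

	int status;
	waitpid(pid, &status, 0);
	printf("WEXITSTATUS = %d\n", WEXITSTATUS(status));	/* 2, the group exit code */
	return 0;
}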

TODO: with this change we can hopefully clean up (or maybe even kill) the
similar SIGNAL_GROUP_EXIT checks, at least in wait_task_zombie().

Signed-off-by: Oleg Nesterov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

Revision tags: v6.14
# 0b7747a5 23-Mar-2025 Oleg Nesterov <[email protected]>

pidfs: cleanup the usage of do_notify_pidfd()

If a single-threaded process exits, do_notify_pidfd() will be called twice:
from exit_notify() and right after that from do_notify_parent().

1. Change exit_notify() to call do_notify_pidfd() if the exiting task is
not ptraced and it is not a group leader.

2. Change do_notify_parent() to call do_notify_pidfd() unconditionally.

If tsk is not ptraced, do_notify_parent() will only be called when it
is a group-leader and thread_group_empty() is true.

This means that if tsk is ptraced, do_notify_pidfd() will be called from
do_notify_parent() even if tsk is a delay_group_leader(). But this case is
less common and, apart from the unnecessary __wake_up(), harmless.

Granted, this unnecessary __wake_up() can be avoided, but I don't want to
do it in this patch because it's just a consequence of another historical
oddity: we notify the tracer even if !thread_group_empty(), but do_wait()
from the debugger can't work until all other threads exit. With or without this
patch we should either eliminate do_notify_parent() in this case, or change
do_wait(WEXITED) to untrace the ptraced delay_group_leader() at least when
ptrace_reparented().

Signed-off-by: Oleg Nesterov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

# 0fb48272 20-Mar-2025 Christian Brauner <[email protected]>

pidfs: improve multi-threaded exec and premature thread-group leader exit polling

This is another attempt trying to make pidfd polling for multi-threaded
exec and premature thread-group leader exit consistent.

A quick recap of these two cases:

(1) During a multi-threaded exec by a subthread, i.e., non-thread-group
leader thread, all other threads in the thread-group including the
thread-group leader are killed and the struct pid of the
thread-group leader will be taken over by the subthread that called
exec. IOW, two tasks change their TIDs.

(2) A premature thread-group leader exit means that the thread-group
leader exited before all of the other subthreads in the thread-group
have exited.

Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD.
Any caller that holds a PIDFD_THREAD pidfd to the current thread-group
leader may or may not see an exit notification on the file descriptor
depending on when poll is performed. If the poll is performed before the
exec of the subthread has concluded, an exit notification is generated
for the old thread-group leader. If the poll is performed after the exec
of the subthread has concluded, no exit notification is generated for the
old thread-group leader.

The correct behavior would be to simply not generate an exit
notification on the struct pid of a subthread exec because the struct
pid is taken over by the subthread and thus remains alive.

But this is difficult to handle because a thread-group leader may exit
prematurely as mentioned in (2). In that case an exit notification is
reliably generated but the subthreads may continue to run for an
indeterminate amount of time and thus also may exec at some point.

So far there was no way to distinguish between (1) and (2) internally.
This tiny series tries to address this problem by discarding the
PIDFD_THREAD notification on premature thread-group leader exit.

If that works correctly then no exit notifications are generated for a
PIDFD_THREAD pidfd for a thread-group leader until all subthreads have
been reaped. If a subthread should exec afterwards, no exit notification
will be generated until that task exits or it creates subthreads and
repeats the cycle.
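
A small userspace probe of case (2) (not part of the series; assumes
pidfd_open() support and falls back to the uapi definition
PIDFD_THREAD == O_EXCL from include/uapi/linux/pidfd.h):

#include <fcntl.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef PIDFD_THREAD
#define PIDFD_THREAD O_EXCL	/* as in include/uapi/linux/pidfd.h */
#endif

static void *subthread(void *arg)
{
	(void)arg;
	sleep(3);		/* outlive the leader */
	return NULL;
}

int main(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		pthread_t t;

		pthread_create(&t, NULL, subthread, NULL);
		syscall(SYS_exit, 0);	/* premature leader exit, case (2) */
	}

	int pidfd = syscall(SYS_pidfd_open, pid, PIDFD_THREAD);
	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };

	/* Subthread still alive: with this series, expect 0 (timeout). */
	printf("poll at +1s: %d\n", poll(&pfd, 1, 1000));
	/* Subthread gone: expect 1 (POLLIN). */
	printf("poll at +5s: %d\n", poll(&pfd, 1, 4000));
	waitpid(pid, NULL, 0);
	return 0;
}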

Co-Developed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

Revision tags: v6.14-rc7, v6.14-rc6
# 45135229 05-Mar-2025 Christian Brauner <[email protected]>

pidfs: record exit code and cgroupid at exit

Record the exit code and cgroupid in release_task() and stash them in
struct pidfs_exit_info so they can be retrieved even after the task has
been reaped.
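
A hedged userspace sketch of reading the stashed info back after the
reap (not part of this patch; assumes uapi headers new enough to
provide PIDFD_GET_INFO, PIDFD_INFO_EXIT and struct pidfd_info):

#include <linux/pidfd.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(42);

	int pidfd = syscall(SYS_pidfd_open, pid, 0);

	waitpid(pid, NULL, 0);	/* reap: the task is gone... */

	struct pidfd_info info = {
		.mask = PIDFD_INFO_EXIT | PIDFD_INFO_CGROUPID,
	};

	/* ...but pidfs kept the exit info in pidfs_exit_info. */
	if (ioctl(pidfd, PIDFD_GET_INFO, &info) == 0 && (info.mask & PIDFD_INFO_EXIT))
		printf("exit_code %d, cgroupid %llu\n",
		       WEXITSTATUS(info.exit_code),
		       (unsigned long long)info.cgroupid);
	return 0;
}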

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-5-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2
# 7903f907 06-Feb-2025 Mateusz Guzik <[email protected]>

pid: perform free_pid() calls outside of tasklist_lock

As the clone side already executes pid allocation with only pidmap_lock
held, issuing free_pid() while still holding tasklist_lock exacerbates
the total hold time of the latter.

More things may show up later which require initial cleanup with the
lock held but allow finishing without it. For that reason, a struct to
collect such work is added instead of merely passing the pid array.
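
The shape of that pattern, as a generic userspace illustration (names
like release_work and detach_object are invented for the sketch, not
taken from the kernel code):

#include <pthread.h>
#include <stdlib.h>

/* Collect pointers that need freeing while the lock is held; free them
 * only after the lock is dropped, keeping the critical section short. */
struct release_work {
	void *to_free[8];
	int nr;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

static void detach_object(void *obj, struct release_work *work)
{
	/* ... unlink obj from shared structures under the lock ... */
	work->to_free[work->nr++] = obj;	/* defer the actual free */
}

static void release_objects(void **objs, int nr)	/* nr <= 8 here */
{
	struct release_work work = { .nr = 0 };

	pthread_mutex_lock(&list_lock);
	for (int i = 0; i < nr; i++)
		detach_object(objs[i], &work);
	pthread_mutex_unlock(&list_lock);

	for (int i = 0; i < work.nr; i++)
		free(work.to_free[i]);		/* outside the lock */
}

int main(void)
{
	void *objs[2] = { malloc(16), malloc(16) };

	release_objects(objs, 2);
	return 0;
}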

Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: "Liam R. Howlett" <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

# 6731cd97 06-Feb-2025 Mateusz Guzik <[email protected]>

exit: hoist get_pid() in release_task() outside of tasklist_lock

Reduces hold time as get_pid() contains an atomic.
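
The hoisting pattern, as a generic sketch (illustrative only, not the
kernel diff; obj_get() stands in for get_pid()):

#include <pthread.h>
#include <stdatomic.h>

struct obj {
	atomic_int refcount;
};

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->refcount, 1);	/* the atomic op */
}

void release(struct obj *o)
{
	obj_get(o);			/* hoisted: runs before... */
	pthread_mutex_lock(&big_lock);	/* ...the contended lock is taken */
	/* ... unlink o from shared structures ... */
	pthread_mutex_unlock(&big_lock);
}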

Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: "Liam R. Howlett" <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

# 1ab27856 06-Feb-2025 Mateusz Guzik <[email protected]>

exit: perform add_device_randomness() without tasklist_lock

Parallel calls to add_device_randomness() contend on their own.

The clone side already runs outside of tasklist_lock, which in turn means
any caller on the exit side extends the tasklist_lock hold time while
contending on the random-private lock.

Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: "Liam R. Howlett" <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

# 43966114 06-Feb-2025 Oleg Nesterov <[email protected]>

exit: kill the pointless __exit_signal()->clear_tsk_thread_flag(TIF_SIGPENDING)

It predates the git history and most probably it was never needed. It
doesn't really hurt, but it looks confusing because its purpose is not
clear at all.

release_task(p) is called when this task has already passed exit_notify(),
so signal_pending(p) == T shouldn't make any difference.

And even _if_ there were a subtle reason to clear TIF_SIGPENDING after
exit_notify(), this clear_tsk_thread_flag() can't help anyway. If the
exiting task is a group leader or if it is ptraced, release_task() will
likely be called when this task has already done its last schedule() from
do_task_dead().

Signed-off-by: Oleg Nesterov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

# fb3bbcfe 06-Feb-2025 Oleg Nesterov <[email protected]>

exit: change the release_task() paths to call flush_sigqueue() lockless

A task can block a signal, accumulate up to RLIMIT_SIGPENDING sigqueues,
and exit. In this case __exit_signal()->flush_sigqueue() called with irqs
disabled can trigger a hard lockup, see
https://lore.kernel.org/all/[email protected]/
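
The triggering shape, reconstructed from the description above as a
small userspace program (illustrative only, not the reproducer from the
linked report):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	sigset_t set;
	union sigval val = { .sival_int = 0 };
	long queued = 0;

	sigemptyset(&set);
	sigaddset(&set, SIGRTMIN);
	sigprocmask(SIG_BLOCK, &set, NULL);	/* block, so nothing is delivered */

	/* Queue until RLIMIT_SIGPENDING is hit and sigqueue() fails. */
	while (sigqueue(getpid(), SIGRTMIN, val) == 0)
		queued++;

	printf("queued %ld siginfos, exiting with them pending\n", queued);
	return 0;	/* __exit_signal()->flush_sigqueue() frees the backlog */
}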

Fortunately, after the recent posixtimer changes sys_timer_delete() paths
no longer try to clear SIGQUEUE_PREALLOC and/or free tmr->sigq, and after
the exiting task passes __exit_signal() lock_task_sighand() can't succeed
and pid_task(tmr->it_pid) will return NULL.

This means that after __exit_signal(tsk) nobody can play with tsk->pending
or (if group_dead) with tsk->signal->shared_pending, so release_task() can
safely call flush_sigqueue() after write_unlock_irq(&tasklist_lock).

TODO:
- we can probably shift posix_cpu_timers_exit() as well
- do_sigaction() can hit a similar problem

Signed-off-by: Oleg Nesterov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

Revision tags: v6.14-rc1
# 1751f872 28-Jan-2025 Joel Granados <[email protected]>

treewide: const qualify ctl_tables where applicable

Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/infiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.

Created this by running a spatch followed by a sed command:
Spatch:
virtual patch

@
depends on !(file in "net")
disable optional_qualifier
@

identifier table_name != {
watchdog_hardlockup_sysctl,
iwcm_ctl_table,
ucma_ctl_table,
memory_allocation_profiling_sysctls,
loadpin_sysctl_table
};
@@

+ const
struct ctl_table table_name [] = { ... };

sed:
sed --in-place \
-e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
kernel/utsname_sysctl.c

Reviewed-by: Song Liu <[email protected]>
Acked-by: Steven Rostedt (Google) <[email protected]> # for kernel/trace/
Reviewed-by: Martin K. Petersen <[email protected]> # SCSI
Reviewed-by: Darrick J. Wong <[email protected]> # xfs
Acked-by: Jani Nikula <[email protected]>
Acked-by: Corey Minyard <[email protected]>
Acked-by: Wei Liu <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Reviewed-by: Bill O'Donnell <[email protected]>
Acked-by: Baoquan He <[email protected]>
Acked-by: Ashutosh Dixit <[email protected]>
Acked-by: Anna Schumaker <[email protected]>
Signed-off-by: Joel Granados <[email protected]>

Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3
# be5498ca 03-Jun-2024 Al Viro <[email protected]>

remove pointless includes of <linux/fdtable.h>

some of those used to be needed, some had been cargo-culted for
no reason...

Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>

# fbe76a65 24-Jul-2024 Pasha Tatashin <[email protected]>

task_stack: uninline stack_not_used

Given that stack_not_used() is not a performance-critical function,
uninline it.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pasha Tatashin <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Li Zhijian <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# c4a6fce8 24-Jul-2024 Pasha Tatashin <[email protected]>

vmstat: kernel stack usage histogram

As part of the dynamic kernel stack project, we need to know the amount of
data that can be saved by reducing the default kernel stack size [1].

Provide a kernel stack usage histogram to aid in optimizing kernel stack
sizes and minimizing memory waste in large-scale environments. The
histogram divides stack usage into power-of-two buckets and reports the
results in /proc/vmstat. This information is especially valuable in
environments with millions of machines, where even small optimizations can
have a significant impact.

The histogram data is presented in /proc/vmstat with entries like
"kstack_1k", "kstack_2k", and so on, indicating the number of threads that
exited with stack usage falling within each respective bucket.

Example outputs:
Intel:
$ grep kstack /proc/vmstat
kstack_1k 3
kstack_2k 188
kstack_4k 11391
kstack_8k 243
kstack_16k 0

ARM with 64K page_size:
$ grep kstack /proc/vmstat
kstack_1k 1
kstack_2k 340
kstack_4k 25212
kstack_8k 1659
kstack_16k 0
kstack_32k 0
kstack_64k 0

Note: once the dynamic kernel stack is implemented, the usability of this
feature will depend on the implementation: on hardware that supports
faults on kernel stacks, we will have other metrics that show the total
number of pages allocated for stacks. On hardware where faults are not
supported, we will most likely have some optimization where only some
threads are extended, and for those, these metrics will still be very
useful.

[1] https://lwn.net/Articles/974367

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pasha Tatashin <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Li Zhijian <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# b8e75312 24-Jul-2024 Paul E. McKenney <[email protected]>

exit: Sleep at TASK_IDLE when waiting for application core dump

Currently, the coredump_task_exit() function sets the task state
to TASK_UNINTERRUPTIBLE|TASK_FREEZABLE, which usually works well.
But a combination of large memory and slow (and/or highly contended)
mass storage can cause application core dumps to take more than
two minutes, which can cause check_hung_task(), which is invoked by
check_hung_uninterruptible_tasks(), to produce task-blocked splats.
There does not seem to be any reasonable benefit to getting these splats.

Furthermore, as Oleg Nesterov points out, TASK_UNINTERRUPTIBLE could
be misleading because the task sleeping in coredump_task_exit() really
is killable, albeit indirectly. See the check of signal->core_state
in prepare_signal() and the check of fatal_signal_pending()
in dump_interrupted(), which bypass the normal unkillability of
TASK_UNINTERRUPTIBLE, resulting in coredump_finish() invoking
wake_up_process() on any threads sleeping in coredump_task_exit().

Therefore, change that TASK_UNINTERRUPTIBLE to TASK_IDLE.

Reported-by: Anhad Jai Singh <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Rik van Riel <[email protected]>

# d73d0035 26-Jun-2024 Oleg Nesterov <[email protected]>

memcg: mm_update_next_owner: move for_each_thread() into try_to_set_owner()

mm_update_next_owner() checks the children / real_parent->children to
avoid the "everything else" loop in the likely case, but this won't work
if a child/sibling has a zombie leader with ->mm == NULL.

Move the for_each_thread() logic into try_to_set_owner(); if nothing else,
this makes the children/siblings/everything searches more consistent.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jinliang Zheng <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Tycho Andersen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# 2a22b773 26-Jun-2024 Oleg Nesterov <[email protected]>

memcg: mm_update_next_owner: kill the "retry" logic

Add the new helper, try_to_set_owner(), which tries to update mm->owner
once we see c->mm == mm. This way mm_update_next_owner() doesn't need to
restart the list_for_each_entry/for_each_process loops from the very
beginning if it races with exit/exec; it can just continue.

Unlike the current code, try_to_set_owner() re-checks tsk->mm == mm before
it drops tasklist_lock, so it doesn't need get/put_task_struct().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jinliang Zheng <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Tycho Andersen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# 76ba6acf 20-Jun-2024 Jinliang Zheng <[email protected]>

mm: optimize the redundant loop of mm_update_owner_next()

When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or
/proc or ptrace or page migration (get_task_mm()), it is impossible to
find an appropriate task_struct in the loop whose mm_struct is the same as
the target mm_struct.

If the above race condition is combined with the stress-ng-zombie and
stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in
write_lock_irq() for tasklist_lock.

Recognize this situation in advance and exit early.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jinliang Zheng <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# cf3f9a59 20-Jun-2024 Jinliang Zheng <[email protected]>

mm: optimize the redundant loop of mm_update_owner_next()

When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or
/proc or ptrace or page migration (get_task_mm()), it is impossible to
find an appropriate task_struct in the loop whose mm_struct is the same as
the target mm_struct.

If the above race condition is combined with the stress-ng-zombie and
stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in
write_lock_irq() for tasklist_lock.

Recognize this situation in advance and exit early.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jinliang Zheng <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# bfafe5ef 28-Jun-2024 Andrei Vagin <[email protected]>

seccomp: release task filters when the task exits

Previously, seccomp filters were released in release_task(), which
required the process to exit and its zombie to be collected. However,
exited threads/processes can't trigger any seccomp events, making it
more logical to release filters upon task exit.

This adjustment simplifies scenarios where a parent is tracing its child
process. The parent process can now handle all events from a seccomp
listening descriptor and then call wait to collect a child zombie.

seccomp_filter_release takes the siglock to avoid races with
seccomp_sync_threads. There was an idea to bypass taking the lock by
checking PF_EXITING, but it can be set without holding siglock if
threads have SIGNAL_GROUP_EXIT. This means it can happen concurrently
with seccomp_filter_release.

This change also fixes another minor problem. Suppose that a group
leader installs the new filter without SECCOMP_FILTER_FLAG_TSYNC, exits,
and becomes a zombie. Without this change, SECCOMP_FILTER_FLAG_TSYNC
from any other thread can never succeed: seccomp_can_sync_threads() will
check a zombie leader and is_ancestor() will fail.

Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Tycho Andersen <[email protected]>
Signed-off-by: Kees Cook <[email protected]>

Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1
# 240a1853 16-Mar-2024 Mike Christie <[email protected]>

kernel: Remove signal hacks for vhost_tasks

This removes the signal/coredump hacks added for vhost_tasks in:

Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")

When that patch was added, vhost_tasks did not handle SIGKILL and would
try to ignore/clear the signal and continue on until the device's close
function was called. In the previous patches vhost_tasks and the vhost
drivers were converted to support SIGKILL by cleaning themselves up and
exiting. The hacks are no longer needed so this removes them.

Signed-off-by: Mike Christie <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

Revision tags: v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1
# 11a92190 27-Jun-2023 Joel Granados <[email protected]>

kernel misc: Remove the now superfluous sentinel elements from ctl_table array

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels), which
will reduce the overall build-time size of the kernel and run-time
memory bloat by ~64 bytes per sentinel (further information:
https://lore.kernel.org/all/ZO5Yx5JFogGi%[email protected]/)

Remove the sentinel from ctl_table arrays. Reduce by one the values used
to compare the size of the adjusted arrays.

Signed-off-by: Joel Granados <[email protected]>

# c1be35a1 23-Jan-2024 Oleg Nesterov <[email protected]>

exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock)

After the recent changes nobody uses siglock to read the values protected
by stats_lock, so we can kill spin_lock_irq(&current->sighand->siglock) and
update the comment.

With this patch only __exit_signal() and thread_group_start_cputime() take
stats_lock under siglock.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Dylan Hatch <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# e2e8a142 05-Feb-2024 Oleg Nesterov <[email protected]>

pidfd: exit: kill the no longer used thread_group_exited()

It was used by pidfd_poll() but now it has no callers.

If it finally finds a modular user we can revert this change, but note
that the comment above this helper and the changelog in 38fd525a4c61
("exit: Factor thread_group_exited out of pidfd_poll") are not accurate:
thread_group_exited() won't return true if all other threads have passed
exit_notify() and are zombies; it returns true only when all other threads
are completely gone. Not to mention that it can only work if the task
identified by @pid is a thread-group leader.

Signed-off-by: Oleg Nesterov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Tycho Andersen <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

# 64bef697 31-Jan-2024 Oleg Nesterov <[email protected]>

pidfd: implement PIDFD_THREAD flag for pidfd_open()

With this flag:

- pidfd_open() doesn't require that the target task must be
a thread-group leader

- pidfd_poll() succeeds when the task exits and becomes a
zombie (iow, passes exit_notify()), even if it is a leader
and thread-group is not empty.

This means that the behaviour of pidfd_poll(PIDFD_THREAD,
pid-of-group-leader) is not well defined if it races with
exec() from its sub-thread; pidfd_poll() can succeed or not
depending on whether pidfd_task_exited() is called before
or after exchange_tids().

Perhaps we can improve this behaviour later; pidfd_poll()
can probably take sig->group_exec_task into account. But
this doesn't really differ from the case when the leader
exits before other threads (so pidfd_poll() succeeds) and
then another thread execs and pidfd_poll() will block again.

thread_group_exited() is no longer used; perhaps it can die.
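
A minimal usage sketch (not from the patch; assumes pidfd_open()
support, with a fallback to the uapi definition PIDFD_THREAD == O_EXCL):

#include <fcntl.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef PIDFD_THREAD
#define PIDFD_THREAD O_EXCL	/* include/uapi/linux/pidfd.h */
#endif

static volatile pid_t worker_tid;

static void *worker(void *arg)
{
	(void)arg;
	worker_tid = syscall(SYS_gettid);
	sleep(1);
	return NULL;	/* thread exit triggers the PIDFD_THREAD notification */
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	usleep(100 * 1000);	/* crude: wait for worker_tid to be published */

	/* A plain subthread TID; without PIDFD_THREAD this would fail. */
	int pidfd = syscall(SYS_pidfd_open, worker_tid, PIDFD_THREAD);
	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };

	printf("poll: %d (1 once the thread has exited)\n", poll(&pfd, 1, 5000));
	pthread_join(t, NULL);
	return 0;
}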

Co-developed-by: Tycho Andersen <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Tested-by: Tycho Andersen <[email protected]>
Reviewed-by: Tycho Andersen <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

# 6dfeff09 11-Dec-2023 Matthew Wilcox (Oracle) <[email protected]>

wait: Remove uapi header file from main header file

There's really no overlap between uapi/linux/wait.h and linux/wait.h.
There are two files which rely on the uapi file being implicitly included,
so explicitly include it there and remove it from the main header file.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
