|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1 |
|
| #
9133607d |
| 24-Mar-2025 |
Oleg Nesterov <[email protected]> |
exit: fix the usage of delay_group_leader->exit_code in do_notify_parent() and pidfs_exit()
Consider a process with a group leader L and a sub-thread T. L does sys_exit(1), then T does sys_exit_group(2).
In this case wait_task_zombie(L) will notice SIGNAL_GROUP_EXIT and use L->signal->group_exit_code; this is correct.
But, before that, do_notify_parent(L) called by release_task(T) will use L->exit_code != L->signal->group_exit_code, and this is not consistent. We don't really care: I think nobody relies on the info which comes with SIGCHLD; if nothing else, SIGCHLD < SIGRTMIN can be queued only once.
But pidfs_exit() is more problematic: I think pidfs_exit_info->exit_code should report ->group_exit_code in this case, just like wait_task_zombie().
TODO: with this change we can hopefully clean up (or maybe even kill) the similar SIGNAL_GROUP_EXIT checks, at least in wait_task_zombie().
Signed-off-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
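A sketch of the intended selection (illustrative only, not the verbatim kernel code; the helper name is made up):

    static int exit_code_to_report(struct task_struct *tsk)
    {
            /* whole-group exit: report the group exit code, as
             * wait_task_zombie() does, instead of the leader's
             * private ->exit_code */
            if (tsk->signal->flags & SIGNAL_GROUP_EXIT)
                    return tsk->signal->group_exit_code;
            return tsk->exit_code;
    }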
|
|
Revision tags: v6.14 |
|
| #
0b7747a5 |
| 23-Mar-2025 |
Oleg Nesterov <[email protected]> |
pidfs: cleanup the usage of do_notify_pidfd()
If a single-threaded process exits do_notify_pidfd() will be called twice, from exit_notify() and right after that from do_notify_parent().
1. Change exit_notify() to call do_notify_pidfd() if the exiting task is not ptraced and it is not a group leader.
2. Change do_notify_parent() to call do_notify_pidfd() unconditionally.
If tsk is not ptraced, do_notify_parent() will only be called when it is a group-leader and thread_group_empty() is true.
This means that if tsk is ptraced, do_notify_pidfd() will be called from do_notify_parent() even if tsk is a delay_group_leader(). But this case is less common and, apart from the unnecessary __wake_up(), is harmless.
Granted, this unnecessary __wake_up() can be avoided, but I don't want to do it in this patch because it's just a consequence of another historical oddity: we notify the tracer even if !thread_group_empty(), but do_wait() from debugger can't work until all other threads exit. With or without this patch we should either eliminate do_notify_parent() in this case, or change do_wait(WEXITED) to untrace the ptraced delay_group_leader() at least when ptrace_reparented().
Signed-off-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
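The two notification points can be sketched like this (simplified, bodies elided; only the conditions matter):

    /* 1. in exit_notify(): not ptraced and not the group leader */
    if (!tsk->ptrace && !thread_group_leader(tsk))
            do_notify_pidfd(tsk);

    /* 2. in do_notify_parent(): now unconditional */
    do_notify_pidfd(tsk);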
|
| #
0fb48272 |
| 20-Mar-2025 |
Christian Brauner <[email protected]> |
pidfs: improve multi-threaded exec and premature thread-group leader exit polling
This is another attempt at making pidfd polling for multi-threaded exec and premature thread-group leader exit consistent.
A quick recap of these two cases:
(1) During a multi-threaded exec by a subthread, i.e., non-thread-group leader thread, all other threads in the thread-group including the thread-group leader are killed and the struct pid of the thread-group leader will be taken over by the subthread that called exec. IOW, two tasks change their TIDs.
(2) A premature thread-group leader exit means that the thread-group leader exited before all of the other subthreads in the thread-group have exited.
Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD. Any caller that holds a PIDFD_THREAD pidfd to the current thread-group leader may or may not see an exit notification on the file descriptor, depending on when poll is performed. If the poll is performed before the exec of the subthread has concluded, an exit notification is generated for the old thread-group leader. If the poll is performed after the exec of the subthread has concluded, no exit notification is generated for the old thread-group leader.
The correct behavior would be to simply not generate an exit notification on the struct pid of a subthread exec, because the struct pid is taken over by the subthread and thus remains alive.
But this is difficult to handle because a thread-group leader may exit prematurely, as mentioned in (2). In that case an exit notification is reliably generated, but the subthreads may continue to run for an indeterminate amount of time and thus also may exec at some point.
So far there was no way to distinguish between (1) and (2) internally. This tiny series tries to address this problem by discarding PIDFD_THREAD notification on premature thread-group leader exit.
If that works correctly then no exit notifications are generated for a PIDFD_THREAD pidfd for a thread-group leader until all subthreads have been reaped. If a subthread should exec afterwards, no exit notification will be generated until that task exits or it creates subthreads and repeats the cycle.
Co-developed-by: Oleg Nesterov <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
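For reference, the userspace-observable behavior can be exercised with a minimal poller like the sketch below (assumes a kernel with PIDFD_THREAD; error handling trimmed):

    #define _GNU_SOURCE
    #include <fcntl.h>              /* O_EXCL, which PIDFD_THREAD aliases */
    #include <poll.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef PIDFD_THREAD
    #define PIDFD_THREAD O_EXCL     /* from <linux/pidfd.h> */
    #endif

    int main(int argc, char **argv)
    {
            if (argc < 2)
                    return 1;

            pid_t pid = (pid_t)atoi(argv[1]);
            int pidfd = syscall(SYS_pidfd_open, pid, PIDFD_THREAD);
            struct pollfd pfd = { .fd = pidfd, .events = POLLIN };

            /* With this series, a thread-group leader's PIDFD_THREAD
             * pidfd does not fire on a premature leader exit; it fires
             * once the whole group is gone. */
            if (pidfd < 0 || poll(&pfd, 1, -1) < 0)
                    return 1;
            printf("exit notification for %d (revents=%#x)\n",
                   pid, (unsigned)pfd.revents);
            return 0;
    }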
|
|
Revision tags: v6.14-rc7, v6.14-rc6 |
|
| #
45135229 |
| 05-Mar-2025 |
Christian Brauner <[email protected]> |
pidfs: record exit code and cgroupid at exit
Record the exit code and cgroupid in release_task() and stash in struct pidfs_exit_info so it can be retrieved even after the task has been reaped.
Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-5-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
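The stashed information amounts to a small struct along these lines (a sketch; the real definition lives in fs/pidfs.c and may carry more detail):

    struct pidfs_exit_info {
            __u64 cgroupid;
            __s32 exit_code;
    };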
|
|
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2 |
|
| #
7903f907 |
| 06-Feb-2025 |
Mateusz Guzik <[email protected]> |
pid: perform free_pid() calls outside of tasklist_lock
As the clone side already executes pid allocation with only pidmap_lock held, issuing free_pid() while still holding tasklist_lock exacerbates the total hold time of the latter.
More things may show up later which require initial cleanup with the lock held but can finish without it. For that reason, a struct to collect such work is added instead of merely passing the pid array.
Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Acked-by: "Liam R. Howlett" <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
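The shape of the collection is roughly the following (a sketch; the real struct and helper names in kernel/exit.c and kernel/pid.c may differ):

    /* work gathered under tasklist_lock, finished after dropping it */
    struct release_task_post {
            struct pid *pids[PIDTYPE_MAX];
    };

    /* usage pattern */
    write_lock_irq(&tasklist_lock);
    /* ... detach the pids into post.pids ... */
    write_unlock_irq(&tasklist_lock);
    free_pids(post.pids);   /* no tasklist_lock held */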
|
| #
6731cd97 |
| 06-Feb-2025 |
Mateusz Guzik <[email protected]> |
exit: hoist get_pid() in release_task() outside of tasklist_lock
Reduces hold time as get_pid() contains an atomic.
Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Acked-by: "Liam R. Howlett" <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
1ab27856 |
| 06-Feb-2025 |
Mateusz Guzik <[email protected]> |
exit: perform add_device_randomness() without tasklist_lock
Parallel calls to add_device_randomness() contend on their own.
The clone side already runs outside of tasklist_lock, which in turn means any caller on the exit side extends the tasklist_lock hold time while contending on the random-private lock.
Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Acked-by: "Liam R. Howlett" <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
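A sketch of the resulting shape (simplified):

    write_lock_irq(&tasklist_lock);
    /* ... unlink the exiting task ... */
    write_unlock_irq(&tasklist_lock);

    /* now outside tasklist_lock: contends only on the RNG's own lock */
    add_device_randomness((const void *)&p->se.sum_exec_runtime,
                          sizeof(unsigned long long));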
|
| #
43966114 |
| 06-Feb-2025 |
Oleg Nesterov <[email protected]> |
exit: kill the pointless __exit_signal()->clear_tsk_thread_flag(TIF_SIGPENDING)
It predates the git history and most probably was never needed. It doesn't really hurt, but it looks confusing because its purpose is not clear at all.
release_task(p) is called when this task has already passed exit_notify(), so signal_pending(p) == T shouldn't make any difference.
And even _if_ there were a subtle reason to clear TIF_SIGPENDING after exit_notify(), this clear_tsk_thread_flag() can't help anyway. If the exiting task is a group leader or if it is ptraced, release_task() will likely be called after this task has already done its last schedule() from do_task_dead().
Signed-off-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Acked-by: Frederic Weisbecker <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
fb3bbcfe |
| 06-Feb-2025 |
Oleg Nesterov <[email protected]> |
exit: change the release_task() paths to call flush_sigqueue() lockless
A task can block a signal, accumulate up to RLIMIT_SIGPENDING sigqueues, and exit. In this case __exit_signal()->flush_sigqueue() called with irqs disabled can trigger a hard lockup, see https://lore.kernel.org/all/[email protected]/
Fortunately, after the recent posixtimer changes sys_timer_delete() paths no longer try to clear SIGQUEUE_PREALLOC and/or free tmr->sigq, and after the exiting task passes __exit_signal() lock_task_sighand() can't succeed and pid_task(tmr->it_pid) will return NULL.
This means that after __exit_signal(tsk) nobody can play with tsk->pending or (if group_dead) with tsk->signal->shared_pending, so release_task() can safely call flush_sigqueue() after write_unlock_irq(&tasklist_lock).
TODO:
- we can probably shift posix_cpu_timers_exit() as well
- do_sigaction() can hit a similar problem
Signed-off-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Frederic Weisbecker <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
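The resulting ordering, sketched (simplified from the description above):

    write_lock_irq(&tasklist_lock);
    __exit_signal(p);       /* after this nobody can queue to p->pending */
    write_unlock_irq(&tasklist_lock);

    /* safe, and no longer under an irq-disabled lock */
    flush_sigqueue(&p->pending);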
|
|
Revision tags: v6.14-rc1 |
|
| #
1751f872 |
| 28-Jan-2025 |
Joel Granados <[email protected]> |
treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/infiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers.
Created this by running an spatch followed by a sed command:

Spatch:
    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
            watchdog_hardlockup_sysctl,
            iwcm_ctl_table,
            ucma_ctl_table,
            memory_allocation_profiling_sysctls,
            loadpin_sysctl_table
    };
    @@

    + const struct ctl_table table_name [] = { ... };

sed:
    sed --in-place \
        -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
        kernel/utsname_sysctl.c
Reviewed-by: Song Liu <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]> # for kernel/trace/ Reviewed-by: Martin K. Petersen <[email protected]> # SCSI Reviewed-by: Darrick J. Wong <[email protected]> # xfs Acked-by: Jani Nikula <[email protected]> Acked-by: Corey Minyard <[email protected]> Acked-by: Wei Liu <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Reviewed-by: Bill O'Donnell <[email protected]> Acked-by: Baoquan He <[email protected]> Acked-by: Ashutosh Dixit <[email protected]> Acked-by: Anna Schumaker <[email protected]> Signed-off-by: Joel Granados <[email protected]>
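After the conversion a table looks like this (a minimal made-up example, not from the patch):

    static int example_value;

    static const struct ctl_table example_table[] = {
            {
                    .procname       = "example_value",
                    .data           = &example_value,
                    .maxlen         = sizeof(int),
                    .mode           = 0644,
                    .proc_handler   = proc_dointvec,
            },
    };
    /* the array now resides in .rodata, so its proc_handler
     * pointers cannot be overwritten at run time */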
|
|
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3 |
|
| #
be5498ca |
| 03-Jun-2024 |
Al Viro <[email protected]> |
remove pointless includes of <linux/fdtable.h>
some of those used to be needed, some had been cargo-culted for no reason...
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
|
| #
fbe76a65 |
| 24-Jul-2024 |
Pasha Tatashin <[email protected]> |
task_stack: uninline stack_not_used
Given that stack_not_used() is not a performance-critical function, uninline it.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: Shakeel Butt <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Li Zhijian <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
c4a6fce8 |
| 24-Jul-2024 |
Pasha Tatashin <[email protected]> |
vmstat: kernel stack usage histogram
As part of the dynamic kernel stack project, we need to know the amount of data that can be saved by reducing the default kernel stack size [1].
Provide a kernel stack usage histogram to aid in optimizing kernel stack sizes and minimizing memory waste in large-scale environments. The histogram divides stack usage into power-of-two buckets and reports the results in /proc/vmstat. This information is especially valuable in environments with millions of machines, where even small optimizations can have a significant impact.
The histogram data is presented in /proc/vmstat with entries like "kstack_1k", "kstack_2k", and so on, indicating the number of threads that exited with stack usage falling within each respective bucket.
Example outputs:

Intel:
    $ grep kstack /proc/vmstat
    kstack_1k 3
    kstack_2k 188
    kstack_4k 11391
    kstack_8k 243
    kstack_16k 0

ARM with 64K page_size:
    $ grep kstack /proc/vmstat
    kstack_1k 1
    kstack_2k 340
    kstack_4k 25212
    kstack_8k 1659
    kstack_16k 0
    kstack_32k 0
    kstack_64k 0
Note: once the dynamic kernel stack is implemented, the usability of this feature will depend on the implementation: on hardware that supports faults on kernel stacks, we will have other metrics that show the total number of pages allocated for stacks; on hardware where faults are not supported, we will most likely have some optimization where only some threads are extended, and for those, these metrics will still be very useful.
[1] https://lwn.net/Articles/974367
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pasha Tatashin <[email protected]> Reviewed-by: Kent Overstreet <[email protected]> Acked-by: Shakeel Butt <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Li Zhijian <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
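The bucketing itself is simple; roughly (a sketch, assuming vmstat event counters named after the entries above):

    static void kstack_histogram_sketch(unsigned long used)
    {
            if (used <= 1024)
                    count_vm_event(KSTACK_1K);
            else if (used <= 2048)
                    count_vm_event(KSTACK_2K);
            else if (used <= 4096)
                    count_vm_event(KSTACK_4K);
            else if (used <= 8192)
                    count_vm_event(KSTACK_8K);
            else
                    count_vm_event(KSTACK_16K);
            /* larger buckets exist on 64K-page configurations */
    }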
|
| #
b8e75312 |
| 24-Jul-2024 |
Paul E. McKenney <[email protected]> |
exit: Sleep at TASK_IDLE when waiting for application core dump
Currently, the coredump_task_exit() function sets the task state to TASK_UNINTERRUPTIBLE|TASK_FREEZABLE, which usually works well. But a combination of large memory and slow (and/or highly contended) mass storage can cause application core dumps to take more than two minutes, which can cause check_hung_task(), which is invoked by check_hung_uninterruptible_tasks(), to produce task-blocked splats. There does not seem to be any reasonable benefit to getting these splats.
Furthermore, as Oleg Nesterov points out, TASK_UNINTERRUPTIBLE could be misleading because the task sleeping in coredump_task_exit() really is killable, albeit indirectly. See the check of signal->core_state in prepare_signal() and the check of fatal_signal_pending() in dump_interrupted(), which bypass the normal unkillability of TASK_UNINTERRUPTIBLE, resulting in coredump_finish() invoking wake_up_process() on any threads sleeping in coredump_task_exit().
Therefore, change that TASK_UNINTERRUPTIBLE to TASK_IDLE.
Reported-by: Anhad Jai Singh <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Andrew Morton <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Chris Mason <[email protected]> Cc: Rik van Riel <[email protected]>
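In diff form the change amounts to the following (a sketch of coredump_task_exit()); TASK_IDLE is TASK_UNINTERRUPTIBLE|TASK_NOLOAD, so the sleep stays uninterruptible but is skipped by the hung-task detector and doesn't count toward loadavg:

    -       set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
    +       set_current_state(TASK_IDLE|TASK_FREEZABLE);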
|
| #
d73d0035 |
| 26-Jun-2024 |
Oleg Nesterov <[email protected]> |
memcg: mm_update_next_owner: move for_each_thread() into try_to_set_owner()
mm_update_next_owner() checks the children / real_parent->children to avoid the "everything else" loop in the likely case, but this won't work if a child/sibling has a zombie leader with ->mm == NULL.
Move the for_each_thread() logic into try_to_set_owner(), if nothing else this makes the children/siblings/everything searches more consistent.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jinliang Zheng <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Tycho Andersen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
2a22b773 |
| 26-Jun-2024 |
Oleg Nesterov <[email protected]> |
memcg: mm_update_next_owner: kill the "retry" logic
Add the new helper, try_to_set_owner(), which tries to update mm->owner once we see c->mm == mm. This way mm_update_next_owner() doesn't need to restart the list_for_each_entry/for_each_process loops from the very beginning if it races with exit/exec, it can just continue.
Unlike the current code, try_to_set_owner() re-checks tsk->mm == mm before it drops tasklist_lock, so it doesn't need get/put_task_struct().
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jinliang Zheng <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Tycho Andersen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
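Combining the two changelogs, the helper's shape is roughly (a sketch; the in-tree version differs in detail):

    static bool try_to_set_owner(struct task_struct *g, struct mm_struct *mm)
    {
            struct task_struct *t;

            for_each_thread(g, t) {
                    /* still under tasklist_lock, so t can't go away and
                     * no get/put_task_struct() is needed for the re-check */
                    if (t->mm == mm) {
                            WRITE_ONCE(mm->owner, t);
                            return true;
                    }
                    if (t->mm)
                            break;  /* this group runs on another mm */
            }
            return false;
    }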
|
| #
76ba6acf |
| 20-Jun-2024 |
Jinliang Zheng <[email protected]> |
mm: optimize the redundant loop of mm_update_owner_next()
When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or /proc or ptrace or page migration (get_task_mm()), it is impossible to find an appropriate task_struct in the loop whose mm_struct is the same as the target mm_struct.
If the above race condition is combined with the stress-ng-zombie and stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in write_lock_irq() for tasklist_lock.
Recognize this situation in advance and exit early.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jinliang Zheng <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tycho Andersen <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
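The early exit is a single check inside the expensive scan (a sketch matching the description; the exact placement in mm_update_next_owner() may differ):

    for_each_process(g) {
            /* once the racing swapoff, /proc, ptrace or migration user
             * drops its mm_users reference, no task can have ->mm == mm,
             * so give up instead of walking the rest of the tasklist */
            if (atomic_read(&mm->mm_users) <= 1)
                    break;
            /* ... otherwise look for a thread whose ->mm == mm ... */
    }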
|
| #
cf3f9a59 |
| 20-Jun-2024 |
Jinliang Zheng <[email protected]> |
mm: optimize the redundant loop of mm_update_owner_next()
When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or /proc or ptrace or page migration (get_task_mm()), it is impossible to find an appropriate task_struct in the loop whose mm_struct is the same as the target mm_struct.
If the above race condition is combined with the stress-ng-zombie and stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in write_lock_irq() for tasklist_lock.
Recognize this situation in advance and exit early.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jinliang Zheng <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tycho Andersen <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
bfafe5ef |
| 28-Jun-2024 |
Andrei Vagin <[email protected]> |
seccomp: release task filters when the task exits
Previously, seccomp filters were released in release_task(), which required the process to exit and its zombie to be collected. However, exited threads/processes can't trigger any seccomp events, making it more logical to release filters upon task exits.
This adjustment simplifies scenarios where a parent is tracing its child process. The parent process can now handle all events from a seccomp listening descriptor and then call wait to collect a child zombie.
seccomp_filter_release takes the siglock to avoid races with seccomp_sync_threads. There was an idea to bypass taking the lock by checking PF_EXITING, but that flag can be set without holding siglock if threads have SIGNAL_GROUP_EXIT, which means it can be set concurrently with seccomp_filter_release.
This change also fixes another minor problem. Suppose that a group leader installs the new filter without SECCOMP_FILTER_FLAG_TSYNC, exits, and becomes a zombie. Without this change, SECCOMP_FILTER_FLAG_TSYNC from any other thread can never succeed: seccomp_can_sync_threads() will check a zombie leader and is_ancestor() will fail.
Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Andrei Vagin <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Tycho Andersen <[email protected]> Signed-off-by: Kees Cook <[email protected]>
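The release path described above, sketched (simplified; the real function is in kernel/seccomp.c):

    void seccomp_filter_release(struct task_struct *tsk)
    {
            struct seccomp_filter *orig;

            /* detach under siglock to avoid racing seccomp_sync_threads() */
            spin_lock_irq(&tsk->sighand->siglock);
            orig = tsk->seccomp.filter;
            tsk->seccomp.filter = NULL;
            spin_unlock_irq(&tsk->sighand->siglock);

            /* drop the whole filter chain without the lock held */
            __seccomp_filter_release(orig);
    }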
|
|
Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1 |
|
| #
240a1853 |
| 16-Mar-2024 |
Mike Christie <[email protected]> |
kernel: Remove signal hacks for vhost_tasks
This removes the signal/coredump hacks added for vhost_tasks in:
Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
When that patch was added, vhost_tasks did not handle SIGKILL and would try to ignore/clear the signal and continue on until the device's close function was called. In the previous patches, vhost_tasks and the vhost drivers were converted to support SIGKILL by cleaning themselves up and exiting. The hacks are no longer needed, so this removes them.
Signed-off-by: Mike Christie <[email protected]> Message-Id: <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
Revision tags: v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1 |
|
| #
11a92190 |
| 27-Jun-2023 |
Joel Granados <[email protected]> |
kernel misc: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels), which will reduce the overall build-time size of the kernel and run-time memory bloat by ~64 bytes per sentinel (further information: https://lore.kernel.org/all/ZO5Yx5JFogGi%[email protected]/).
Remove the sentinel from ctl_table arrays. Reduce by one the values used to compare the size of the adjusted arrays.
Signed-off-by: Joel Granados <[email protected]>
|
| #
c1be35a1 |
| 23-Jan-2024 |
Oleg Nesterov <[email protected]> |
exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock)
After the recent changes nobody uses siglock to read the values protected by stats_lock, so we can kill spin_lock_irq(&current->sighand->siglock) and update the comment.
With this patch only __exit_signal() and thread_group_start_cputime() take stats_lock under siglock.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Dylan Hatch <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
e2e8a142 |
| 05-Feb-2024 |
Oleg Nesterov <[email protected]> |
pidfd: exit: kill the no longer used thread_group_exited()
It was used by pidfd_poll() but now it has no callers.
If it finally finds a modular user we can revert this change, but note that the comment above this helper and the changelog in 38fd525a4c61 ("exit: Factor thread_group_exited out of pidfd_poll") are not accurate: thread_group_exited() won't return true if all other threads have passed exit_notify() and are zombies; it returns true only when all other threads are completely gone. Not to mention that it can only work if the task identified by @pid is a thread-group leader.
Signed-off-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Tycho Andersen <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
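For context, the helper being removed looked roughly like this (per its introduction in 38fd525a4c61), which makes the correction above concrete:

    bool thread_group_exited(struct pid *pid)
    {
            struct task_struct *task;
            bool exited;

            rcu_read_lock();
            task = pid_task(pid, PIDTYPE_PID);
            /* true only once the leader is reaped (task == NULL), or is
             * a zombie and every other thread has been fully released */
            exited = !task ||
                     (READ_ONCE(task->exit_state) && thread_group_empty(task));
            rcu_read_unlock();

            return exited;
    }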
|
| #
64bef697 |
| 31-Jan-2024 |
Oleg Nesterov <[email protected]> |
pidfd: implement PIDFD_THREAD flag for pidfd_open()
With this flag:
- pidfd_open() doesn't require that the target task must be a thread-group leader
- pidfd_poll() succeeds when the task exits and becomes a zombie (iow, passes exit_notify()), even if it is a leader and thread-group is not empty.
This means that the behaviour of pidfd_poll(PIDFD_THREAD, pid-of-group-leader) is not well defined if it races with exec() from its sub-thread; pidfd_poll() can succeed or not depending on whether pidfd_task_exited() is called before or after exchange_tids().
Perhaps we can improve this behaviour later, pidfd_poll() can probably take sig->group_exec_task into account. But this doesn't really differ from the case when the leader exits before other threads (so pidfd_poll() succeeds) and then another thread execs and pidfd_poll() will block again.
thread_group_exited() is no longer used, perhaps it can die.
Co-developed-by: Tycho Andersen <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Tested-by: Tycho Andersen <[email protected]> Reviewed-by: Tycho Andersen <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
6dfeff09 |
| 11-Dec-2023 |
Matthew Wilcox (Oracle) <[email protected]> |
wait: Remove uapi header file from main header file
There's really no overlap between uapi/linux/wait.h and linux/wait.h. There are two files which rely on the uapi file being implicitly included, so explicitly include it there and remove it from the main header file.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Kent Overstreet <[email protected]> Reviewed-by: Christian Brauner <[email protected]>
|