|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1 |
|
| #
b69bb476 |
| 31-Jan-2025 |
Shakeel Butt <[email protected]> |
cgroup: fix race between fork and cgroup.kill
Tejun reported the following race between fork() and cgroup.kill at [1].
Tejun: I was looking at cgroup.kill implementation and wondering whether there could be a race window. So, __cgroup_kill() does the following:
k1. Set CGRP_KILL.
k2. Iterate tasks and deliver SIGKILL.
k3. Clear CGRP_KILL.
The copy_process() does the following:
c1. Copy a bunch of stuff.
c2. Grab siglock.
c3. Check fatal_signal_pending().
c4. Commit to forking.
c5. Release siglock.
c6. Call cgroup_post_fork() which puts the task on the css_set and tests CGRP_KILL.
The intention seems to be that either a forking task gets SIGKILL and terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I don't see what guarantees that k3 can't happen before c6. ie. After a forking task passes c5, k2 can take place and then before the forking task reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child. What am I missing?
This is indeed a race. One way to fix this race is by taking cgroup_threadgroup_rwsem in write mode in __cgroup_kill() as the fork() side takes cgroup_threadgroup_rwsem in read mode from cgroup_can_fork() to cgroup_post_fork(). However that would be heavy handed as this adds one more potential stall scenario for cgroup.kill which is usually called under extreme situation like memory pressure.
To fix this race, let's maintain a sequence number per cgroup which gets incremented on each __cgroup_kill() call. On the fork() side, cgroup_can_fork() will cache the sequence number locally and recheck it against the cgroup's sequence number at the cgroup_post_fork() site. If the sequence numbers mismatch, it means __cgroup_kill() has been called and we should send SIGKILL to the newly created task.
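A minimal sketch of the sequence-counter scheme described above; the field name (kill_seq) and the exact hook points are illustrative rather than a verbatim copy of the patch:

	/* __cgroup_kill(): bump the counter before delivering SIGKILLs. */
	cgrp->kill_seq++;

	/* cgroup_can_fork(): cache the value seen by this fork attempt. */
	kargs->kill_seq = cgrp->kill_seq;

	/* cgroup_post_fork(): a changed value means a kill ran in between,
	 * so the new child must also receive SIGKILL. */
	if (cgrp->kill_seq != kargs->kill_seq)
		do_send_sig_info(SIGKILL, SEND_SIG_NOINFO, child, PIDTYPE_TGID);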
Reported-by: Tejun Heo <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ [1] Fixes: 661ee6280931 ("cgroup: introduce cgroup.kill") Cc: [email protected] # v5.14+ Signed-off-by: Shakeel Butt <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
|
|
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6 |
|
| #
a8532fac |
| 31-Aug-2024 |
Tejun Heo <[email protected]> |
sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable
During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task() on every task. To do this, it does get_task_struct() on each iterated task, drops the lock and then calls ops.init_task().
However, a TASK_DEAD task may already have lost all its usage count and be waiting for an RCU grace period to be freed. If get_task_struct() is called on such a task, use-after-free can happen. To avoid such situations, scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe as they are never going to be scheduled again.
Unfortunately, a racing sched_setscheduler(2) can grab the task before the task is unhashed and then continue to e.g. move the task from RT to SCX after TASK_DEAD is set and ops_enable skipped the task. As the task hasn't gone through scx_ops_init_task(), scx_ops_enable_task() called from switching_to_scx() triggers the following warning:
sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872]
WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0
...
RIP: 0010:scx_ops_enable_task+0x18f/0x1f0
...
  switching_to_scx+0x13/0xa0
  __sched_setscheduler+0x84e/0xa50
  do_sched_setscheduler+0x104/0x1c0
  __x64_sys_sched_setscheduler+0x18/0x30
  do_syscall_64+0x7b/0x140
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
As in the ops_disable path, it just doesn't seem like a good idea to leave any task in an inconsistent state, even when the task is dead. The root cause is ops_enable not being able to tell reliably whether a task is truly dead (no one else is looking at it and it's about to be freed) and was testing TASK_DEAD instead. Fix it by testing the task's usage count directly.
- ops_init no longer ignores TASK_DEAD tasks. As now all users iterate all tasks, @include_dead is removed from scx_task_iter_next_locked() along with dead task filtering.
- tryget_task_struct() is added. Tasks are skipped iff tryget_task_struct() fails.
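A sketch of the assumed shape of the new helper: a reference is taken only if the task's usage count has not already dropped to zero, so a dying task cannot be revived.

	static inline struct task_struct *tryget_task_struct(struct task_struct *t)
	{
		/* refuse to take a reference on a task whose count hit zero */
		return refcount_inc_not_zero(&t->usage) ? t : NULL;
	}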
Signed-off-by: Tejun Heo <[email protected]> Cc: David Vernet <[email protected]> Cc: Peter Zijlstra <[email protected]>
|
|
Revision tags: v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5 |
|
| #
304b3f2b |
| 18-Jun-2024 |
Tejun Heo <[email protected]> |
sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork()
A new BPF extensible sched_class will need more control over the forking process. It wants to be able to fail from sched_cgroup_fork() after the new task's sched_task_group is initialized so that the loaded BPF program can prepare the task once its cgroup association is established, and reject the fork if e.g. allocation fails.
Allow sched_cgroup_fork() to fail by making it return int instead of void and adding sched_cancel_fork() to undo sched_fork() in the error path.
sched_cgroup_fork() doesn't fail yet and this patch shouldn't cause any behavior changes.
v2: Patch description updated to detail the expected use.
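The expected fork-path pattern, sketched with an assumed error label; the exact unwind order in copy_process() may differ:

	retval = sched_cgroup_fork(p, args);
	if (retval)
		goto bad_fork_cancel_cgroup;	/* assumed label name */
	...
bad_fork_cancel_cgroup:
	sched_cancel_fork(p);			/* undo sched_fork() */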
Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: David Vernet <[email protected]> Acked-by: Josh Don <[email protected]> Acked-by: Hao Luo <[email protected]> Acked-by: Barret Rhoden <[email protected]>
|
|
Revision tags: v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6 |
|
| #
1e2f2d31 |
| 15-Dec-2023 |
Kent Overstreet <[email protected]> |
Kill sched.h dependency on rcupdate.h
by moving cond_resched_rcu() to rcupdate_wait.h, we can kill another big sched.h dependency.
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
f9d6966b |
| 11-Dec-2023 |
Kent Overstreet <[email protected]> |
refcount: Split out refcount_types.h
More trimming of sched.h dependencies.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Revision tags: v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2 |
|
| #
5431fdd2 |
| 17-Sep-2023 |
Peter Zijlstra <[email protected]> |
ptrace: Convert ptrace_attach() to use lock guards
Created as testing for the conditional guard infrastructure. Specifically this makes use of the following form:
	scoped_cond_guard (mutex_intr, return -ERESTARTNOINTR,
			   &task->signal->cred_guard_mutex) {
		...
	}
	...
	return 0;
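For comparison, the open-coded shape this roughly replaces (a sketch, not the exact pre-patch ptrace_attach() code):

	if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
		return -ERESTARTNOINTR;
	...
	mutex_unlock(&task->signal->cred_guard_mutex);
	...
	return 0;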
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lkml.kernel.org/r/20231102110706.568467727%40infradead.org
|
|
Revision tags: v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7 |
|
| #
893cdaaa |
| 14-Jun-2023 |
Wander Lairson Costa <[email protected]> |
sched: avoid false lockdep splat in put_task_struct()
In put_task_struct(), a spin_lock is indirectly acquired in the stock kernel. When running the kernel in real-time (RT) configuration, the operation is dispatched to a preemptible context call to ensure guaranteed preemption. However, if PROVE_RAW_LOCK_NESTING is enabled and __put_task_struct() is called while holding a raw_spinlock, lockdep incorrectly reports an "Invalid lock context" in the stock kernel.
This false splat occurs because lockdep is unaware of the different route taken under RT. To address this issue, override the inner wait type to prevent the false lockdep splat.
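The override described above, sketched under the assumption that it uses lockdep's wait-type override map around the preemptible-context call:

	static DEFINE_WAIT_OVERRIDE_MAP(put_task_map, LD_WAIT_SLEEP);

	/* tell lockdep the inner context is allowed to sleep, overriding
	 * what it would otherwise infer from a raw_spinlock holder on !RT */
	lock_map_acquire_try(&put_task_map);
	__put_task_struct(t);
	lock_map_release(&put_task_map);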
Suggested-by: Oleg Nesterov <[email protected]> Suggested-by: Sebastian Andrzej Siewior <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Wander Lairson Costa <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
d243b344 |
| 14-Jun-2023 |
Wander Lairson Costa <[email protected]> |
kernel/fork: beware of __put_task_struct() calling context
Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping locks. Therefore, it can't be called from a non-preemptible context.
One practical example is a splat inside inactive_task_timer(), which is called in an interrupt context:
CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
Call Trace:
 dump_stack_lvl+0x57/0x7d
 mark_lock_irq.cold+0x33/0xba
 mark_lock+0x1e7/0x400
 mark_usage+0x11d/0x140
 __lock_acquire+0x30d/0x930
 lock_acquire.part.0+0x9c/0x210
 rt_spin_lock+0x27/0xe0
 refill_obj_stock+0x3d/0x3a0
 kmem_cache_free+0x357/0x560
 inactive_task_timer+0x1ad/0x340
 __run_hrtimer+0x8a/0x1a0
 __hrtimer_run_queues+0x91/0x130
 hrtimer_interrupt+0x10f/0x220
 __sysvec_apic_timer_interrupt+0x7b/0xd0
 sysvec_apic_timer_interrupt+0x4f/0xd0
 asm_sysvec_apic_timer_interrupt+0x12/0x20
RIP: 0033:0x7fff196bf6f5
Instead of calling __put_task_struct() directly, we defer it using call_rcu(). A more natural approach would use a workqueue, but since in PREEMPT_RT, we can't allocate dynamic memory from atomic context, the code would become more complex because we would need to put the work_struct instance in the task_struct and initialize it when we allocate a new task_struct.
The issue is reproducible with stress-ng:
while true; do
	stress-ng --sched deadline --sched-period 1000000000 \
		--sched-runtime 800000000 --sched-deadline 1000000000 \
		--mmapfork 23 -t 20
done
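A sketch of the deferral (assumed shape): the task_struct's existing rcu_head is reused, so nothing has to be allocated from atomic context.

	void __put_task_struct_rcu_cb(struct rcu_head *rhp)
	{
		struct task_struct *task = container_of(rhp, struct task_struct, rcu);

		__put_task_struct(task);
	}

	static inline void put_task_struct(struct task_struct *t)
	{
		if (!refcount_dec_and_test(&t->usage))
			return;

		/* safe to free directly on !RT, or when preemptible on RT;
		 * otherwise defer to an RCU callback */
		if (!IS_ENABLED(CONFIG_PREEMPT_RT) || preemptible())
			__put_task_struct(t);
		else
			call_rcu(&t->rcu, __put_task_struct_rcu_cb);
	}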
Reported-by: Hu Chunyu <[email protected]> Suggested-by: Oleg Nesterov <[email protected]> Suggested-by: Valentin Schneider <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Wander Lairson Costa <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
|
Revision tags: v6.4-rc6, v6.4-rc5, v6.4-rc4 |
|
| #
54da6a09 |
| 26-May-2023 |
Peter Zijlstra <[email protected]> |
locking: Introduce __cleanup() based infrastructure
Use __attribute__((__cleanup__(func))) to build:
- simple auto-release pointers using __free()
- 'classes' with constructor and destructor semantics for scope-based resource management.
- lock guards based on the above classes.
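Illustrative usage of the resulting primitives (assuming the kfree free-function and the mutex/spinlock guard classes that this infrastructure enables):

	struct foo *p __free(kfree) = kzalloc(sizeof(*p), GFP_KERNEL);
	if (!p)
		return -ENOMEM;			/* p is freed automatically on any return */

	guard(mutex)(&foo_lock);		/* unlocked when the scope ends */

	scoped_guard(spinlock_irqsave, &bar_lock) {
		/* critical section */
	}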
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/20230612093537.614161713%40infradead.org
|
| #
f9010dbd |
| 01-Jun-2023 |
Mike Christie <[email protected]> |
fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
When switching from kthreads to vhost_tasks two bugs were added:
1. The vhost worker tasks now show up as processes, so scripts doing ps or ps a would now incorrectly detect the vhost task as another process.
2. kthreads disabled freezing by setting PF_NOFREEZE, but vhost tasks didn't disable it or add support for it.
To fix both bugs, this switches the vhost task to be a thread in the process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that SIGKILL/STOP support is required because CLONE_THREAD requires CLONE_SIGHAND, which requires those 2 signals to be supported.
This is a modified version of the patch written by Mike Christie <[email protected]> which was a modified version of patch originally written by Linus.
Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER, including ignoring signals, setting up the register state, and having get_signal return instead of calling do_group_exit.
Tidied up the vhost_task abstraction so that the definition of vhost_task only needs to be visible inside of vhost_task.c, making it easier to review the code and tell what needs to be done where. As part of this the main loop has been moved from vhost_worker into vhost_task_fn. vhost_worker now returns true if work was done.
The main loop has been updated to call get_signal which handles SIGSTOP, freezing, and collects the message that tells the thread to exit as part of process exit. This collection clears __fatal_signal_pending. This collection is not guaranteed to clear signal_pending() so clear that explicitly so the schedule() sleeps.
For now the vhost thread continues to exist and run work until the last file descriptor is closed and the release function is called as part of freeing struct file. To avoid hangs in the coredump rendezvous and when killing threads in a multi-threaded exec, the coredump code and de_thread have been modified to ignore vhost threads.
Removing the special case for exec appears to require teaching vhost_dev_flush how to directly complete transactions in case the vhost thread is no longer running.
Removing the special case for coredump rendezvous requires either the above fix needed for exec or moving the coredump rendezvous into get_signal.
Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") Signed-off-by: Eric W. Biederman <[email protected]> Co-developed-by: Mike Christie <[email protected]> Signed-off-by: Mike Christie <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|
|
Revision tags: v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2 |
|
| #
89c8e98d |
| 10-Mar-2023 |
Mike Christie <[email protected]> |
fork: allow kernel code to call copy_process
The next patch adds helpers like create_io_thread, but for use by the vhost layer. There are several functions, so they are in their own file instead of cluttering up fork.c. This patch allows that new file to call copy_process.
Signed-off-by: Mike Christie <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
|
| #
09471758 |
| 10-Mar-2023 |
Mike Christie <[email protected]> |
fork: Add kernel_clone_args flag to ignore signals
Since:
commit 10ab825bdef8 ("change kernel threads to ignore signals instead of blocking them")
kthreads have been ignoring signals by default, and the vhost layer has never had a need to change that. This patch adds an option flag, USER_WORKER_SIG_IGN, handled in copy_process() after copy_sighand() and copy_signals() so vhost_tasks added in the next patches can continue to ignore signals.
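A sketch of the intended handling; the flag's spelling inside kernel_clone_args is an assumption based on the commit message:

	/* in copy_process(), after copy_sighand() and copy_signal(): */
	if (args->ignore_signals)
		ignore_signals(p);	/* worker starts out ignoring all signals */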
Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Mike Christie <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
|
| #
11f3f500 |
| 10-Mar-2023 |
Mike Christie <[email protected]> |
fork: add kernel_clone_args flag to not dup/clone files
Each vhost device gets a thread that is used to perform IO and management operations. Instead of a thread that is accessing a device, the thread is part of the device, so when it creates a thread using a helper based on copy_process we can't dup or clone the parent's files/FDs because it would take an extra reference on ourselves.
Later, when we do:
Qemu process exits: do_exit -> exit_files -> put_files_struct -> close_files
we would leak the device's resources because of that extra refcount on the fd or files_struct.
This patch adds a no_files option so these worker threads can prevent taking an extra refcount on themselves.
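A sketch of the assumed shape of the change: copy_files() learns to skip the parent's files entirely.

	static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
			      int no_files)
	{
		if (no_files) {
			tsk->files = NULL;	/* worker owns no files of its own */
			return 0;
		}
		/* otherwise share or dup the parent's files_struct as before */
		...
	}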
Signed-off-by: Mike Christie <[email protected]> Acked-by: Christian Brauner <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
|
| #
54e6842d |
| 10-Mar-2023 |
Mike Christie <[email protected]> |
fork/vm: Move common PF_IO_WORKER behavior to new flag
This adds a new flag, PF_USER_WORKER, that's used for behavior common to both PF_IO_WORKER and users like vhost which will use a new helper instead of create_io_thread because they require different behavior for operations like signal handling.
The common behavior PF_USER_WORKER covers is the vm reclaim handling.
Signed-off-by: Mike Christie <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
|
| #
c81cc581 |
| 10-Mar-2023 |
Mike Christie <[email protected]> |
kernel: Make io_thread and kthread bit fields
We only set args->io_thread/kthread to 0 or 1 then test if they are set, so make them bit fields.
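The resulting layout, abridged (surrounding members omitted):

	struct kernel_clone_args {
		u64 flags;
		...
		u32 io_thread:1;	/* previously full-width ints */
		u32 kthread:1;
		...
	};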
Signed-off-by: Mike Christie <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
|
| #
cf587db2 |
| 10-Mar-2023 |
Mike Christie <[email protected]> |
kernel: Allow a kernel thread's name to be set in copy_process
This patch allows kernel users to pass in the thread name so it can be set during creation instead of having to use set_task_comm after the thread is created.
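A sketch of the assumed plumbing: a name member in kernel_clone_args that copy_process() copies into the new task's comm.

	/* in copy_process(), after the task_struct is set up: */
	if (args->name)
		strscpy_pad(p->comm, args->name, sizeof(p->comm));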
Signed-off-by: Mike Christie <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
|
|
Revision tags: v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3 |
|
| #
3f4c8211 |
| 25-Oct-2022 |
Peter Zijlstra <[email protected]> |
x86/mm: Use mm_alloc() in poking_init()
Instead of duplicating init_mm, allocate a fresh mm. The advantage is that mm_alloc() has much simpler dependencies. Additionally it makes more conceptual sense: init_mm has no (and must not have) user state to duplicate.
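The change in poking_init(), sketched; the pre-change helper name (copy_init_mm) is recalled from context and may differ:

	/* before */
	poking_mm = copy_init_mm();
	/* after */
	poking_mm = mm_alloc();
	BUG_ON(!poking_mm);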
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
|
| #
af806027 |
| 25-Oct-2022 |
Peter Zijlstra <[email protected]> |
mm: Move mm_cachep initialization to mm_init()
In order to allow using mm_alloc() much earlier, move initializing mm_cachep into mm_init().
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
|
|
Revision tags: v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2 |
|
| #
2be9880d |
| 19-Aug-2022 |
Kefeng Wang <[email protected]> |
kernel: exit: cleanup release_thread()
Only x86 has its own release_thread(); introduce a new weak release_thread() function to clean up the empty definitions in the other architectures.
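The weak default presumably looks like this, letting x86 keep its own override:

	void __weak release_thread(struct task_struct *dead_task)
	{
	}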
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Acked-by: Guo Ren <[email protected]> [csky] Acked-by: Russell King (Oracle) <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Acked-by: Brian Cain <[email protected]> Acked-by: Michael Ellerman <[email protected]> [powerpc] Acked-by: Stafford Horne <[email protected]> [openrisc] Acked-by: Catalin Marinas <[email protected]> [arm64] Acked-by: Huacai Chen <[email protected]> [LoongArch] Cc: Alexander Gordeev <[email protected]> Cc: Anton Ivanov <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Guo Ren <[email protected]> [csky] Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: James Bottomley <[email protected]> Cc: Johannes Berg <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Stefan Kristiansson <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Will Deacon <[email protected]> Cc: Xuerui Wang <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7 |
|
| #
d5b36a4d |
| 11-Jul-2022 |
Oleg Nesterov <[email protected]> |
fix race between exit_itimers() and /proc/pid/timers
As Chris explains, the comment above exit_itimers() is not correct, we can race with proc_timers_seq_ops. Change exit_itimers() to clear signal->posix_timers with ->siglock held.
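A sketch of the locking change (assumed shape): detach the list under ->siglock, then free the timers outside it so the /proc reader can no longer see half-torn-down entries.

	void exit_itimers(struct task_struct *tsk)
	{
		struct list_head timers;
		struct k_itimer *tmr;

		if (list_empty(&tsk->signal->posix_timers))
			return;

		/* protect against concurrent read via /proc/pid/timers */
		spin_lock_irq(&tsk->sighand->siglock);
		list_replace_init(&tsk->signal->posix_timers, &timers);
		spin_unlock_irq(&tsk->sighand->siglock);

		/* the timers are no longer reachable, free them outside the lock */
		while (!list_empty(&timers)) {
			tmr = list_first_entry(&timers, struct k_itimer, list);
			itimer_delete(tmr);
		}
	}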
Cc: <[email protected]> Reported-by: [email protected] Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|
|
Revision tags: v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3 |
|
| #
5bd2e97c |
| 12-Apr-2022 |
Eric W. Biederman <[email protected]> |
fork: Generalize PF_IO_WORKER handling
Add fn and fn_arg members into struct kernel_clone_args and test for them in copy_thread (instead of testing for PF_KTHREAD | PF_IO_WORKER). This allows any task that wants to be a user space task that only runs in kernel mode to use this functionality.
The code on x86 is an exception and still retains a PF_KTHREAD test because x86, unlike everything else, handles kthreads slightly differently than user space tasks that start with a function.
The functions that created tasks that start with a function have been updated to set ".fn" and ".fn_arg" instead of ".stack" and ".stack_size". These functions are fork_idle(), create_io_thread(), kernel_thread(), and user_mode_thread().
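One of the converted callers, sketched; the initializer details are assumptions based on the commit message:

	pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
	{
		struct kernel_clone_args args = {
			.flags		= ((lower_32_bits(flags) | CLONE_VM |
					    CLONE_UNTRACED) & ~CSIGNAL),
			.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
			.fn		= fn,		/* was .stack */
			.fn_arg		= arg,		/* was .stack_size */
			.kthread	= 1,
		};

		return kernel_clone(&args);
	}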
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
|
| #
36cb0e1c |
| 11-Apr-2022 |
Eric W. Biederman <[email protected]> |
fork: Explicitly test for idle tasks in copy_thread
The architectures ia64 and parisc have special handling for the idle thread in copy_process. Add a flag named idle to kernel_clone_args and use it to explicitly test if an idle process is being created.
Fulfill the expectations of the rest of the copy_thread implementations and pass a function pointer in .stack from fork_idle(). This makes what is happening in copy_thread better defined, and is useful to make idle threads less special.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
Revision tags: v5.18-rc2 |
|
| #
c5febea0 |
| 08-Apr-2022 |
Eric W. Biederman <[email protected]> |
fork: Pass struct kernel_clone_args into copy_thread
With io_uring we have started supporting tasks that are for most purposes user space tasks that exclusively run code in kernel mode.
The kernel task that exec's init and tasks that exec user mode helpers are also user mode tasks that just run kernel code until they call kernel execve.
Pass kernel_clone_args into copy_thread so these oddball tasks can be supported more cleanly and easily.
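The converted prototype, per the commit description (per-arch implementations differ only in the body):

	int copy_thread(struct task_struct *p, const struct kernel_clone_args *args);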
v2: Fix spelling of kenrel_clone_args on h8300 Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
|
| #
343f4c49 |
| 11-Apr-2022 |
Eric W. Biederman <[email protected]> |
kthread: Don't allocate kthread_struct for init and umh
If kthread_is_per_cpu runs concurrently with free_kthread_struct the kthread_struct that was just freed may be read from.
This bug was introduced by commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads"), when kthread_struct started to be allocated for all tasks that have PF_KTHREAD set. This in turn required the kthread_struct to be freed in kernel_execve and violated the assumption that kthread_struct will have the same lifetime as the task.
Looking a bit deeper, this only applies to callers of kernel_execve, which is just the init process and the user mode helper processes. These processes really don't want to be kernel threads but are for historical reasons. Mostly, copy_thread does not know how to hand a kernel mode start function to a process unless PF_KTHREAD or PF_IO_WORKER is set.
Solve this by not allocating kthread_struct for the init process and the user mode helper processes.
This is done by adding a kthread member to struct kernel_clone_args. Setting kthread in fork_idle and kernel_thread. Adding user_mode_thread that works like kernel_thread except it does not set kthread. In fork only allocating the kthread_struct if .kthread is set.
I have looked at kernel/kthread.c and since commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") there have been no assumptions added that to_kthread or __to_kthread will not return NULL.
There are a few callers of to_kthread or __to_kthread that assume a non-NULL struct kthread pointer will be returned. These functions are kthread_data(), kthread_parkme(), kthread_exit(), kthread(), kthread_park(), kthread_unpark(), kthread_stop(). All of those functions can reasonably be expected to be called only when it is known that a task is a kthread, so that assumption seems reasonable.
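A sketch of the new helper; since fn/fn_arg were introduced by a later patch, the start function is assumed to still travel in .stack/.stack_size at this point in history:

	pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags)
	{
		struct kernel_clone_args args = {
			.flags		= ((lower_32_bits(flags) | CLONE_VM |
					    CLONE_UNTRACED) & ~CSIGNAL),
			.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
			.stack		= (unsigned long)fn,
			.stack_size	= (unsigned long)arg,
			/* unlike kernel_thread()/fork_idle(), no .kthread = 1,
			 * so init and the user mode helpers get no struct kthread */
		};

		return kernel_clone(&args);
	}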
Cc: [email protected] Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") Reported-by: Максим Кутявин <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
Revision tags: v5.18-rc1, v5.17, v5.17-rc8 |
|
| #
eae654f1 |
| 08-Mar-2022 |
Peter Zijlstra <[email protected]> |
exit: Mark do_group_exit() __noreturn
vmlinux.o: warning: objtool: get_signal()+0x108: unreachable instruction
0000 000000000007f930 <get_signal>:
...
0103    7fa33:  e8 00 00 00 00          call   7fa38 <get_signal+0x108>
                7fa34: R_X86_64_PLT32   do_group_exit-0x4
0108    7fa38:  41 8b 45 74             mov    0x74(%r13),%eax
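The annotation itself is a one-liner on the declaration, sketched here:

	void __noreturn do_group_exit(int exit_code);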
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|