History log of /linux-6.15/kernel/sched/core.c (Results 1 – 25 of 1396)
Revision    Date    Author    Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1
# 169eae77 27-Mar-2025 Mathieu Desnoyers <[email protected]>

rseq: Eliminate useless task_work on execve

Eliminate a useless task_work on execve by moving the call to
rseq_set_notify_resume() from sched_mm_cid_after_execve() to the error
path of bprm_execve().

The call to rseq_set_notify_resume() from sched_mm_cid_after_execve() is
pointless in the success case, because rseq_execve() will clear the rseq
pointer before returning to userspace.

sched_mm_cid_after_execve() is called from both the success and error
paths of bprm_execve(). The call to rseq_set_notify_resume() is needed
on error because the mm_cid may have changed.

Also move the rseq_execve() to right after sched_mm_cid_after_execve()
in bprm_execve().
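
For orientation, a minimal sketch of the ordering described above
(distilled from this changelog, not the verbatim kernel code):

/* bprm_execve(), success path: */
sched_mm_cid_after_execve(current);
rseq_execve(current);			/* clears the rseq pointer before userspace */

/* bprm_execve(), error path: */
sched_mm_cid_after_execve(current);
rseq_set_notify_resume(current);	/* mm_cid may have changed on error */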

[ mingo: Merged to a recent upstream kernel, extended the changelog. ]

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Revision tags: v6.14, v6.14-rc7, v6.14-rc6
# 26f80681 05-Mar-2025 Gabriele Monaco <[email protected]>

sched: Add sched tracepoints for RV task model

Add the following tracepoints:
* sched_entry(bool preempt, ip)
Called while entering __schedule
* sched_exit(bool is_switch, ip)
Called while exiting __schedule
* sched_set_state(task, curr_state, state)
Called when a task changes its state (to and from running)

These tracepoints are useful to describe the Linux task model and are
adapted from the patches by Daniel Bristot de Oliveira
(https://bristot.me/linux-task-model/).
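
As a rough illustration of where these fire, the emission sites in
__schedule() would look something like the sketch below; the
trace_*_tp() names follow the kernel's bare-tracepoint convention and
are an assumption here, not confirmed by this log:

/* sketch: assumed _tp-suffixed bare tracepoint names */
static void __sched notrace __schedule(int sched_mode)
{
	trace_sched_entry_tp(preempt, CALLER_ADDR0);

	/* ... pick the next task, possibly context switch ... */

	trace_sched_exit_tp(is_switch, CALLER_ADDR0);
}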

Cc: Ingo Molnar <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/[email protected]
Signed-off-by: Gabriele Monaco <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

# dd5bdaf2 17-Mar-2025 Ingo Molnar <[email protected]>

sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional

All the big Linux distros enable CONFIG_SCHED_DEBUG, because
the various features it provides help not just with kernel
development, but with system administration and user-space
software development as well.

Reflect this reality and enable this functionality
unconditionally.

Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Shrikanth Hegde <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# 57903f72 17-Mar-2025 Ingo Molnar <[email protected]>

sched/debug: Make 'const_debug' tunables unconditional __read_mostly

With CONFIG_SCHED_DEBUG becoming unconditional, remove the
extra 'const_debug' indirection towards __read_mostly.

Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Shrikanth Hegde <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# f7d2728c 17-Mar-2025 Ingo Molnar <[email protected]>

sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()

The scheduler has this special SCHED_WARN_ON() facility that
depends on CONFIG_SCHED_DEBUG.

Since CONFIG_SCHED_DEBUG is getting removed, convert
SCHED_WARN_ON() to WARN_ON_ONCE().

Note that the warning output isn't 100% equivalent:

#define SCHED_WARN_ON(x) WARN_ONCE(x, #x)

Because SCHED_WARN_ON() would output the 'x' condition
as well, while WARN_ONCE() will only show a backtrace.

Hopefully these are rare enough to not really matter.

If it does, we should probably introduce a new WARN_ON()
variant that outputs the condition in stringified form,
or improve WARN_ON() itself.
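
For reference, such a variant could be built directly on the existing
WARN_ONCE(); the name below is hypothetical:

/* hypothetical: warn once and also print the stringified condition */
#define WARN_ON_ONCE_STR(cond)	WARN_ONCE((cond), "%s\n", #cond)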

Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Shrikanth Hegde <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# 2ff899e3 13-Mar-2025 Juri Lelli <[email protected]>

sched/deadline: Rebuild root domain accounting after every update

Rebuilding of root domain accounting information (total_bw) is
currently broken in some cases, e.g. suspend/resume on aarch64. The
problem is that the way we keep track of domain changes and try to add
bandwidth back is convoluted and fragile.

Fix it by simplifying things: make sure bandwidth accounting is cleared
and completely restored after root domain changes (once root domains
are stable again).

To be sure we always call dl_rebuild_rd_accounting() while holding
cpuset_mutex, we also add a cpuset_reset_sched_domains() wrapper.
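
A plausible shape for that wrapper, inferred from the description above
(the body, including the reset call, is an assumption rather than the
verbatim patch):

/* sketch: run the sched-domain reset with cpuset_mutex held */
void cpuset_reset_sched_domains(void)
{
	mutex_lock(&cpuset_mutex);
	partition_sched_domains(1, NULL, NULL);	/* assumed reset call */
	mutex_unlock(&cpuset_mutex);
}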

Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Reported-by: Jon Hunter <[email protected]>
Co-developed-by: Waiman Long <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Tested-by: Dietmar Eggemann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# 56209334 13-Mar-2025 Juri Lelli <[email protected]>

sched/topology: Wrappers for sched_domains_mutex

Create wrappers for sched_domains_mutex so that it can transparently be
used on both CONFIG_SMP and !CONFIG_SMP, as some functions will need to
do.
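
A sketch of what such wrappers would look like; the function names
mirror the mutex and are assumed here:

#ifdef CONFIG_SMP
void sched_domains_mutex_lock(void)
{
	mutex_lock(&sched_domains_mutex);
}

void sched_domains_mutex_unlock(void)
{
	mutex_unlock(&sched_domains_mutex);
}
#else
static inline void sched_domains_mutex_lock(void) { }
static inline void sched_domains_mutex_unlock(void) { }
#endif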

Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Reported-by: Jon Hunter <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Tested-by: Waiman Long <[email protected]>
Tested-by: Jon Hunter <[email protected]>
Tested-by: Dietmar Eggemann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# 8bdc5daa 14-Mar-2025 Sebastian Andrzej Siewior <[email protected]>

sched: Add a generic function to return the preemption string

The individual architectures often add the preemption model to the
beginning of the backtrace. This is the case on X86 or ARM64 for the
"die" case but not for regular warnings. With the addition of
PREEMPT_DYNAMIC for PREEMPT_RT we end up with CONFIG_PREEMPT and
CONFIG_PREEMPT_RT set simultaneously. That means that everyone who
tried to add that piece of information gets it wrong for PREEMPT_RT
because PREEMPT is checked first.

Provide a generic function which returns the current scheduling model
considering LAZY preempt and the current state of PREEMPT_DYNAMIC.

The resulting strings are:
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Model ┃ -RT -DYN ┃ +RT -DYN ┃ -RT +DYN ┃ +RT +DYN ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│NONE │ NONE │ n/a │ PREEMPT(none) │ n/a │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│VOLUNTARY │ VOLUNTARY │ n/a │ PREEMPT(voluntary) │ n/a │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│FULL │ PREEMPT │ PREEMPT_RT │ PREEMPT(full) │ PREEMPT_{RT,full} │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│LAZY │ PREEMPT_LAZY │ PREEMPT_{RT,LAZY} │ PREEMPT(lazy) │ PREEMPT_{RT,lazy} │
└───────────┴──────────────┴───────────────────┴────────────────────┴───────────────────┘

[ The dynamic building of the string can lead to an empty string if the
function is invoked simultaneously on two CPUs. ]
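
Assuming the helper is exposed as preempt_model_str() (the name does
not appear in this log excerpt), a die/warn path would use it roughly
like:

/* assumed helper returning one of the strings from the table above */
pr_warn("preemption model: %s\n", preempt_model_str());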

Co-developed-by: "Peter Zijlstra (Intel)" <[email protected]>
Signed-off-by: "Peter Zijlstra (Intel)" <[email protected]>
Co-developed-by: "Steven Rostedt (Google)" <[email protected]>
Signed-off-by: "Steven Rostedt (Google)" <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Shrikanth Hegde <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# 76f970ce 14-Mar-2025 Dietmar Eggemann <[email protected]>

Revert "sched/core: Reduce cost of sched_move_task when config autogroup"

This reverts commit eff6c8ce8d4d7faef75f66614dd20bb50595d261.

Hazem reported a 30% drop in UnixBench spawn test with commit

Revert "sched/core: Reduce cost of sched_move_task when config autogroup"

This reverts commit eff6c8ce8d4d7faef75f66614dd20bb50595d261.

Hazem reported a 30% drop in UnixBench spawn test with commit
eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
(aarch64) (single level MC sched domain):

https://lkml.kernel.org/r/[email protected]

There is an early bail from sched_move_task() if p->sched_task_group is
equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
(Ubuntu '22.04.5 LTS').

So in:

do_exit()

sched_autogroup_exit_task()

sched_move_task()

if sched_get_task_group(p) == p->sched_task_group
return

/* p is enqueued */
dequeue_task() \
sched_change_group() |
task_change_group_fair() |
detach_task_cfs_rq() | (1)
set_task_rq() |
attach_task_cfs_rq() |
enqueue_task() /

(1) isn't called for p anymore.

Turns out that the regression is related to sgs->group_util in
group_is_overloaded() and group_has_capacity(). If (1) isn't called for
all the 'spawn' tasks then sgs->group_util is ~900 and
sgs->group_capacity = 1024 (single CPU sched domain) and this leads to
group_is_overloaded() returning true (2) and group_has_capacity() false
(3) much more often compared to the case when (1) is called.

I.e. there are much more cases of 'group_is_overloaded' and
'group_fully_busy' in WF_FORK wakeup sched_balance_find_dst_cpu() which
then returns much more often a CPU != smp_processor_id() (5).

This isn't good for these extremely short-running tasks (FORK + EXIT)
and also involves calling sched_balance_find_dst_group_cpu() unnecessarily
(single CPU sched domain).

Instead if (1) is called for 'p->flags & PF_EXITING' then the path
(4),(6) is taken much more often.

select_task_rq_fair(..., wake_flags = WF_FORK)

cpu = smp_processor_id()

new_cpu = sched_balance_find_dst_cpu(..., cpu, ...)

group = sched_balance_find_dst_group(..., cpu)

do {

update_sg_wakeup_stats()

sgs->group_type = group_classify()

if group_is_overloaded() (2)
return group_overloaded

if !group_has_capacity() (3)
return group_fully_busy

return group_has_spare (4)

} while group

if local_sgs.group_type > idlest_sgs.group_type
return idlest (5)

case group_has_spare:

if local_sgs.idle_cpus >= idlest_sgs.idle_cpus
return NULL (6)

Unixbench Tests './Run -c 4 spawn' on:

(a) VM AWS instance (m7gd.16xlarge) with v6.13 ('maxcpus=4 nr_cpus=4')
and Ubuntu 22.04.5 LTS (aarch64).

Shell & test run in '/user.slice/user-1000.slice/session-1.scope'.

w/o patch w/ patch
21005 27120

(b) i7-13700K with tip/sched/core ('nosmt maxcpus=8 nr_cpus=8') and
Ubuntu 22.04.5 LTS (x86_64).

Shell & test run in '/A'.

w/o patch w/ patch
67675 88806

CONFIG_SCHED_AUTOGROUP=y and /proc/sys/kernel/sched_autogroup_enabled
equal to 0 or 1.

Reported-by: Hazem Mohamed Abuelfotoh <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Tested-by: Hagar Hemdan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Revision tags: v6.14-rc5, v6.14-rc4
# 4bc45824 19-Feb-2025 Xuewen Yan <[email protected]>

sched/uclamp: Optimize sched_uclamp_used static key enabling

Repeated calls of static_branch_enable() on an already enabled
static key introduce overhead, because it calls cpus_read_lock().

Users may frequently set the uclamp value of tasks, triggering
repeated enabling of the sched_uclamp_used static key.

Optimize this and avoid repeated calls to static_branch_enable()
by checking whether the key is enabled already.
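
A sketch of the guard, using the uclamp_is_used() helper named by the
follow-up commit below:

/* only flip the static key when it is not already enabled */
if (!uclamp_is_used())
	static_branch_enable(&sched_uclamp_used);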

[ mingo: Rewrote the changelog for legibility ]

Signed-off-by: Xuewen Yan <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Christian Loehle <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# 5fca5a4c 19-Feb-2025 Xuewen Yan <[email protected]>

sched/uclamp: Use the uclamp_is_used() helper instead of open-coding it

Don't open-code static_branch_unlikely(&sched_uclamp_used), we have
the uclamp_is_used() wrapper around it.
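
Per the description, the wrapper amounts to this sketch:

static inline bool uclamp_is_used(void)
{
	return static_branch_unlikely(&sched_uclamp_used);
}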

[ mingo: Clean up the changelog ]

Signed-off-by: Xuewen Yan <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Hongyan Xia <[email protected]>
Reviewed-by: Christian Loehle <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Revision tags: v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4
# 82c387ef 16-Dec-2024 Thomas Gleixner <[email protected]>

sched/core: Prevent rescheduling when interrupts are disabled

David reported a warning observed while loop testing kexec jump:

Interrupts enabled after irqrouter_resume+0x0/0x50
WARNING: CPU: 0 PID: 560 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220
kernel_kexec+0xf6/0x180
__do_sys_reboot+0x206/0x250
do_syscall_64+0x95/0x180

The corresponding interrupt flag trace:

hardirqs last enabled at (15573): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90
hardirqs last disabled at (15580): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90

That means __up_console_sem() was invoked with interrupts enabled. Further
instrumentation revealed that in the interrupt disabled section of kexec
jump one of the syscore_suspend() callbacks woke up a task, which set the
NEED_RESCHED flag. A later callback in the resume path invoked
cond_resched() which in turn led to the invocation of the scheduler:

__cond_resched+0x21/0x60
down_timeout+0x18/0x60
acpi_os_wait_semaphore+0x4c/0x80
acpi_ut_acquire_mutex+0x3d/0x100
acpi_ns_get_node+0x27/0x60
acpi_ns_evaluate+0x1cb/0x2d0
acpi_rs_set_srs_method_data+0x156/0x190
acpi_pci_link_set+0x11c/0x290
irqrouter_resume+0x54/0x60
syscore_resume+0x6a/0x200
kernel_kexec+0x145/0x1c0
__do_sys_reboot+0xeb/0x240
do_syscall_64+0x95/0x180

This is a long-standing problem, which probably got more visible with
the recent printk changes. Something does a task wakeup and the
scheduler sets the NEED_RESCHED flag. cond_resched() sees it set and
invokes schedule() from a completely bogus context. The scheduler
enables interrupts after context switching, which causes the above
warning at the end.

Quite a few of the code paths in syscore_suspend()/resume() can result in
triggering a wakeup with exactly the same consequences. They might not
have done so yet, but as they share a lot of code with normal operations
it's just a question of time.

The problem only affects the PREEMPT_NONE and PREEMPT_VOLUNTARY scheduling
models. Full preemption is not affected as cond_resched() is disabled and
the preemption check preemptible() takes the interrupt disabled flag into
account.

Cure the problem by adding a corresponding check into cond_resched().
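
A hedged sketch of the added check; the exact condition in the merged
fix may differ:

int __sched __cond_resched(void)
{
	/* never call into the scheduler from an interrupts-disabled region */
	if (should_resched(0) && !irqs_disabled()) {
		preempt_schedule_common();
		return 1;
	}
	return 0;
}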

Reported-by: David Woodhouse <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: David Woodhouse <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Closes: https://lore.kernel.org/all/[email protected]

# b796ea84 19-Feb-2025 Thorsten Blum <[email protected]>

sched/core: Remove duplicate included header file stats.h

The header file stats.h is included twice. Remove the redundant include
and the following make includecheck warning:

stats.h is included more than once

Signed-off-by: Thorsten Blum <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

# ee13da87 05-Feb-2025 Nam Cao <[email protected]>

sched: Switch to use hrtimer_setup()

hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.

Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
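
Illustratively, with generic placeholder names (not taken from the
patch):

/* before: two-step init plus open-coded callback assignment */
hrtimer_init(&my_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
my_timer.function = my_timer_fn;

/* after: one call initializes the timer completely */
hrtimer_setup(&my_timer, my_timer_fn, CLOCK_MONOTONIC, HRTIMER_MODE_REL);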

Signed-off-by: Nam Cao <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/a55e849cba3c41b4c5708be6ea6be6f337d1a8fb.1738746821.git.namcao@linutronix.de

# 3539c641 12-Feb-2025 Tejun Heo <[email protected]>

sched_ext: Implement SCX_OPS_ALLOW_QUEUED_WAKEUP

A task wakeup can be either processed on the waker's CPU or bounced to the
wakee's previous CPU using an IPI (ttwu_queue). Bouncing to the wakee's CPU
avoids the waker's CPU locking and accessing the wakee's rq which can be
expensive across cache and node boundaries.

When the ttwu_queue path is taken, select_task_rq() and thus ops.select_cpu()
may be skipped in some cases (racing against the wakee switching out). As
this confused some BPF schedulers, there wasn't a good way for a BPF
scheduler to tell whether idle CPU selection had been skipped, ops.enqueue()
couldn't insert tasks into foreign local DSQs, and the performance
difference on machines with simple topologies was minimal, sched_ext
disabled ttwu_queue.

However, this optimization makes a noticeable difference on more complex
topologies, and a BPF scheduler now has an easy way to tell whether
ops.select_cpu() was skipped since 9b671793c7d9 ("sched_ext, scx_qmap: Add
and use SCX_ENQ_CPU_SELECTED") and can insert tasks into foreign local DSQs
since 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct
dispatches").

Implement SCX_OPS_ALLOW_QUEUED_WAKEUP which allows BPF schedulers to choose
to enable ttwu_queue optimization.
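
On the BPF side, opting in would look roughly like this sketch (the
scheduler name is a placeholder and the other ops are omitted):

SEC(".struct_ops.link")
struct sched_ext_ops example_ops = {
	/* let the kernel keep the ttwu_queue optimization enabled */
	.flags	= SCX_OPS_ALLOW_QUEUED_WAKEUP,
	.name	= "example",
};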

v2: Update the patch description and comment re. ops.select_cpu() being
skipped in some cases as opposed to always as per Neel.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Neel Natu <[email protected]>
Reported-by: Barret Rhoden <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Andrea Righi <[email protected]>

# bcc6244e 29-Jan-2025 Jann Horn <[email protected]>

sched: Clarify wake_up_q()'s write to task->wake_q.next

Clarify that wake_up_q() does an atomic write to task->wake_q.next, after
which a concurrent __wake_q_add() can immediately overwrite
task->wake_q.next again.
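
The write in question, per the description above (whether the final
patch spells it as WRITE_ONCE() is an assumption here):

/* after this store the wake_q entry is immediately reusable:
 * a concurrent __wake_q_add() may overwrite task->wake_q.next */
WRITE_ONCE(task->wake_q.next, NULL);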

Signed-off-by: Jann Horn <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

Revision tags: v6.13-rc3
# 2c00e119 13-Dec-2024 Ankur Arora <[email protected]>

sched: update __cond_resched comment about RCU quiescent states

Update the comment in __cond_resched() clarifying how urgently needed
quiescent states are provided.

Signed-off-by: Ankur Arora <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Boqun Feng <[email protected]>

# 1751f872 28-Jan-2025 Joel Granados <[email protected]>

treewide: const qualify ctl_tables where applicable

Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/infiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.
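
For illustration, a constified table would look like this (hypothetical
table and knob names):

static int example_knob;

static const struct ctl_table example_sysctls[] = {
	{
		.procname	= "example_knob",
		.data		= &example_knob,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
};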

Created this by running an spatch followed by a sed command:
Spatch:
virtual patch

@
depends on !(file in "net")
disable optional_qualifier
@

identifier table_name != {
watchdog_hardlockup_sysctl,
iwcm_ctl_table,
ucma_ctl_table,
memory_allocation_profiling_sysctls,
loadpin_sysctl_table
};
@@

+ const
struct ctl_table table_name [] = { ... };

sed:
sed --in-place \
-e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
kernel/utsname_sysctl.c

Reviewed-by: Song Liu <[email protected]>
Acked-by: Steven Rostedt (Google) <[email protected]> # for kernel/trace/
Reviewed-by: Martin K. Petersen <[email protected]> # SCSI
Reviewed-by: Darrick J. Wong <[email protected]> # xfs
Acked-by: Jani Nikula <[email protected]>
Acked-by: Corey Minyard <[email protected]>
Acked-by: Wei Liu <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Reviewed-by: Bill O'Donnell <[email protected]>
Acked-by: Baoquan He <[email protected]>
Acked-by: Ashutosh Dixit <[email protected]>
Acked-by: Anna Schumaker <[email protected]>
Signed-off-by: Joel Granados <[email protected]>

# d6f3e7d5 24-Jan-2025 Tejun Heo <[email protected]>

sched_ext: Fix incorrect autogroup migration detection

scx_move_task() is called from sched_move_task() and tells the BPF scheduler
that cgroup migration is being committed. sched_move_task() is used by both
cgroup and autogroup migrations and scx_move_task() tried to filter out
autogroup migrations by testing the destination cgroup and PF_EXITING but
this is not enough. In fact, without explicitly tagging the thread which is
doing the cgroup migration, there is no good way to tell apart
scx_move_task() invocations for racing migration to the root cgroup and an
autogroup migration.

This led to scx_move_task() incorrectly ignoring a migration from non-root
cgroup to an autogroup of the root cgroup triggering the following warning:

WARNING: CPU: 7 PID: 1 at kernel/sched/ext.c:3725 scx_cgroup_can_attach+0x196/0x340
...
Call Trace:
<TASK>
cgroup_migrate_execute+0x5b1/0x700
cgroup_attach_task+0x296/0x400
__cgroup_procs_write+0x128/0x140
cgroup_procs_write+0x17/0x30
kernfs_fop_write_iter+0x141/0x1f0
vfs_write+0x31d/0x4a0
__x64_sys_write+0x72/0xf0
do_syscall_64+0x82/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix it by adding an argument to sched_move_task() that indicates whether the
moving is for a cgroup or autogroup migration. After the change,
scx_move_task() is called only for cgroup migrations and renamed to
scx_cgroup_move_task().
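
A sketch of the resulting interface; the argument name is assumed:

/* callers now state whether this is a cgroup or an autogroup move */
static void sched_move_task(struct task_struct *tsk, bool for_autogroup);

sched_move_task(tsk, false);	/* cgroup migration: BPF scheduler is notified */
sched_move_task(tsk, true);	/* autogroup migration: sched_ext callback skipped */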

Link: https://github.com/sched-ext/scx/issues/370
Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: [email protected] # v6.12+
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

# 65ef17aa 10-Jan-2025 Oxana Kharitonova <[email protected]>

hung_task: add task->flags, blocked by coredump to log

Resending this patch as I haven't received feedback on my initial
submission https://lore.kernel.org/all/[email protected]/

For the processes which are terminated abnormally the kernel can provide
a coredump if enabled. When the coredump is performed, the process and
all its threads are put into the D state
(TASK_UNINTERRUPTIBLE | TASK_FREEZABLE).

On the other hand, we have the kernel thread khungtaskd which monitors
processes in the D state. If a task is stuck in the D state for more than
kernel.hung_task_timeout_secs, a hung_task alert appears in the kernel
log.

The higher the memory usage of a process, the longer it takes to create
the coredump and the longer tasks stay in the D state. We see hung_task
alerts for processes with memory usage above 10 GB, although our
kernel.hung_task_timeout_secs is 10 seconds where the default is 120.

Adding information to the log that the task is blocked by a coredump
will help with monitoring. Another approach might be to completely
filter out alerts for such tasks, but in that case we would lose
transparency about what is putting pressure on some system resources,
e.g. we saw an increase in I/O when a coredump occurs due to its
writing to disk.

Additionally, it would be helpful to have task_struct->flags in the log
from the function sched_show_task(). Currently it prints
task_struct->thread_info->flags, which seems misleading as the line
starts with "task:xxxx".

[[email protected]: fix printk control string]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oxana Kharitonova <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Vincent Guittot <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Revision tags: v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7
# 21641bd9 04-Nov-2024 Nicholas Piggin <[email protected]>

lazy tlb: fix hotplug exit race with MMU_LAZY_TLB_SHOOTDOWN

CPU unplug first calls __cpu_disable(), and that's where powerpc calls
cleanup_cpu_mmu_context(), which clears this CPU from mm_cpumask() of all
mms in the system.

However this CPU may still be using a lazy tlb mm, and its mm_cpumask bit
will be cleared from it. The CPU does not switch away from the lazy tlb
mm until arch_cpu_idle_dead() calls idle_task_exit().

If that user mm exits in this window, it will not be subject to the lazy
tlb mm shootdown and may be freed while in use as a lazy mm by the CPU
that is being unplugged.

cleanup_cpu_mmu_context() could be moved later, but it looks better to
move the lazy tlb mm switching earlier. The problem with doing the lazy
mm switching in idle_task_exit() is explained in commit bf2c59fce4074
("sched/core: Fix illegal RCU from offline CPUs"), which added a wart to
switch away from the mm but leave it set in active_mm to be cleaned up
later.

So instead, switch away from the lazy tlb mm at sched_cpu_wait_empty(),
which is the last hotplug state before teardown
(CPUHP_AP_SCHED_WAIT_EMPTY). This CPU will never switch to a user thread
from this point, so it has no chance to pick up a new lazy tlb mm. This
removes the lazy tlb mm handling wart in CPU unplug.

With this, idle_task_exit() is not needed anymore and can be cleaned up.
This leaves the prototype alone, to be cleaned after this change.

[ herton: took the suggestions from https://lore.kernel.org/all/87jzvyprsw.ffs@tglx/
and made adjustments to the initial patch proposed by Nicholas. ]

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 2655421ae69f ("lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme")
Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Herton R. Krzesinski <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michael Ellerman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# d40797d6 22-Nov-2024 Peter Zijlstra <[email protected]>

kasan: make kasan_record_aux_stack_noalloc() the default behaviour

kasan_record_aux_stack_noalloc() was introduced to record a stack trace
without allocating memory in the process. It has been added to callers
which were invoked while a raw_spinlock_t was held. More and more callers
were identified and changed over time. Is it a good thing to have this
while functions try their best to do a lockless setup? The only
downside of having kasan_record_aux_stack() not allocate any memory is
that we end up without a stacktrace if stackdepot runs out of memory and
the same stacktrace was not recorded before. To quote Marco Elver from
https://lore.kernel.org/all/CANpmjNPmQYJ7pv1N3cuU8cP18u7PP_uoZD8YxwZd4jtbof9nVQ@mail.gmail.com/

| I'd be in favor, it simplifies things. And stack depot should be
| able to replenish its pool sufficiently in the "non-aux" cases
| i.e. regular allocations. Worst case we fail to record some
| aux stacks, but I think that's only really bad if there's a bug
| around one of these allocations. In general the probabilities
| of this being a regression are extremely small [...]

Make the kasan_record_aux_stack_noalloc() behaviour default as
kasan_record_aux_stack().

[[email protected]: dressed the diff as patch]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 7cb3007ce2da ("kasan: generic: introduce kasan_record_aux_stack_noalloc()")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]
Reviewed-by: Andrey Konovalov <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: [email protected]
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zqiang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

# 7d9da040 27-Dec-2024 Chengming Zhou <[email protected]>

psi: Fix race when task wakes up before psi_sched_switch() adjusts flags

When running hackbench in a cgroup with bandwidth throttling enabled,
following PSI splat was observed:

psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4

When investigating the series of events leading up to the splat,
following sequence was observed:

[008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
...
[008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
[008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
# CPU8 goes into newidle balance and releases the rq lock
...
# CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
[015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
[015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
[008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...

psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
for the blocked entity. However, with the introduction of DELAY_DEQUEUE,
the blocked task can wake up when newidle balance drops the runqueue lock
during __schedule().

If a task wakes before psi_sched_switch() adjusts the PSI flags, skip
any modifications in psi_enqueue() which would still see the flags of a
running task and not a blocked one. Instead, rely on psi_sched_switch()
to do the right thing.

Since the status returned by try_to_block_task() may no longer be true
by the time schedule reaches psi_sched_switch(), check if the task is
blocked or not using a combination of task_on_rq_queued() and
p->se.sched_delayed checks.
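
Roughly, per the description, the re-check at context-switch time
becomes (a sketch):

/* try_to_block_task()'s earlier answer may be stale; re-evaluate */
bool block = !task_on_rq_queued(prev) || prev->se.sched_delayed;

psi_sched_switch(prev, next, block);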

[ prateek: Commit message, testing, early bailout in psi_enqueue() ]

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") # 1a6151017ee5
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# 763a744e 03-Jan-2025 Yafang Shao <[email protected]>

sched: Don't account irq time if sched_clock_irqtime is disabled

sched_clock_irqtime may be disabled due to the clock source, in which case
IRQ time should not be accounted. Let's add a conditional check to avoid
unnecessary logic.
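
A hedged sketch of such a check; sched_clock_irqtime is the flag named
in the subject, and guarding the accounting path with it is the
described intent:

/* skip IRQ time accounting when the clock source disabled it */
if (!sched_clock_irqtime)
	return;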

Signed-off-by: Yafang Shao <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Michal Koutný <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# 3a9910b5 09-Jan-2025 Changwoo Min <[email protected]>

sched_ext: Implement scx_bpf_now()

Returns a high-performance monotonically non-decreasing clock for the current
CPU. The clock returned is in nanoseconds.

It provides the following properties:

1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently
to account for execution time and track tasks' runtime properties.
Unfortunately, in some hardware platforms, bpf_ktime_get_ns() -- which
eventually reads a hardware timestamp counter -- is neither performant nor
scalable. scx_bpf_now() aims to provide a high-performance clock by
using the rq clock in the scheduler core whenever possible.

2) High enough resolution for the BPF scheduler use cases: In most BPF
scheduler use cases, the required clock resolution is lower than the most
accurate hardware clock (e.g., rdtsc in x86). scx_bpf_now() basically
uses the rq clock in the scheduler core whenever it is valid. It considers
that the rq clock is valid from the time the rq clock is updated
(update_rq_clock) until the rq is unlocked (rq_unpin_lock).

3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now()
guarantees the clock never goes backward when comparing them in the same
CPU. On the other hand, when comparing clocks in different CPUs, there
is no such guarantee -- the clock can go backward. It provides a
monotonically *non-decreasing* clock so that it would provide the same
clock values in two different scx_bpf_now() calls on the same CPU
during the same period in which the rq clock is valid.

An rq clock becomes valid when it is updated using update_rq_clock()
and invalidated when the rq is unlocked using rq_unpin_lock().

Let's suppose the following timeline in the scheduler core:

T1. rq_lock(rq)
T2. update_rq_clock(rq)
T3. a sched_ext BPF operation
T4. rq_unlock(rq)
T5. a sched_ext BPF operation
T6. rq_lock(rq)
T7. update_rq_clock(rq)

For [T2, T4), we consider that rq clock is valid (SCX_RQ_CLK_VALID is
set), so scx_bpf_now() calls during [T2, T4) (including T3) will
return the rq clock updated at T2. For duration [T4, T7), when a BPF
scheduler can still call scx_bpf_now() (T5), we consider the rq clock
is invalid (SCX_RQ_CLK_VALID is unset at T4). So when calling
scx_bpf_now() at T5, we will return a fresh clock value by calling
sched_clock_cpu() internally. Also, to prevent getting outdated rq clocks
from a previous scx scheduler, invalidate all the rq clocks when unloading
a BPF scheduler.

One example of calling scx_bpf_now(), when the rq clock is invalid
(like T5), is in scx_central [1]. The scx_central scheduler uses a BPF
timer for preemptive scheduling. In every msec, the timer callback checks
if the currently running tasks exceed their timeslice. At the beginning of
the BPF timer callback (central_timerfn in scx_central.bpf.c), scx_central
gets the current time. When the BPF timer callback runs, the rq clock could
be invalid, the same as T5. In this case, scx_bpf_now() returns a fresh
clock value rather than returning the old one (T2).

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_central.bpf.c
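
BPF-side usage is then straightforward; deltas taken on one CPU are
safe because the clock is non-decreasing there (a sketch):

/* account a task's runtime inside a BPF scheduler */
u64 started = scx_bpf_now();
/* ... the task runs ... */
u64 delta = scx_bpf_now() - started;	/* never negative on the same CPU */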

Signed-off-by: Changwoo Min <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
