History log of /linux-6.15/kernel/sched/debug.c (Results 1 – 25 of 160)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7
# 56209334 13-Mar-2025 Juri Lelli <[email protected]>

sched/topology: Wrappers for sched_domains_mutex

Create wrappers for sched_domains_mutex so that it can transparently be
used on both CONFIG_SMP and !CONFIG_SMP, as some function will need to
do.

F

sched/topology: Wrappers for sched_domains_mutex

Create wrappers for sched_domains_mutex so that it can transparently be
used on both CONFIG_SMP and !CONFIG_SMP, as some function will need to
do.

Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Reported-by: Jon Hunter <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Tested-by: Waiman Long <[email protected]>
Tested-by: Jon Hunter <[email protected]>
Tested-by: Dietmar Eggemann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 8bdc5daa 14-Mar-2025 Sebastian Andrzej Siewior <[email protected]>

sched: Add a generic function to return the preemption string

The individual architectures often add the preemption model to the begin
of the backtrace. This is the case on X86 or ARM64 for the "die

sched: Add a generic function to return the preemption string

The individual architectures often add the preemption model to the begin
of the backtrace. This is the case on X86 or ARM64 for the "die" case
but not for regular warning. With the addition of DYNAMIC_PREEMPT for
PREEMPT_RT we end up with CONFIG_PREEMPT and CONFIG_PREEMPT_RT set
simultaneously. That means that everyone who tried to add that piece of
information gets it wrong for PREEMPT_RT because PREEMPT is checked
first.

Provide a generic function which returns the current scheduling model
considering LAZY preempt and the current state of PREEMPT_DYNAMIC.

The resulting strings are:
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Model ┃ -RT -DYN ┃ +RT -DYN ┃ -RT +DYN ┃ +RT +DYN ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│NONE │ NONE │ n/a │ PREEMPT(none) │ n/a │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│VOLUNTARY │ VOLUNTARY │ n/a │ PREEMPT(voluntary) │ n/a │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│FULL │ PREEMPT │ PREEMPT_RT │ PREEMPT(full) │ PREEMPT_{RT,full} │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│LAZY │ PREEMPT_LAZY │ PREEMPT_{RT,LAZY} │ PREEMPT(lazy) │ PREEMPT_{RT,lazy} │
└───────────┴──────────────┴───────────────────┴────────────────────┴───────────────────┘

[ The dynamic building of the string can lead to an empty string if the
function is invoked simultaneously on two CPUs. ]

Co-developed-by: "Peter Zijlstra (Intel)" <[email protected]>
Signed-off-by: "Peter Zijlstra (Intel)" <[email protected]>
Co-developed-by: "Steven Rostedt (Google)" <[email protected]>
Signed-off-by: "Steven Rostedt (Google)" <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Shrikanth Hegde <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


Revision tags: v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1
# 9065ce69 29-Jan-2025 Christian Loehle <[email protected]>

sched/debug: Provide slice length for fair tasks

Since commit:

857b158dc5e8 ("sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion")

... we have the userspace per-task tuna

sched/debug: Provide slice length for fair tasks

Since commit:

857b158dc5e8 ("sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion")

... we have the userspace per-task tunable slice length, which is
a key parameter that is otherwise difficult to obtain, so provide
it in /proc/$PID/sched.

[ mingo: Clarified the changelog. ]

Signed-off-by: Christian Loehle <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


Revision tags: v6.13, v6.13-rc7
# 8061b9f5 10-Jan-2025 David Rientjes <[email protected]>

sched/debug: Change need_resched warnings to pr_err

need_resched warnings, if enabled, are treated as WARNINGs. If
kernel.panic_on_warn is enabled, then this causes a kernel panic.

It's highly unl

sched/debug: Change need_resched warnings to pr_err

need_resched warnings, if enabled, are treated as WARNINGs. If
kernel.panic_on_warn is enabled, then this causes a kernel panic.

It's highly unlikely that a panic is desired for these warnings, only a
stack trace is normally required to debug and resolve.

Thus, switch need_resched warnings to simply be a printk with an
associated stack trace so they are no longer in scope for panic_on_warn.

Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Madadi Vineeth Reddy <[email protected]>
Acked-by: Josh Don <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

show more ...


Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2
# 736c55a0 02-Dec-2024 Vincent Guittot <[email protected]>

sched/fair: Rename cfs_rq.nr_running into nr_queued

Rename cfs_rq.nr_running into cfs_rq.nr_queued which better reflects the
reality as the value includes both the ready to run tasks and the delayed

sched/fair: Rename cfs_rq.nr_running into nr_queued

Rename cfs_rq.nr_running into cfs_rq.nr_queued which better reflects the
reality as the value includes both the ready to run tasks and the delayed
dequeue tasks.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 43eef7c3 02-Dec-2024 Vincent Guittot <[email protected]>

sched/fair: Remove unused cfs_rq.idle_nr_running

cfs_rq.idle_nr_running field is not used anywhere so we can remove the
useless associated computation. Last user went in commit 5e963f2bd465
("sched/

sched/fair: Remove unused cfs_rq.idle_nr_running

cfs_rq.idle_nr_running field is not used anywhere so we can remove the
useless associated computation. Last user went in commit 5e963f2bd465
("sched/fair: Commit to EEVDF").

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 31898e7b 02-Dec-2024 Vincent Guittot <[email protected]>

sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle

Use same naming convention as others starting with h_nr_* and rename
idle_h_nr_running into h_nr_idle.
The "running" is not correct anymore

sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle

Use same naming convention as others starting with h_nr_* and rename
idle_h_nr_running into h_nr_idle.
The "running" is not correct anymore as it includes delayed dequeue tasks
as well.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 9216582b 02-Dec-2024 Vincent Guittot <[email protected]>

sched/fair: Removed unsued cfs_rq.h_nr_delayed

h_nr_delayed is not used anymore. We now have:
- h_nr_runnable which tracks tasks ready to run
- h_nr_queued which tracks enqueued tasks either ready

sched/fair: Removed unsued cfs_rq.h_nr_delayed

h_nr_delayed is not used anymore. We now have:
- h_nr_runnable which tracks tasks ready to run
- h_nr_queued which tracks enqueued tasks either ready to run or
delayed dequeue

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# c2a295bf 02-Dec-2024 Vincent Guittot <[email protected]>

sched/fair: Add new cfs_rq.h_nr_runnable

With delayed dequeued feature, a sleeping sched_entity remains queued in
the rq until its lag has elapsed. As a result, it stays also visible
in the statisti

sched/fair: Add new cfs_rq.h_nr_runnable

With delayed dequeued feature, a sleeping sched_entity remains queued in
the rq until its lag has elapsed. As a result, it stays also visible
in the statistics that are used to balance the system and in particular
the field cfs.h_nr_queued when the sched_entity is associated to a task.

Create a new h_nr_runnable that tracks only queued and runnable tasks.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 7b8a702d 02-Dec-2024 Vincent Guittot <[email protected]>

sched/fair: Rename h_nr_running into h_nr_queued

With delayed dequeued feature, a sleeping sched_entity remains queued
in the rq until its lag has elapsed but can't run.
Rename h_nr_running into h_n

sched/fair: Rename h_nr_running into h_nr_queued

With delayed dequeued feature, a sleeping sched_entity remains queued
in the rq until its lag has elapsed but can't run.
Rename h_nr_running into h_nr_queued to reflect this new behavior.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 76f2f783 02-Dec-2024 Peter Zijlstra <[email protected]>

sched/eevdf: More PELT vs DELAYED_DEQUEUE

Vincent and Dietmar noted that while
commit fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") fixes
the entity runnable stats, it does not adjust

sched/eevdf: More PELT vs DELAYED_DEQUEUE

Vincent and Dietmar noted that while
commit fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") fixes
the entity runnable stats, it does not adjust the cfs_rq runnable stats,
which are based off of h_nr_running.

Track h_nr_delayed such that we can discount those and adjust the
signal.

Fixes: fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE")
Closes: https://lore.kernel.org/lkml/[email protected]/
Closes: https://lore.kernel.org/lkml/CAKfTPtCNUvWE_GX5LyvTF-WdxUT=ZgvZZv-4t=eWntg5uOFqiQ@mail.gmail.com/
[ Fixes checkpatch warnings and rebased ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reported-by: Dietmar Eggemann <[email protected]>
Reported-by: Vincent Guittot <[email protected]>
Signed-off-by: "Peter Zijlstra (Intel)" <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Tested-by: K Prateek Nayak <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


Revision tags: v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2
# 35772d62 04-Oct-2024 Peter Zijlstra <[email protected]>

sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT

In order to enable PREEMPT_DYNAMIC for PREEMPT_RT, remove PREEMPT_RT
from the 'Preemption Model' choice. Strictly speaking PREEMPT_RT is
not a change in

sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT

In order to enable PREEMPT_DYNAMIC for PREEMPT_RT, remove PREEMPT_RT
from the 'Preemption Model' choice. Strictly speaking PREEMPT_RT is
not a change in how preemption works, but rather it makes a ton more
code preemptible.

Notably, take away NONE and VOLUNTARY options for PREEMPT_RT, they make
no sense (but are techincally possible).

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

show more ...


# 7c70cb94 04-Oct-2024 Peter Zijlstra <[email protected]>

sched: Add Lazy preemption model

Change fair to use resched_curr_lazy(), which, when the lazy
preemption model is selected, will set TIF_NEED_RESCHED_LAZY.

This LAZY bit will be promoted to the ful

sched: Add Lazy preemption model

Change fair to use resched_curr_lazy(), which, when the lazy
preemption model is selected, will set TIF_NEED_RESCHED_LAZY.

This LAZY bit will be promoted to the full NEED_RESCHED bit on tick.
As such, the average delay between setting LAZY and actually
rescheduling will be TICK_NSEC/2.

In short, Lazy preemption will delay preemption for fair class but
will function as Full preemption for all the other classes, most
notably the realtime (RR/FIFO/DEADLINE) classes.

The goal is to bridge the performance gap with Voluntary, such that we
might eventually remove that option entirely.

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

show more ...


Revision tags: v6.12-rc1, v6.11, v6.11-rc7
# 2cab4bd0 06-Sep-2024 Huang Shijie <[email protected]>

sched/debug: Fix the runnable tasks output

The current runnable tasks output looks like:

runnable tasks:
S task PID tree-key switches prio wait-time sum-

sched/debug: Fix the runnable tasks output

The current runnable tasks output looks like:

runnable tasks:
S task PID tree-key switches prio wait-time sum-exec sum-sleep
-------------------------------------------------------------------------------------------------------------
Ikworker/R-rcu_g 4 0.129049 E 0.620179 0.750000 0.002920 2 100 0.000000 0.002920 0.000000 0.000000 0 0 /
Ikworker/R-sync_ 5 0.125328 E 0.624147 0.750000 0.001840 2 100 0.000000 0.001840 0.000000 0.000000 0 0 /
Ikworker/R-slub_ 6 0.120835 E 0.628680 0.750000 0.001800 2 100 0.000000 0.001800 0.000000 0.000000 0 0 /
Ikworker/R-netns 7 0.114294 E 0.634701 0.750000 0.002400 2 100 0.000000 0.002400 0.000000 0.000000 0 0 /
I kworker/0:1 9 508.781746 E 511.754666 3.000000 151.575240 224 120 0.000000 151.575240 0.000000 0.000000 0 0 /

Which is messy. Remove the duplicate printing of sum_exec_runtime and
tidy up the layout to make it look like:

runnable tasks:
S task PID vruntime eligible deadline slice sum-exec switches prio wait-time sum-sleep sum-block node group-id group-path
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
I kworker/0:3 1698 295.001459 E 297.977619 3.000000 38.862920 9 120 0.000000 0.000000 0.000000 0 0 /
I kworker/0:4 1702 278.026303 E 281.026303 3.000000 9.918760 3 120 0.000000 0.000000 0.000000 0 0 /
S NetworkManager 2646 0.377936 E 2.598104 3.000000 98.535880 314 120 0.000000 0.000000 0.000000 0 0 /system.slice/NetworkManager.service
S virtqemud 2689 0.541016 E 2.440104 3.000000 50.967960 80 120 0.000000 0.000000 0.000000 0 0 /system.slice/virtqemud.service
S gsd-smartcard 3058 73.604144 E 76.475904 3.000000 74.033320 88 120 0.000000 0.000000 0.000000 0 0 /user.slice/user-42.slice/session-c1.scope

Reviewed-by: Christoph Lameter (Ampere) <[email protected]>
Signed-off-by: Huang Shijie <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

show more ...


Revision tags: v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4
# 857b158d 22-May-2023 Peter Zijlstra <[email protected]>

sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

Allow applications to directly set a suggested request/slice length using
sched_attr::sched_runtime.

The implementation cl

sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

Allow applications to directly set a suggested request/slice length using
sched_attr::sched_runtime.

The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100.

Applications should strive to use their periodic runtime at a high
confidence interval (95%+) as the target slice. Using a smaller slice
will introduce undue preemptions, while using a larger value will
increase latency.

For all the following examples assume a scheduling quantum of 8, and for
consistency all examples have W=4:

{A,B,C,D}(w=1,r=8):

ABCD...
+---+---+---+---

t=0, V=1.5 t=1, V=3.5
A |------< A |------<
B |------< B |------<
C |------< C |------<
D |------< D |------<
---+*------+-------+--- ---+--*----+-------+---

t=2, V=5.5 t=3, V=7.5
A |------< A |------<
B |------< B |------<
C |------< C |------<
D |------< D |------<
---+----*--+-------+--- ---+------*+-------+---

Note: 4 identical tasks in FIFO order

~~~

{A,B}(w=1,r=16) C(w=2,r=16)

AACCBBCC...
+---+---+---+---

t=0, V=1.25 t=2, V=5.25
A |--------------< A |--------------<
B |--------------< B |--------------<
C |------< C |------<
---+*------+-------+--- ---+----*--+-------+---

t=4, V=8.25 t=6, V=12.25
A |--------------< A |--------------<
B |--------------< B |--------------<
C |------< C |------<
---+-------*-------+--- ---+-------+---*---+---

Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2
task doesn't go below q.

Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length.

Note: the period of the heavy task is half the full period at:
W*(r_i/w_i) = 4*(2q/2) = 4q

~~~

{A,C,D}(w=1,r=16) B(w=1,r=8):

BAACCBDD...
+---+---+---+---

t=0, V=1.5 t=1, V=3.5
A |--------------< A |---------------<
B |------< B |------<
C |--------------< C |--------------<
D |--------------< D |--------------<
---+*------+-------+--- ---+--*----+-------+---

t=3, V=7.5 t=5, V=11.5
A |---------------< A |---------------<
B |------< B |------<
C |--------------< C |--------------<
D |--------------< D |--------------<
---+------*+-------+--- ---+-------+--*----+---

t=6, V=13.5
A |---------------<
B |------<
C |--------------<
D |--------------<
---+-------+----*--+---

Note: 1 short task -- again double r so that the deadline of the short task
won't be below q. Made B short because its not the leftmost task, but is
eligible with the 0,1,2,3 spread.

Note: like with the heavy task, the period of the short task observes:
W*(r_i/w_i) = 4*(1q/1) = 4q

~~~

A(w=1,r=16) B(w=1,r=8) C(w=2,r=16)

BCCAABCC...
+---+---+---+---

t=0, V=1.25 t=1, V=3.25
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+*------+-------+--- ---+--*----+-------+---

t=3, V=7.25 t=5, V=11.25
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+------*+-------+--- ---+-------+--*----+---

t=6, V=13.25
A |--------------<
B |------<
C |------<
---+-------+----*--+---

Note: 1 heavy and 1 short task -- combine them all.

Note: both the short and heavy task end up with a period of 4q

~~~

A(w=1,r=16) B(w=2,r=16) C(w=1,r=8)

BBCAABBC...
+---+---+---+---

t=0, V=1 t=2, V=5
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+*------+-------+--- ---+----*--+-------+---

t=3, V=7 t=5, V=11
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+------*+-------+--- ---+-------+--*----+---

t=7, V=15
A |--------------<
B |------<
C |------<
---+-------+------*+---

Note: as before but permuted

~~~

From all this it can be deduced that, for the steady state:

- the total period (P) of a schedule is: W*max(r_i/w_i)
- the average period of a task is: W*(r_i/w_i)
- each task obtains the fair share: w_i/W of each full period P

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Valentin Schneider <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

show more ...


# 4ae0c2b9 01-Aug-2024 Dan Carpenter <[email protected]>

sched/debug: Fix fair_server_period_max value

This code has an integer overflow or sign extension bug which was caught
by gcc-13:

kernel/sched/debug.c:341:57: error: integer overflow in expressio

sched/debug: Fix fair_server_period_max value

This code has an integer overflow or sign extension bug which was caught
by gcc-13:

kernel/sched/debug.c:341:57: error: integer overflow in expression of
type 'long int' results in '-100663296' [-Werror=overflow]
341 | static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */

The result is that "fair_server_period_max" is set to 0xfffffffffa000000
(585 years) instead of instead of 0xfa000000 (4 seconds) that was
intended.

Fix this by changing the type to shift from (1 << 22) to (1UL << 22).

Closes: https://lore.kernel.org/all/CA+G9fYtE2GAbeqU+AOCffgo2oH0RTJUxU+=Pi3cFn4di_KgBAQ@mail.gmail.com/
Fixes: d741f297bcea ("sched/fair: Fair server interface")
Reported-by: Linux Kernel Functional Testing <[email protected]>
Reported-by: Arnd Bergmann <[email protected]>
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

show more ...


# 5f6bd380 27-May-2024 Peter Zijlstra <[email protected]>

sched/rt: Remove default bandwidth control

Now that fair_server exists, we no longer need RT bandwidth control
unless RT_GROUP_SCHED.

Enable fair_server with parameters equivalent to RT throttling.

sched/rt: Remove default bandwidth control

Now that fair_server exists, we no longer need RT bandwidth control
unless RT_GROUP_SCHED.

Enable fair_server with parameters equivalent to RT throttling.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: "Peter Zijlstra (Intel)" <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: "Vineeth Pillai (Google)" <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/14d562db55df5c3c780d91940743acb166895ef7.1716811044.git.bristot@kernel.org

show more ...


# d741f297 27-May-2024 Daniel Bristot de Oliveira <[email protected]>

sched/fair: Fair server interface

Add an interface for fair server setup on debugfs.

Each CPU has two files under /debug/sched/fair_server/cpu{ID}:

- runtime: set runtime in ns
- period: set pe

sched/fair: Fair server interface

Add an interface for fair server setup on debugfs.

Each CPU has two files under /debug/sched/fair_server/cpu{ID}:

- runtime: set runtime in ns
- period: set period in ns

This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to set
bounds on admission control.

The interface also add the server to the dl bandwidth accounting.

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/a9ef9fc69bcedb44bddc9bc34f2b313296052819.1716811044.git.bristot@kernel.org

show more ...


# 2c2d9624 17-Jul-2024 Chuyi Zhou <[email protected]>

sched/fair: Remove cfs_rq::nr_spread_over and cfs_rq::exec_clock

nr_spread_over tracks the number of instances where the difference
between a scheduling entity's virtual runtime and the minimum virt

sched/fair: Remove cfs_rq::nr_spread_over and cfs_rq::exec_clock

nr_spread_over tracks the number of instances where the difference
between a scheduling entity's virtual runtime and the minimum virtual
runtime in the runqueue exceeds three times the scheduler latency,
indicating significant disparity in task scheduling.
Commit that removed its usage: 5e963f2bd: sched/fair: Commit to EEVDF

cfs_rq->exec_clock was used to account for time spent executing tasks.
Commit that removed its usage: 5d69eca542ee1 sched: Unify runtime
accounting across classes

cfs_rq::nr_spread_over and cfs_rq::exec_clock are not used anymore in
eevdf. Remove them from struct cfs_rq.

Signed-off-by: Chuyi Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Reviewed-by: K Prateek Nayak <[email protected]>
Acked-by: Vishal Chourasia <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# f0e1a064 18-Jun-2024 Tejun Heo <[email protected]>

sched_ext: Implement BPF extensible scheduler class

Implement a new scheduler class sched_ext (SCX), which allows scheduling
policies to be implemented as BPF programs to achieve the following:

1.

sched_ext: Implement BPF extensible scheduler class

Implement a new scheduler class sched_ext (SCX), which allows scheduling
policies to be implemented as BPF programs to achieve the following:

1. Ease of experimentation and exploration: Enabling rapid iteration of new
scheduling policies.

2. Customization: Building application-specific schedulers which implement
policies that are not applicable to general-purpose schedulers.

3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
policies in production environments.

sched_ext leverages BPF’s struct_ops feature to define a structure which
exports function callbacks and flags to BPF programs that wish to implement
scheduling policies. The struct_ops structure exported by sched_ext is
struct sched_ext_ops, and is conceptually similar to struct sched_class. The
role of sched_ext is to map the complex sched_class callbacks to the more
simple and ergonomic struct sched_ext_ops callbacks.

For more detailed discussion on the motivations and overview, please refer
to the cover letter.

Later patches will also add several example schedulers and documentation.

This patch implements the minimum core framework to enable implementation of
BPF schedulers. Subsequent patches will gradually add functionalities
including safety guarantee mechanisms, nohz and cgroup support.

include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on
top, each operation should be self-explanatory. The followings are worth
noting:

- Both "sched_ext" and its shorthand "scx" are used. If the identifier
already has "sched" in it, "ext" is used; otherwise, "scx".

- In sched_ext_ops, only .name is mandatory. Every operation is optional and
if omitted a simple but functional default behavior is provided.

- A new policy constant SCHED_EXT is added and a task can select sched_ext
by invoking sched_setscheduler(2) with the new policy constant. However,
if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL
and the task is scheduled by CFS. When the BPF scheduler is loaded, all
tasks which have the SCHED_EXT policy are switched to sched_ext.

- To bridge the workflow imbalance between the scheduler core and
sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch
queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and
one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
convenience and need not be used by a scheduler that doesn't require it.
SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting
the next task on the CPU. The BPF scheduler can manage an arbitrary number
of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().

- sched_ext guarantees system integrity no matter what the BPF scheduler
does. To enable this, each task's ownership is tracked through
p->scx.ops_state and all tasks are put on scx_tasks list. The disable path
can always recover and revert all tasks back to CFS. See p->scx.ops_state
and scx_tasks.

- A task is not tied to its rq while enqueued. This decouples CPU selection
from queueing and allows sharing a scheduling queue across an arbitrary
subset of CPUs. This adds some complexities as a task may need to be
bounced between rq's right before it starts executing. See
dispatch_to_local_dsq() and move_task_to_local_dsq().

- One complication that arises from the above weak association between task
and rq is that synchronizing with dequeue() gets complicated as dequeue()
may happen anytime while the task is enqueued and the dispatch path might
need to release the rq lock to transfer the task. Solving this requires a
bit of complexity. See the logic around p->scx.sticky_cpu and
p->scx.ops_qseq.

- Both enable and disable paths are a bit complicated. The enable path
switches all tasks without blocking to avoid issues which can arise from
partially switched states (e.g. the switching task itself being starved).
The disable path can't trust the BPF scheduler at all, so it also has to
guarantee forward progress without blocking. See scx_ops_enable() and
scx_ops_disable_workfn().

- When sched_ext is disabled, static_branches are used to shut down the
entry points from hot paths.

v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab
scx_ops_enable_mutex which can lead to deadlocks in the disable path.
Fixed.

- Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead
to use-after-free.

- Consolidated per-cpu variable usages and other cleanups.

v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple
groups can be expressed. Later CPU hotplug operations are put into
their own group.

- SCX_OPS_DISABLING state is replaced with the new bypass mechanism
which allows temporarily putting the system into simple FIFO
scheduling mode bypassing the BPF scheduler. In addition to the shut
down path, this will also be used to isolate the BPF scheduler across
PM events. Enabling and disabling the bypass mode requires iterating
all runnable tasks. rq->scx.runnable_list addition is moved from the
later watchdog patch.

- ops.prep_enable() is replaced with ops.init_task() and
ops.enable/disable() are now called whenever the task enters and
leaves sched_ext instead of when the task becomes schedulable on
sched_ext and stops being so. A new operation - ops.exit_task() - is
called when the task stops being schedulable on sched_ext.

- scx_bpf_dispatch() can now be called from ops.select_cpu() too. This
removes the need for communicating local dispatch decision made by
ops.select_cpu() to ops.enqueue() via per-task storage.
SCX_KF_SELECT_CPU is added to support the change.

- SCX_TASK_ENQ_LOCAL which told the BPF scheudler that
scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ
was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly
if it finds a suitable idle CPU. If such behavior is not desired,
users can use scx_bpf_select_cpu_dfl() which returns the verdict in a
bool out param.

- scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up
queueing many tasks on a local DSQ which makes tasks to execute in
order while other CPUs stay idle which made some hackbench numbers
really bad. Fixed.

- The current state of sched_ext can now be monitored through files
under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is
to enable monitoring on kernels which don't enable debugfs.

- sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may
be NULL and a BPF scheduler which derefs the pointer without checking
could crash the kernel. Tell BPF. This is currently a bit ugly. A
better way to annotate this is expected in the future.

- scx_exit_info updated to carry pointers to message buffers instead of
embedding them directly. This decouples buffer sizes from API so that
they can be changed without breaking compatibility.

- exit_code added to scx_exit_info. This is used to indicate different
exit conditions on non-error exits and will be used to handle e.g. CPU
hotplugs.

- The patch "sched_ext: Allow BPF schedulers to switch all eligible
tasks into sched_ext" is folded in and the interface is changed so
that partial switching is indicated with a new ops flag
%SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessasry
and in turn SCX_KF_INIT. ops.init() is now called with
SCX_KF_SLEEPABLE.

- Code reorganized so that only the parts necessary to integrate with
the rest of the kernel are in the header files.

- Changes to reflect the BPF and other kernel changes including the
addition of bpf_sched_ext_ops.cfi_stubs.

v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t
instead of atomic64_t and scx_dsp_buf_ent.qseq which uses
load_acquire/store_release is now unsigned long instead of u64.

- Fix the bug where bpf_scx_btf_struct_access() was allowing write
access to arbitrary fields.

- Distinguish kfuncs which can be called from any sched_ext ops and from
anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from
sched_ext ops.

- Rename "type" to "kind" in scx_exit_info to make it easier to use on
languages in which "type" is a reserved keyword.

- Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle
setup"), PF_IDLE is not set on idle tasks which haven't been online
yet which made scx_task_iter_next_filtered() include those idle tasks
in iterations leading to oopses. Update scx_task_iter_next_filtered()
to directly test p->sched_class against idle_sched_class instead of
using is_idle_task() which tests PF_IDLE.

- Other updates to match upstream changes such as adding const to
set_cpumask() param and renaming check_preempt_curr() to
wakeup_preempt().

v4: - SCHED_CHANGE_BLOCK replaced with the previous
sched_deq_and_put_task()/sched_enq_and_set_tsak() pair. This is
because upstream is adaopting a different generic cleanup mechanism.
Once that lands, the code will be adapted accordingly.

- task_on_scx() used to test whether a task should be switched into SCX,
which is confusing. Renamed to task_should_scx(). task_on_scx() now
tests whether a task is currently on SCX.

- scx_has_idle_cpus is barely used anymore and replaced with direct
check on the idle cpumask.

- SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer
fully idle cores.

- ops.enable() now sees up-to-date p->scx.weight value.

- ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF
schedulers expecting ->select_cpu() call.

- Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest
of the scheduler.

v3: - ops.set_weight() added to allow BPF schedulers to track weight changes
without polling p->scx.weight.

- move_task_to_local_dsq() was losing SCX-specific enq_flags when
enqueueing the task on the target dsq because it goes through
activate_task() which loses the upper 32bit of the flags. Carry the
flags through rq->scx.extra_enq_flags.

- scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running()
and scx_bpf_task_cpu() now use the new KF_RCU instead of
KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.

- The kfunc helper access control mechanism implemented through
sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always
used when invoking scx_ops operations.

v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is
called from put_prev_taks_scx() and pick_next_task_scx() as necessary.
To determine whether balance_scx() should be called from
put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the
comment in put_prev_task_scx() for details.

- sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced
with SCHED_CHANGE_BLOCK().

- Unused all_dsqs list removed. This was a left-over from previous
iterations.

- p->scx.kf_mask is added to track and enforce which kfunc helpers are
allowed. Also, init/exit sequences are updated to make some kfuncs
always safe to call regardless of the current BPF scheduler state.
Combined, this should make all the kfuncs safe.

- BPF now supports sleepable struct_ops operations. Hacky workaround
removed and operations and kfunc helpers are tagged appropriately.

- BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask()
and friends are added so that BPF schedulers can use the idle masks
with the generic helpers. This replaces the hacky kfunc helpers added
by a separate patch in V1.

- CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is
enabled. This restriction will be removed by a later patch which adds
core-sched support.

- Add MAINTAINERS entries and other misc changes.

Signed-off-by: Tejun Heo <[email protected]>
Co-authored-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
Cc: Andrea Righi <[email protected]>

show more ...


# 287372fa 30-Apr-2024 Vitalii Bursov <[email protected]>

sched/debug: Dump domains' level

Knowing domain's level exactly can be useful when setting
relax_domain_level or cpuset.sched_relax_domain_level

Usage:

cat /debug/sched/domains/cpu0/domain1/leve

sched/debug: Dump domains' level

Knowing domain's level exactly can be useful when setting
relax_domain_level or cpuset.sched_relax_domain_level

Usage:

cat /debug/sched/domains/cpu0/domain1/level

to dump cpu0 domain1's level.

SDM macro is not used because sd->level is 'int' and
it would hide the type mismatch between 'int' and 'u32'.

Signed-off-by: Vitalii Bursov <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Dietmar Eggemann <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Link: https://lore.kernel.org/r/9489b6475f6dd6fbc67c617752d4216fa094da53.1714488502.git.vitaly@bursov.com

show more ...


# 11137d38 01-Dec-2023 Vincent Guittot <[email protected]>

sched/fair: Simplify util_est

With UTIL_EST_FASTUP now being permanent, we can take advantage of the
fact that the ewma jumps directly to a higher utilization at dequeue to
simplify util_est and rem

sched/fair: Simplify util_est

With UTIL_EST_FASTUP now being permanent, we can take advantage of the
fact that the ewma jumps directly to a higher utilization at dequeue to
simplify util_est and remove the enqueued field.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Lukasz Luba <[email protected]>
Reviewed-by: Lukasz Luba <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Reviewed-by: Hongyan Xia <[email protected]>
Reviewed-by: Alex Shi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 2227a957 15-Nov-2023 Abel Wu <[email protected]>

sched/eevdf: Sort the rbtree by virtual deadline

Sort the task timeline by virtual deadline and keep the min_vruntime
in the augmented tree, so we can avoid doubling the worst case cost
and make ful

sched/eevdf: Sort the rbtree by virtual deadline

Sort the task timeline by virtual deadline and keep the min_vruntime
in the augmented tree, so we can avoid doubling the worst case cost
and make full use of the cached leftmost node to enable O(1) fastpath
picking in next patch.

Signed-off-by: Abel Wu <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

show more ...


# 5fe77659 28-Sep-2023 Valentin Schneider <[email protected]>

sched/deadline: Make dl_rq->pushable_dl_tasks update drive dl_rq->overloaded

dl_rq->dl_nr_migratory is increased whenever a DL entity is enqueued and it has
nr_cpus_allowed > 1. Unlike the pushable_

sched/deadline: Make dl_rq->pushable_dl_tasks update drive dl_rq->overloaded

dl_rq->dl_nr_migratory is increased whenever a DL entity is enqueued and it has
nr_cpus_allowed > 1. Unlike the pushable_dl_tasks tree, dl_rq->dl_nr_migratory
includes a dl_rq's current task. This means a dl_rq can have a migratable
current, N non-migratable queued tasks, and be flagged as overloaded and have
its CPU set in the dlo_mask, despite having an empty pushable_tasks tree.

Make an dl_rq's overload logic be driven by {enqueue,dequeue}_pushable_dl_task(),
in other words make DL RQs only be flagged as overloaded if they have at
least one runnable-but-not-current migratable task.

o push_dl_task() is unaffected, as it is a no-op if there are no pushable
tasks.

o pull_dl_task() now no longer scans runqueues whose sole migratable task is
their current one, which it can't do anything about anyway.
It may also now pull tasks to a DL RQ with dl_nr_running > 1 if only its
current task is migratable.

Since dl_rq->dl_nr_migratory becomes unused, remove it.

RT had the exact same mechanism (rt_rq->rt_nr_migratory) which was dropped
in favour of relying on rt_rq->pushable_tasks, see:

612f769edd06 ("sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask")

Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 612f769e 11-Aug-2023 Valentin Schneider <[email protected]>

sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask

Sebastian noted that the rto_push_work IRQ work can be queued for a CPU
that has an empty pushable_tasks list, which means nothing useful

sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask

Sebastian noted that the rto_push_work IRQ work can be queued for a CPU
that has an empty pushable_tasks list, which means nothing useful will be
done in the IPI other than queue the work for the next CPU on the rto_mask.

rto_push_irq_work_func() only operates on tasks in the pushable_tasks list,
but the conditions for that irq_work to be queued (and for a CPU to be
added to the rto_mask) rely on rq_rt->nr_migratory instead.

nr_migratory is increased whenever an RT task entity is enqueued and it has
nr_cpus_allowed > 1. Unlike the pushable_tasks list, nr_migratory includes a
rt_rq's current task. This means a rt_rq can have a migratible current, N
non-migratible queued tasks, and be flagged as overloaded / have its CPU
set in the rto_mask, despite having an empty pushable_tasks list.

Make an rt_rq's overload logic be driven by {enqueue,dequeue}_pushable_task().
Since rt_rq->{rt_nr_migratory,rt_nr_total} become unused, remove them.

Note that the case where the current task is pushed away to make way for a
migration-disabled task remains unchanged: the migration-disabled task has
to be in the pushable_tasks list in the first place, which means it has
nr_cpus_allowed > 1.

Reported-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


1234567