Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7
# 56209334 | 13-Mar-2025 | Juri Lelli <[email protected]>
sched/topology: Wrappers for sched_domains_mutex
Create wrappers for sched_domains_mutex so that it can be used transparently on both CONFIG_SMP and !CONFIG_SMP, as some functions will need to do.
Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <[email protected]> Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Tested-by: Waiman Long <[email protected]> Tested-by: Jon Hunter <[email protected]> Tested-by: Dietmar Eggemann <[email protected]> Link: https://lore.kernel.org/r/[email protected]
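For illustration, a minimal sketch of what such wrappers could look like (the helper names are an assumption based on the description above; on !CONFIG_SMP they simply compile away):

  /*
   * Hedged sketch of the wrapper idea described above; the actual helper
   * names and placement in the kernel may differ.
   */
  #ifdef CONFIG_SMP
  extern struct mutex sched_domains_mutex;

  static inline void sched_domains_mutex_lock(void)
  {
          mutex_lock(&sched_domains_mutex);
  }

  static inline void sched_domains_mutex_unlock(void)
  {
          mutex_unlock(&sched_domains_mutex);
  }
  #else
  /* No sched domains on !CONFIG_SMP, so the wrappers are no-ops. */
  static inline void sched_domains_mutex_lock(void) { }
  static inline void sched_domains_mutex_unlock(void) { }
  #endif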
# 8bdc5daa | 14-Mar-2025 | Sebastian Andrzej Siewior <[email protected]>
sched: Add a generic function to return the preemption string
The individual architectures often add the preemption model to the beginning of the backtrace. This is the case on X86 or ARM64 for the "die" case but not for regular warnings. With the addition of PREEMPT_DYNAMIC for PREEMPT_RT we end up with CONFIG_PREEMPT and CONFIG_PREEMPT_RT set simultaneously. That means everyone who tried to add that piece of information gets it wrong for PREEMPT_RT, because PREEMPT is checked first.
Provide a generic function which returns the current scheduling model considering LAZY preempt and the current state of PREEMPT_DYNAMIC.
The resulting strings are:

 Model     | -RT -DYN     | +RT -DYN          | -RT +DYN           | +RT +DYN
 ----------+--------------+-------------------+--------------------+-------------------
 NONE      | NONE         | n/a               | PREEMPT(none)      | n/a
 VOLUNTARY | VOLUNTARY    | n/a               | PREEMPT(voluntary) | n/a
 FULL      | PREEMPT      | PREEMPT_RT        | PREEMPT(full)      | PREEMPT_{RT,full}
 LAZY      | PREEMPT_LAZY | PREEMPT_{RT,LAZY} | PREEMPT(lazy)      | PREEMPT_{RT,lazy}
[ The dynamic building of the string can lead to an empty string if the function is invoked simultaneously on two CPUs. ]
Co-developed-by: "Peter Zijlstra (Intel)" <[email protected]> Signed-off-by: "Peter Zijlstra (Intel)" <[email protected]> Co-developed-by: "Steven Rostedt (Google)" <[email protected]> Signed-off-by: "Steven Rostedt (Google)" <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Shrikanth Hegde <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Revision tags: v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1
# 9065ce69 | 29-Jan-2025 | Christian Loehle <[email protected]>
sched/debug: Provide slice length for fair tasks
Since commit:
857b158dc5e8 ("sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion")
... we have the userspace per-task tunable slice length, which is a key parameter that is otherwise difficult to obtain, so provide it in /proc/$PID/sched.
[ mingo: Clarified the changelog. ]
Signed-off-by: Christian Loehle <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
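A hedged sketch of the kind of line this adds to the /proc/$PID/sched dump; SEQ_printf and the surrounding proc_sched_show_task() formatting are the existing kernel/sched/debug.c conventions, and the exact field name shown here is an assumption:

  static void print_task_slice_sketch(struct seq_file *m, struct task_struct *p)
  {
          /* p->se.slice is the per-task request/slice length in nanoseconds,
           * i.e. the value suggested via sched_attr::sched_runtime. */
          SEQ_printf(m, "%-45s:%21Ld\n", "se.slice", (long long)p->se.slice);
  }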
Revision tags: v6.13, v6.13-rc7
# 8061b9f5 | 10-Jan-2025 | David Rientjes <[email protected]>
sched/debug: Change need_resched warnings to pr_err
need_resched warnings, if enabled, are treated as WARNINGs. If kernel.panic_on_warn is enabled, then this causes a kernel panic.
It's highly unlikely that a panic is desired for these warnings; normally only a stack trace is required to debug and resolve the issue.
Thus, switch need_resched warnings to simply be a printk with an associated stack trace so they are no longer in scope for panic_on_warn.
Signed-off-by: David Rientjes <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Madadi Vineeth Reddy <[email protected]> Acked-by: Josh Don <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
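A hedged sketch of the change described above: report via pr_err() plus an explicit stack dump instead of WARN(), so kernel.panic_on_warn no longer escalates the report into a panic. Function name and message text are illustrative, not the actual kernel code:

  static void resched_latency_warn_sketch(int cpu, u64 latency_ns)
  {
          /* Plain printk + stack trace: out of scope for panic_on_warn. */
          pr_err("sched: CPU %d need_resched set for %llu ns without schedule\n",
                 cpu, (unsigned long long)latency_ns);
          dump_stack();
  }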
Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2
# 736c55a0 | 02-Dec-2024 | Vincent Guittot <[email protected]>
sched/fair: Rename cfs_rq.nr_running into nr_queued
Rename cfs_rq.nr_running into cfs_rq.nr_queued, which better reflects reality, as the value includes both the ready-to-run tasks and the delayed-dequeue tasks.
Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Link: https://lore.kernel.org/r/[email protected]
# 43eef7c3 | 02-Dec-2024 | Vincent Guittot <[email protected]>
sched/fair: Remove unused cfs_rq.idle_nr_running
The cfs_rq.idle_nr_running field is not used anywhere, so we can remove the now-useless associated computation. Its last user was removed in commit 5e963f2bd465 ("sched/fair: Commit to EEVDF").
Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Link: https://lore.kernel.org/r/[email protected]
# 31898e7b | 02-Dec-2024 | Vincent Guittot <[email protected]>
sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle
Use the same naming convention as the other h_nr_* fields and rename idle_h_nr_running into h_nr_idle. The "running" part is no longer correct, as the count includes delayed-dequeue tasks as well.
Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Link: https://lore.kernel.org/r/[email protected]
# 9216582b | 02-Dec-2024 | Vincent Guittot <[email protected]>
sched/fair: Remove unused cfs_rq.h_nr_delayed
h_nr_delayed is not used anymore. We now have:
- h_nr_runnable, which tracks tasks ready to run
- h_nr_queued, which tracks enqueued tasks, either ready to run or delayed dequeue
Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Link: https://lore.kernel.org/r/[email protected]
# c2a295bf | 02-Dec-2024 | Vincent Guittot <[email protected]>
sched/fair: Add new cfs_rq.h_nr_runnable
With the delayed-dequeue feature, a sleeping sched_entity remains queued in the rq until its lag has elapsed. As a result, it also stays visible in the statistics used to balance the system, in particular the field cfs.h_nr_queued when the sched_entity is associated with a task.
Create a new h_nr_runnable that tracks only queued and runnable tasks.
Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Link: https://lore.kernel.org/r/[email protected]
# 7b8a702d | 02-Dec-2024 | Vincent Guittot <[email protected]>
sched/fair: Rename h_nr_running into h_nr_queued
With the delayed-dequeue feature, a sleeping sched_entity remains queued in the rq until its lag has elapsed, but it can't run. Rename h_nr_running into h_nr_queued to reflect this new behavior.
Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Link: https://lore.kernel.org/r/[email protected]
# 76f2f783 | 02-Dec-2024 | Peter Zijlstra <[email protected]>
sched/eevdf: More PELT vs DELAYED_DEQUEUE
Vincent and Dietmar noted that while commit fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") fixes the entity runnable stats, it does not adjust the cfs_rq runnable stats, which are based on h_nr_running.
Track h_nr_delayed such that we can discount those and adjust the signal.
Fixes: fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") Closes: https://lore.kernel.org/lkml/[email protected]/ Closes: https://lore.kernel.org/lkml/CAKfTPtCNUvWE_GX5LyvTF-WdxUT=ZgvZZv-4t=eWntg5uOFqiQ@mail.gmail.com/ [ Fixes checkpatch warnings and rebased ] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reported-by: Dietmar Eggemann <[email protected]> Reported-by: Vincent Guittot <[email protected]> Signed-off-by: "Peter Zijlstra (Intel)" <[email protected]> Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Tested-by: K Prateek Nayak <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Revision tags: v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2
# 35772d62 | 04-Oct-2024 | Peter Zijlstra <[email protected]>
sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
In order to enable PREEMPT_DYNAMIC for PREEMPT_RT, remove PREEMPT_RT from the 'Preemption Model' choice. Strictly speaking PREEMPT_RT is not a change in how preemption works, but rather it makes a ton more code preemptible.
Notably, take away the NONE and VOLUNTARY options for PREEMPT_RT; they make no sense (but are technically possible).
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
# 7c70cb94 | 04-Oct-2024 | Peter Zijlstra <[email protected]>
sched: Add Lazy preemption model
Change fair to use resched_curr_lazy(), which, when the lazy preemption model is selected, will set TIF_NEED_RESCHED_LAZY.
This LAZY bit will be promoted to the full NEED_RESCHED bit on tick. As such, the average delay between setting LAZY and actually rescheduling will be TICK_NSEC/2.
In short, Lazy preemption will delay preemption for the fair class but will function as Full preemption for all the other classes, most notably the realtime (RR/FIFO/DEADLINE) classes.
The goal is to bridge the performance gap with Voluntary, such that we might eventually remove that option entirely.
Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
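A hedged sketch of the mechanism described above, not the actual kernel code: preempt_model_lazy() is assumed as the runtime check, and the tick-time promotion is folded into a simplified handler:

  static void resched_curr_lazy_sketch(struct rq *rq)
  {
          /* Fair-class preemption requests only set the cheap LAZY bit. */
          if (preempt_model_lazy())
                  set_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY);
          else
                  resched_curr(rq);
  }

  static void sched_tick_promote_sketch(struct rq *rq)
  {
          /* Promote LAZY to a full NEED_RESCHED at tick time, so the average
           * extra delay before rescheduling is about TICK_NSEC/2. */
          if (test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
                  resched_curr(rq);
  }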
Revision tags: v6.12-rc1, v6.11, v6.11-rc7
# 2cab4bd0 | 06-Sep-2024 | Huang Shijie <[email protected]>
sched/debug: Fix the runnable tasks output
The current runnable tasks output looks like:
runnable tasks:
 S task PID tree-key switches prio wait-time sum-exec sum-sleep
 -------------------------------------------------------------------------------------------------------------
 Ikworker/R-rcu_g 4 0.129049 E 0.620179 0.750000 0.002920 2 100 0.000000 0.002920 0.000000 0.000000 0 0 /
 Ikworker/R-sync_ 5 0.125328 E 0.624147 0.750000 0.001840 2 100 0.000000 0.001840 0.000000 0.000000 0 0 /
 Ikworker/R-slub_ 6 0.120835 E 0.628680 0.750000 0.001800 2 100 0.000000 0.001800 0.000000 0.000000 0 0 /
 Ikworker/R-netns 7 0.114294 E 0.634701 0.750000 0.002400 2 100 0.000000 0.002400 0.000000 0.000000 0 0 /
 I kworker/0:1 9 508.781746 E 511.754666 3.000000 151.575240 224 120 0.000000 151.575240 0.000000 0.000000 0 0 /
Which is messy. Remove the duplicate printing of sum_exec_runtime and tidy up the layout to make it look like:
runnable tasks:
 S task PID vruntime eligible deadline slice sum-exec switches prio wait-time sum-sleep sum-block node group-id group-path
 -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 I kworker/0:3 1698 295.001459 E 297.977619 3.000000 38.862920 9 120 0.000000 0.000000 0.000000 0 0 /
 I kworker/0:4 1702 278.026303 E 281.026303 3.000000 9.918760 3 120 0.000000 0.000000 0.000000 0 0 /
 S NetworkManager 2646 0.377936 E 2.598104 3.000000 98.535880 314 120 0.000000 0.000000 0.000000 0 0 /system.slice/NetworkManager.service
 S virtqemud 2689 0.541016 E 2.440104 3.000000 50.967960 80 120 0.000000 0.000000 0.000000 0 0 /system.slice/virtqemud.service
 S gsd-smartcard 3058 73.604144 E 76.475904 3.000000 74.033320 88 120 0.000000 0.000000 0.000000 0 0 /user.slice/user-42.slice/session-c1.scope
Reviewed-by: Christoph Lameter (Ampere) <[email protected]> Signed-off-by: Huang Shijie <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
Revision tags: v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4
# 857b158d | 22-May-2023 | Peter Zijlstra <[email protected]>
sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion
Allow applications to directly set a suggested request/slice length using sched_attr::sched_runtime.
The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms] which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100.
Applications should strive to use their periodic runtime at a high confidence interval (95%+) as the target slice. Using a smaller slice will introduce undue preemptions, while using a larger value will increase latency.
For all the following examples assume a scheduling quantum of 8, and for consistency all examples have W=4:
{A,B,C,D}(w=1,r=8):
ABCD...  +---+---+---+---
[ASCII timeline snapshots at t=0 V=1.5, t=1 V=3.5, t=2 V=5.5, t=3 V=7.5: request spans ( |------< ) for A, B, C, D with the virtual-time marker (*) advancing along the axis]
Note: 4 identical tasks in FIFO order
~~~
{A,B}(w=1,r=16) C(w=2,r=16)
AACCBBCC...  +---+---+---+---
[ASCII timeline snapshots at t=0 V=1.25, t=2 V=5.25, t=4 V=8.25, t=6 V=12.25: request spans for A and B (r=16) and C (w=2, r=16) with the virtual-time marker (*) advancing along the axis]
Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2 task doesn't go below q.
Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length.
Note: the period of the heavy task is half the full period at: W*(r_i/w_i) = 4*(2q/2) = 4q
~~~
{A,C,D}(w=1,r=16) B(w=1,r=8):
BAACCBDD...  +---+---+---+---
[ASCII timeline snapshots at t=0 V=1.5, t=1 V=3.5, t=3 V=7.5, t=5 V=11.5, t=6 V=13.5: request spans for A, C, D (r=16) and the short task B (r=8) with the virtual-time marker (*) advancing along the axis]
Note: 1 short task -- again double r so that the deadline of the short task won't be below q. Made B short because it's not the leftmost task, but is eligible with the 0,1,2,3 spread.
Note: like with the heavy task, the period of the short task observes: W*(r_i/w_i) = 4*(1q/1) = 4q
~~~
A(w=1,r=16) B(w=1,r=8) C(w=2,r=16)
BCCAABCC...  +---+---+---+---
[ASCII timeline snapshots at t=0 V=1.25, t=1 V=3.25, t=3 V=7.25, t=5 V=11.25, t=6 V=13.25: request spans for A (r=16), B (r=8) and C (w=2, r=16) with the virtual-time marker (*) advancing along the axis]
Note: 1 heavy and 1 short task -- combine them all.
Note: both the short and heavy task end up with a period of 4q
~~~
A(w=1,r=16) B(w=2,r=16) C(w=1,r=8)
BBCAABBC...  +---+---+---+---
[ASCII timeline snapshots at t=0 V=1, t=2 V=5, t=3 V=7, t=5 V=11, t=7 V=15: request spans for A (r=16), B (w=2, r=16) and C (r=8) with the virtual-time marker (*) advancing along the axis]
Note: as before but permuted
~~~
From all this it can be deduced that, for the steady state:
- the total period (P) of a schedule is: W*max(r_i/w_i) - the average period of a task is: W*(r_i/w_i) - each task obtains the fair share: w_i/W of each full period P
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Valentin Schneider <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
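A hedged userspace sketch of how an application might suggest a slice via sched_attr::sched_runtime as described above. The local struct mirrors the uapi sched_attr layout so the sketch builds without kernel headers installed; error handling is minimal and the kernel clamps the suggestion to the 0.1ms..100ms range noted above:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  struct sched_attr_sketch {
          uint32_t size;
          uint32_t sched_policy;
          uint64_t sched_flags;
          int32_t  sched_nice;
          uint32_t sched_priority;
          uint64_t sched_runtime;         /* used as the request/slice suggestion */
          uint64_t sched_deadline;
          uint64_t sched_period;
  };

  int main(void)
  {
          struct sched_attr_sketch attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.sched_policy = 0;                  /* SCHED_OTHER / SCHED_NORMAL */
          attr.sched_runtime = 2 * 1000 * 1000;   /* suggest a 2ms slice, in ns */

          if (syscall(SYS_sched_setattr, 0, &attr, 0))
                  perror("sched_setattr");
          return 0;
  }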
# 4ae0c2b9 | 01-Aug-2024 | Dan Carpenter <[email protected]>
sched/debug: Fix fair_server_period_max value
This code has an integer overflow or sign extension bug which was caught by gcc-13:
kernel/sched/debug.c:341:57: error: integer overflow in expression of type 'long int' results in '-100663296' [-Werror=overflow] 341 | static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
The result is that "fair_server_period_max" is set to 0xfffffffffa000000 (585 years) instead of the intended 0xfa000000 (4 seconds).
Fix this by changing the shift from (1 << 22) to (1UL << 22), so the expression is evaluated as unsigned long.
Closes: https://lore.kernel.org/all/CA+G9fYtE2GAbeqU+AOCffgo2oH0RTJUxU+=Pi3cFn4di_KgBAQ@mail.gmail.com/ Fixes: d741f297bcea ("sched/fair: Fair server interface") Reported-by: Linux Kernel Functional Testing <[email protected]> Reported-by: Arnd Bergmann <[email protected]> Reported-by: Stephen Rothwell <[email protected]> Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
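A hedged illustration of the fix, outside its kernel context. NSEC_PER_USEC is 1000L in the kernel; gcc-13 flags the first form as overflowing in 'long int' on configurations where long is 32 bits, and the changelog above notes the stored value ended up wildly larger than the intended ~4 seconds:

  #define NSEC_PER_USEC 1000L

  /* Before: the arithmetic is done in a type too narrow for 4194304000. */
  unsigned long fair_server_period_max_before = (1 << 22) * NSEC_PER_USEC;

  /* After: 1UL forces unsigned long arithmetic, giving 0xfa000000 (~4s). */
  unsigned long fair_server_period_max_after = (1UL << 22) * NSEC_PER_USEC;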
# 5f6bd380 | 27-May-2024 | Peter Zijlstra <[email protected]>
sched/rt: Remove default bandwidth control
Now that fair_server exists, we no longer need RT bandwidth control unless RT_GROUP_SCHED.
Enable fair_server with parameters equivalent to RT throttling.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: "Peter Zijlstra (Intel)" <[email protected]> Signed-off-by: Daniel Bristot de Oliveira <[email protected]> Signed-off-by: "Vineeth Pillai (Google)" <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Juri Lelli <[email protected]> Link: https://lore.kernel.org/r/14d562db55df5c3c780d91940743acb166895ef7.1716811044.git.bristot@kernel.org
# d741f297 | 27-May-2024 | Daniel Bristot de Oliveira <[email protected]>
sched/fair: Fair server interface
Add an interface for fair server setup on debugfs.
Each CPU has two files under /debug/sched/fair_server/cpu{ID}:
- runtime: set the runtime in ns
- period: set the period in ns
This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to set bounds on admission control.
The interface also adds the server to the DL bandwidth accounting.
Signed-off-by: Daniel Bristot de Oliveira <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Juri Lelli <[email protected]> Link: https://lore.kernel.org/r/a9ef9fc69bcedb44bddc9bc34f2b313296052819.1716811044.git.bristot@kernel.org
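A hedged sketch of how the per-CPU debugfs files described above could be wired up. The "fair_server_runtime_fops" and "fair_server_period_fops" names are assumptions for the file_operations that parse the written values and update the per-CPU deadline server:

  static void fair_server_debugfs_init_sketch(struct dentry *sched_debugfs_dir)
  {
          struct dentry *d_fair = debugfs_create_dir("fair_server", sched_debugfs_dir);
          char name[32];
          int cpu;

          for_each_possible_cpu(cpu) {
                  struct dentry *d_cpu;

                  snprintf(name, sizeof(name), "cpu%d", cpu);
                  d_cpu = debugfs_create_dir(name, d_fair);

                  /* Writing these files updates the per-CPU fair server parameters. */
                  debugfs_create_file("runtime", 0644, d_cpu,
                                      (void *)(long)cpu, &fair_server_runtime_fops);
                  debugfs_create_file("period", 0644, d_cpu,
                                      (void *)(long)cpu, &fair_server_period_fops);
          }
  }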
# 2c2d9624 | 17-Jul-2024 | Chuyi Zhou <[email protected]>
sched/fair: Remove cfs_rq::nr_spread_over and cfs_rq::exec_clock
nr_spread_over tracks the number of instances where the difference between a scheduling entity's virtual runtime and the minimum virtual runtime in the runqueue exceeds three times the scheduler latency, indicating a significant disparity in task scheduling. Its last user was removed in commit 5e963f2bd ("sched/fair: Commit to EEVDF").
cfs_rq->exec_clock was used to account for the time spent executing tasks. Its last user was removed in commit 5d69eca542ee1 ("sched: Unify runtime accounting across classes").
cfs_rq::nr_spread_over and cfs_rq::exec_clock are not used anymore in eevdf. Remove them from struct cfs_rq.
Signed-off-by: Chuyi Zhou <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Reviewed-by: K Prateek Nayak <[email protected]> Acked-by: Vishal Chourasia <[email protected]> Link: https://lore.kernel.org/r/[email protected]
# f0e1a064 | 18-Jun-2024 | Tejun Heo <[email protected]>
sched_ext: Implement BPF extensible scheduler class
Implement a new scheduler class sched_ext (SCX), which allows scheduling policies to be implemented as BPF programs to achieve the following:
1. Ease of experimentation and exploration: Enabling rapid iteration of new scheduling policies.
2. Customization: Building application-specific schedulers which implement policies that are not applicable to general-purpose schedulers.
3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling policies in production environments.
sched_ext leverages BPF's struct_ops feature to define a structure which exports function callbacks and flags to BPF programs that wish to implement scheduling policies. The struct_ops structure exported by sched_ext is struct sched_ext_ops, and is conceptually similar to struct sched_class. The role of sched_ext is to map the complex sched_class callbacks to the simpler and more ergonomic struct sched_ext_ops callbacks.
For more detailed discussion on the motivations and overview, please refer to the cover letter.
Later patches will also add several example schedulers and documentation.
This patch implements the minimum core framework to enable implementation of BPF schedulers. Subsequent patches will gradually add functionalities including safety guarantee mechanisms, nohz and cgroup support.
include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on top, each operation should be self-explanatory. The following points are worth noting:
- Both "sched_ext" and its shorthand "scx" are used. If the identifier already has "sched" in it, "ext" is used; otherwise, "scx".
- In sched_ext_ops, only .name is mandatory. Every operation is optional and if omitted a simple but functional default behavior is provided.
- A new policy constant SCHED_EXT is added and a task can select sched_ext by invoking sched_setscheduler(2) with the new policy constant. However, if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL and the task is scheduled by CFS. When the BPF scheduler is loaded, all tasks which have the SCHED_EXT policy are switched to sched_ext.
- To bridge the workflow imbalance between the scheduler core and sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for convenience and need not be used by a scheduler that doesn't require it. SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting the next task on the CPU. The BPF scheduler can manage an arbitrary number of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().
- sched_ext guarantees system integrity no matter what the BPF scheduler does. To enable this, each task's ownership is tracked through p->scx.ops_state and all tasks are put on scx_tasks list. The disable path can always recover and revert all tasks back to CFS. See p->scx.ops_state and scx_tasks.
- A task is not tied to its rq while enqueued. This decouples CPU selection from queueing and allows sharing a scheduling queue across an arbitrary subset of CPUs. This adds some complexities as a task may need to be bounced between rq's right before it starts executing. See dispatch_to_local_dsq() and move_task_to_local_dsq().
- One complication that arises from the above weak association between task and rq is that synchronizing with dequeue() gets complicated as dequeue() may happen anytime while the task is enqueued and the dispatch path might need to release the rq lock to transfer the task. Solving this requires a bit of complexity. See the logic around p->scx.sticky_cpu and p->scx.ops_qseq.
- Both enable and disable paths are a bit complicated. The enable path switches all tasks without blocking to avoid issues which can arise from partially switched states (e.g. the switching task itself being starved). The disable path can't trust the BPF scheduler at all, so it also has to guarantee forward progress without blocking. See scx_ops_enable() and scx_ops_disable_workfn().
- When sched_ext is disabled, static_branches are used to shut down the entry points from hot paths.
v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab scx_ops_enable_mutex which can lead to deadlocks in the disable path. Fixed.
- Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead to use-after-free.
- Consolidated per-cpu variable usages and other cleanups.
v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple groups can be expressed. Later CPU hotplug operations are put into their own group.
- SCX_OPS_DISABLING state is replaced with the new bypass mechanism which allows temporarily putting the system into simple FIFO scheduling mode bypassing the BPF scheduler. In addition to the shut down path, this will also be used to isolate the BPF scheduler across PM events. Enabling and disabling the bypass mode requires iterating all runnable tasks. rq->scx.runnable_list addition is moved from the later watchdog patch.
- ops.prep_enable() is replaced with ops.init_task() and ops.enable/disable() are now called whenever the task enters and leaves sched_ext instead of when the task becomes schedulable on sched_ext and stops being so. A new operation - ops.exit_task() - is called when the task stops being schedulable on sched_ext.
- scx_bpf_dispatch() can now be called from ops.select_cpu() too. This removes the need for communicating local dispatch decision made by ops.select_cpu() to ops.enqueue() via per-task storage. SCX_KF_SELECT_CPU is added to support the change.
- SCX_TASK_ENQ_LOCAL, which told the BPF scheduler that scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ, was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly if it finds a suitable idle CPU. If such behavior is not desired, users can use scx_bpf_select_cpu_dfl() which returns the verdict in a bool out param.
- scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up queueing many tasks on a local DSQ, which makes tasks execute in order while other CPUs stay idle, which made some hackbench numbers really bad. Fixed.
- The current state of sched_ext can now be monitored through files under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is to enable monitoring on kernels which don't enable debugfs.
- sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may be NULL and a BPF scheduler which derefs the pointer without checking could crash the kernel. Tell BPF. This is currently a bit ugly. A better way to annotate this is expected in the future.
- scx_exit_info updated to carry pointers to message buffers instead of embedding them directly. This decouples buffer sizes from API so that they can be changed without breaking compatibility.
- exit_code added to scx_exit_info. This is used to indicate different exit conditions on non-error exits and will be used to handle e.g. CPU hotplugs.
- The patch "sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext" is folded in and the interface is changed so that partial switching is indicated with a new ops flag %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessasry and in turn SCX_KF_INIT. ops.init() is now called with SCX_KF_SLEEPABLE.
- Code reorganized so that only the parts necessary to integrate with the rest of the kernel are in the header files.
- Changes to reflect the BPF and other kernel changes including the addition of bpf_sched_ext_ops.cfi_stubs.
v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t instead of atomic64_t and scx_dsp_buf_ent.qseq which uses load_acquire/store_release is now unsigned long instead of u64.
- Fix the bug where bpf_scx_btf_struct_access() was allowing write access to arbitrary fields.
- Distinguish kfuncs which can be called from any sched_ext ops and from anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from sched_ext ops.
- Rename "type" to "kind" in scx_exit_info to make it easier to use on languages in which "type" is a reserved keyword.
- Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup"), PF_IDLE is not set on idle tasks which haven't been online yet which made scx_task_iter_next_filtered() include those idle tasks in iterations leading to oopses. Update scx_task_iter_next_filtered() to directly test p->sched_class against idle_sched_class instead of using is_idle_task() which tests PF_IDLE.
- Other updates to match upstream changes such as adding const to set_cpumask() param and renaming check_preempt_curr() to wakeup_preempt().
v4: - SCHED_CHANGE_BLOCK replaced with the previous sched_deq_and_put_task()/sched_enq_and_set_task() pair. This is because upstream is adopting a different generic cleanup mechanism. Once that lands, the code will be adapted accordingly.
- task_on_scx() used to test whether a task should be switched into SCX, which is confusing. Renamed to task_should_scx(). task_on_scx() now tests whether a task is currently on SCX.
- scx_has_idle_cpus is barely used anymore and replaced with direct check on the idle cpumask.
- SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer fully idle cores.
- ops.enable() now sees up-to-date p->scx.weight value.
- ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF schedulers expecting ->select_cpu() call.
- Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest of the scheduler.
v3: - ops.set_weight() added to allow BPF schedulers to track weight changes without polling p->scx.weight.
- move_task_to_local_dsq() was losing SCX-specific enq_flags when enqueueing the task on the target dsq because it goes through activate_task() which loses the upper 32bit of the flags. Carry the flags through rq->scx.extra_enq_flags.
- scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running() and scx_bpf_task_cpu() now use the new KF_RCU instead of KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.
- The kfunc helper access control mechanism implemented through sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always used when invoking scx_ops operations.
v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is called from put_prev_taks_scx() and pick_next_task_scx() as necessary. To determine whether balance_scx() should be called from put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the comment in put_prev_task_scx() for details.
- sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced with SCHED_CHANGE_BLOCK().
- Unused all_dsqs list removed. This was a left-over from previous iterations.
- p->scx.kf_mask is added to track and enforce which kfunc helpers are allowed. Also, init/exit sequences are updated to make some kfuncs always safe to call regardless of the current BPF scheduler state. Combined, this should make all the kfuncs safe.
- BPF now supports sleepable struct_ops operations. Hacky workaround removed and operations and kfunc helpers are tagged appropriately.
- BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask() and friends are added so that BPF schedulers can use the idle masks with the generic helpers. This replaces the hacky kfunc helpers added by a separate patch in V1.
- CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is enabled. This restriction will be removed by a later patch which adds core-sched support.
- Add MAINTAINERS entries and other misc changes.
Signed-off-by: Tejun Heo <[email protected]> Co-authored-by: David Vernet <[email protected]> Acked-by: Josh Don <[email protected]> Acked-by: Hao Luo <[email protected]> Acked-by: Barret Rhoden <[email protected]> Cc: Andrea Righi <[email protected]>
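For flavor, a hedged sketch of roughly the smallest possible BPF scheduler under this framework, following the global-DSQ FIFO behaviour described above. The header path, BPF_STRUCT_OPS macro and kfunc names (scx_bpf_dispatch, scx_bpf_consume) follow the early sched_ext tooling and may have been renamed since:

  #include <scx/common.bpf.h>

  char _license[] SEC("license") = "GPL";

  /* FIFO: put every runnable task on the shared global dispatch queue. */
  void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
  {
          scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
  }

  /* When a CPU's local DSQ runs dry, pull the next task from the global DSQ. */
  void BPF_STRUCT_OPS(minimal_dispatch, s32 cpu, struct task_struct *prev)
  {
          scx_bpf_consume(SCX_DSQ_GLOBAL);
  }

  SEC(".struct_ops.link")
  struct sched_ext_ops minimal_ops = {
          .enqueue  = (void *)minimal_enqueue,
          .dispatch = (void *)minimal_dispatch,
          .name     = "minimal",         /* .name is the only mandatory field */
  };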
# 287372fa | 30-Apr-2024 | Vitalii Bursov <[email protected]>
sched/debug: Dump domains' level
Knowing a domain's exact level can be useful when setting relax_domain_level or cpuset.sched_relax_domain_level.
Usage:
cat /debug/sched/domains/cpu0/domain1/level
to dump cpu0 domain1's level.
The SDM macro is not used because sd->level is 'int' and using it would hide the type mismatch between 'int' and 'u32'.
Signed-off-by: Vitalii Bursov <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Tested-by: Dietmar Eggemann <[email protected]> Acked-by: Vincent Guittot <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Link: https://lore.kernel.org/r/9489b6475f6dd6fbc67c617752d4216fa094da53.1714488502.git.vitaly@bursov.com
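A hedged sketch matching the note above: the debugfs u32 helper is called directly (rather than through the SDM() macro) so the int-vs-u32 cast on sd->level stays visible at the call site; the surrounding registration code is simplified here:

  static void register_sd_level_sketch(struct sched_domain *sd, struct dentry *parent)
  {
          /* Exposed as /debug/sched/domains/cpuN/domainM/level. */
          debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);
  }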
# 11137d38 | 01-Dec-2023 | Vincent Guittot <[email protected]>
sched/fair: Simplify util_est
With UTIL_EST_FASTUP now being permanent, we can take advantage of the fact that the ewma jumps directly to a higher utilization at dequeue to simplify util_est and remove the enqueued field.
Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Tested-by: Lukasz Luba <[email protected]> Reviewed-by: Lukasz Luba <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Reviewed-by: Hongyan Xia <[email protected]> Reviewed-by: Alex Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
2227a957 |
| 15-Nov-2023 |
Abel Wu <[email protected]> |
sched/eevdf: Sort the rbtree by virtual deadline
Sort the task timeline by virtual deadline and keep the min_vruntime in the augmented tree, so we can avoid doubling the worst case cost and make ful
sched/eevdf: Sort the rbtree by virtual deadline
Sort the task timeline by virtual deadline and keep the min_vruntime in the augmented tree, so we can avoid doubling the worst-case cost and make full use of the cached leftmost node to enable O(1) fast-path picking in the next patch.
Signed-off-by: Abel Wu <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
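A hedged sketch of the comparison that keys the rbtree once it is sorted by virtual deadline; the signed difference keeps the ordering correct across wrap-around, mirroring the usual vruntime comparison style:

  static inline bool entity_before_sketch(const struct sched_entity *a,
                                          const struct sched_entity *b)
  {
          /* Earlier virtual deadline sorts first. */
          return (s64)(a->deadline - b->deadline) < 0;
  }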
show more ...
|
| #
5fe77659 |
| 28-Sep-2023 |
Valentin Schneider <[email protected]> |
sched/deadline: Make dl_rq->pushable_dl_tasks update drive dl_rq->overloaded
dl_rq->dl_nr_migratory is increased whenever a DL entity is enqueued and it has nr_cpus_allowed > 1. Unlike the pushable_
sched/deadline: Make dl_rq->pushable_dl_tasks update drive dl_rq->overloaded
dl_rq->dl_nr_migratory is increased whenever a DL entity is enqueued and it has nr_cpus_allowed > 1. Unlike the pushable_dl_tasks tree, dl_rq->dl_nr_migratory includes a dl_rq's current task. This means a dl_rq can have a migratable current, N non-migratable queued tasks, and be flagged as overloaded and have its CPU set in the dlo_mask, despite having an empty pushable_tasks tree.
Make a dl_rq's overload logic be driven by {enqueue,dequeue}_pushable_dl_task(); in other words, make DL RQs only be flagged as overloaded if they have at least one runnable-but-not-current migratable task.
o push_dl_task() is unaffected, as it is a no-op if there are no pushable tasks.
o pull_dl_task() now no longer scans runqueues whose sole migratable task is their current one, which it can't do anything about anyway. It may also now pull tasks to a DL RQ with dl_nr_running > 1 if only its current task is migratable.
Since dl_rq->dl_nr_migratory becomes unused, remove it.
RT had the exact same mechanism (rt_rq->rt_nr_migratory) which was dropped in favour of relying on rt_rq->pushable_tasks, see:
612f769edd06 ("sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask")
Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Juri Lelli <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
612f769e |
| 11-Aug-2023 |
Valentin Schneider <[email protected]> |
sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask
Sebastian noted that the rto_push_work IRQ work can be queued for a CPU that has an empty pushable_tasks list, which means nothing useful
sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask
Sebastian noted that the rto_push_work IRQ work can be queued for a CPU that has an empty pushable_tasks list, which means nothing useful will be done in the IPI other than queue the work for the next CPU on the rto_mask.
rto_push_irq_work_func() only operates on tasks in the pushable_tasks list, but the conditions for that irq_work to be queued (and for a CPU to be added to the rto_mask) rely on rq_rt->nr_migratory instead.
nr_migratory is increased whenever an RT task entity is enqueued and it has nr_cpus_allowed > 1. Unlike the pushable_tasks list, nr_migratory includes a rt_rq's current task. This means a rt_rq can have a migratable current, N non-migratable queued tasks, and be flagged as overloaded / have its CPU set in the rto_mask, despite having an empty pushable_tasks list.
Make an rt_rq's overload logic be driven by {enqueue,dequeue}_pushable_task(). Since rt_rq->{rt_nr_migratory,rt_nr_total} become unused, remove them.
Note that the case where the current task is pushed away to make way for a migration-disabled task remains unchanged: the migration-disabled task has to be in the pushable_tasks list in the first place, which means it has nr_cpus_allowed > 1.
Reported-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Tested-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lore.kernel.org/r/[email protected]