Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14
f7d2728c | 17-Mar-2025 | Ingo Molnar <[email protected]>
sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
The scheduler has this special SCHED_WARN() facility that depends on CONFIG_SCHED_DEBUG.
Since CONFIG_SCHED_DEBUG is getting removed, convert SCHED_WARN() to WARN_ON_ONCE().
Note that the warning output isn't 100% equivalent. SCHED_WARN_ON() was defined as

	#define SCHED_WARN_ON(x) WARN_ONCE(x, #x)

so it would output the 'x' condition as well, while WARN_ON_ONCE() will only show a backtrace.
Hopefully these are rare enough to not really matter.
If it does, we should probably introduce a new WARN_ON() variant that outputs the condition in stringified form, or improve WARN_ON() itself.
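For reference, the kernel's existing WARN_ONCE() primitive makes such a variant a one-liner; a hypothetical sketch (the WARN_ON_ONCE_MSG name is illustrative, not committed code):

	/* Hypothetical: one-shot warning that also prints the stringified
	 * condition, built on the existing WARN_ONCE() primitive. */
	#define WARN_ON_ONCE_MSG(x)	WARN_ONCE(x, "%s\n", #x)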
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Shrikanth Hegde <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Revision tags: v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5
7d9da040 | 27-Dec-2024 | Chengming Zhou <[email protected]>
psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
When running hackbench in a cgroup with bandwidth throttling enabled, the following PSI splat was observed:
psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4
When investigating the series of events leading up to the splat, the following sequence was observed:

	[008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
	...
	[008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
	[008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8

	# CPU8 goes into newidle balance and releases the rq lock
	...
	# CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
	[015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14
	# Splat (cfs_rq->throttled=1)
	[015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008
	# Task has woken on a throttled hierarchy
	[008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...
psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags for the blocked entity; however, with the introduction of DELAY_DEQUEUE, the blocked task can wake up when newidle balance drops the runqueue lock during __schedule().
If a task wakes before psi_sched_switch() adjusts the PSI flags, skip any modifications in psi_enqueue(), which would otherwise still see the flags of a running task and not a blocked one. Instead, rely on psi_sched_switch() to do the right thing.
Since the status returned by try_to_block_task() may no longer be true by the time __schedule() reaches psi_sched_switch(), check whether the task is blocked using a combination of task_on_rq_queued() and p->se.sched_delayed checks.
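A minimal sketch of that blocked-task test (the helper name is illustrative; task_on_rq_queued() and p->se.sched_delayed are the scheduler primitives named above):

	/* Sketch: a delayed-dequeue task is still queued but logically
	 * blocked; a task that is no longer queued has fully dequeued. */
	static inline bool task_is_blocked(struct task_struct *p)
	{
		return !task_on_rq_queued(p) || p->se.sched_delayed;
	}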
[ prateek: Commit message, testing, early bailout in psi_enqueue() ]
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") # 1a6151017ee5
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Revision tags: v6.13-rc4
f65c64f3 | 20-Dec-2024 | Wang Yaxin <[email protected]>
delayacct: add delay min to record delay peak
Delay accounting can now calculate the average delay of processes, detect the overall system load, and also record the 'delay max' to identify potential abnormal delays. 'delay min' captures the other extreme: by comparing the difference between 'delay max' and 'delay min', we can understand the optimization space for latency, providing a reference for optimizing latency performance.
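A minimal sketch of the bookkeeping this implies (function and field handling are illustrative, not the delayacct implementation):

	/* Sketch: track both extremes of a delay sample; min == 0 means
	 * "no sample yet", matching the 0.000000ms that getdelays prints. */
	static void update_delay_peaks(u64 *min, u64 *max, u64 delay)
	{
		if (*min == 0 || delay < *min)
			*min = delay;
		if (delay > *max)
			*max = delay;
	}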
Use case
========
bash-4.4# ./getdelays -d -t 242
print delayacct stats ON
TGID	242

CPU          count   real total   virtual total   delay total   delay average   delay max    delay min
                39   156000000    156576579       2111069       0.054ms         0.212296ms   0.031307ms
IO           count   delay total   delay average   delay max    delay min
                 0   0             0.000ms         0.000000ms   0.000000ms
SWAP         count   delay total   delay average   delay max    delay min
                 0   0             0.000ms         0.000000ms   0.000000ms
RECLAIM      count   delay total   delay average   delay max    delay min
                 0   0             0.000ms         0.000000ms   0.000000ms
THRASHING    count   delay total   delay average   delay max    delay min
                 0   0             0.000ms         0.000000ms   0.000000ms
COMPACT      count   delay total   delay average   delay max    delay min
                 0   0             0.000ms         0.000000ms   0.000000ms
WPCOPY       count   delay total   delay average   delay max    delay min
               156   11215873      0.072ms         0.207403ms   0.033913ms
IRQ          count   delay total   delay average   delay max    delay min
                 0   0             0.000ms         0.000000ms   0.000000ms
Link: https://lkml.kernel.org/r/[email protected]
Co-developed-by: Wang Yong <[email protected]>
Signed-off-by: Wang Yong <[email protected]>
Co-developed-by: xu xin <[email protected]>
Signed-off-by: xu xin <[email protected]>
Signed-off-by: Wang Yaxin <[email protected]>
Co-developed-by: Kun Jiang <[email protected]>
Signed-off-by: Kun Jiang <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Fan Yu <[email protected]>
Cc: Peilin He <[email protected]>
Cc: tuqiang <[email protected]>
Cc: ye xingchen <[email protected]>
Cc: Yunkai Zhang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.13-rc3, v6.13-rc2
658eb5ab | 03-Dec-2024 | Wang Yaxin <[email protected]>
delayacct: add delay max to record delay peak
Introduce the use cases of delay max, which can help quickly detect potential abnormal delays in the system and record the types and specific details of delay spikes.
Problem
=======
Delay accounting can track the average delay of processes to show system workload. However, when a process experiences a significant delay, perhaps a delay spike, which adversely affects performance, getdelays can only display the average system delay over a period of time. Average delay is unhelpful for diagnosing a delay peak; it is not even possible to determine which type of delay has spiked, as this information may be masked by the average.
Solution
========
'delay max' displays the delay peak since the system's startup, recording potentially abnormal delays over time, including the type of delay and the maximum delay observed. This is helpful for quickly identifying problems caused by delay.
Use case
========
bash# ./getdelays -d -p 244
print delayacct stats ON
PID	244

CPU          count   real total   virtual total   delay total   delay average   delay max
                68   192000000    213676651       705643        0.010ms         0.306381ms
IO           count   delay total   delay average   delay max
                 0   0             0.000ms         0.000000ms
SWAP         count   delay total   delay average   delay max
                 0   0             0.000ms         0.000000ms
RECLAIM      count   delay total   delay average   delay max
                 0   0             0.000ms         0.000000ms
THRASHING    count   delay total   delay average   delay max
                 0   0             0.000ms         0.000000ms
COMPACT      count   delay total   delay average   delay max
                 0   0             0.000ms         0.000000ms
WPCOPY       count   delay total   delay average   delay max
               235   15648284      0.067ms         0.263842ms
IRQ          count   delay total   delay average   delay max
                 0   0             0.000ms         0.000000ms
[[email protected]: update docs and fix some spelling errors]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Co-developed-by: Wang Yong <[email protected]>
Signed-off-by: Wang Yong <[email protected]>
Co-developed-by: xu xin <[email protected]>
Signed-off-by: xu xin <[email protected]>
Co-developed-by: Wang Yaxin <[email protected]>
Signed-off-by: Wang Yaxin <[email protected]>
Signed-off-by: Kun Jiang <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Fan Yu <[email protected]>
Cc: Peilin He <[email protected]>
Cc: tuqiang <[email protected]>
Cc: Yang Yang <[email protected]>
Cc: ye xingchen <[email protected]>
Cc: Yunkai Zhang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4
1a615101 | 14-Oct-2024 | Johannes Weiner <[email protected]>
sched: psi: pass enqueue/dequeue flags to psi callbacks directly
What psi needs to do on each enqueue and dequeue has gotten more subtle, and the generic sched code trying to distill this into a bool for the callbacks is awkward.
Pass the flags directly and let psi parse them. For that to work, the #include "stats.h" (which has the psi callback implementations) needs to be below the flag definitions in "sched.h". Move that section further down, next to some of the other accounting stuff.
This also puts the ENQUEUE_SAVE/RESTORE branch behind the psi jump label, slightly reducing overhead when PSI=y but runtime disabled.
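A sketch of the resulting callback shape (bodies are illustrative; the flag names and the psi_disabled static key are existing scheduler symbols):

	static inline void psi_enqueue(struct task_struct *p, int flags)
	{
		if (static_branch_likely(&psi_disabled))
			return;
		/* save/restore cycles are not real state changes */
		if (flags & ENQUEUE_RESTORE)
			return;
		/* parse ENQUEUE_WAKEUP / ENQUEUE_MIGRATED and update PSI state */
	}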
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Revision tags: v6.12-rc3
c6508124 | 11-Oct-2024 | Johannes Weiner <[email protected]>
sched/psi: Fix mistaken CPU pressure indication after corrupted task state bug
Since sched_delayed tasks remain queued even after blocking, the load balancer can migrate them between runqueues while PSI considers them to be asleep. As a result, it misreads the migration requeue followed by a wakeup as a double queue:
psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=. set=4
First, call psi_enqueue() after p->sched_class->enqueue_task(). A wakeup will clear p->se.sched_delayed while a migration will not, so psi can use that flag to tell them apart.
Then teach psi to migrate any "sleep" state when delayed-dequeue tasks are being migrated.
Delayed-dequeue tasks can be revived by ttwu_runnable(), which will call down with a new ENQUEUE_DELAYED. Instead of further complicating the wakeup conditional in enqueue_task(), identify migration contexts instead and default to wakeup handling for all other cases.
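Sketched below is the distinction being drawn, assuming (per the above) that p->sched_class->enqueue_task() has already run; this is illustrative, not the literal kernel code:

	static void psi_enqueue_sketch(struct task_struct *p, bool migrate)
	{
		if (migrate && p->se.sched_delayed) {
			/* load-balancer migration of a delayed-dequeue task:
			 * carry its "sleep" PSI state to the new CPU's groups */
		} else {
			/* default to wakeup handling, including ttwu_runnable()
			 * reviving a delayed task with ENQUEUE_DELAYED */
		}
	}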
It's not just the warning in dmesg: the task state corruption causes a permanent CPU pressure indication, which messes with workload/machine health monitoring.
Debugged-by-and-original-fix-by: K Prateek Nayak <[email protected]>
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Closes: https://lore.kernel.org/lkml/[email protected]/
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: K Prateek Nayak <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Revision tags: v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5
ddae0ca2 | 18-Jun-2024 | John Stultz <[email protected]>
sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath
It was reported that in moving to 6.1, a larger than 10% regression was seen in the performance of clock_gettime(CLOCK_THREAD_CPUTIME_ID,...).
Using a simple reproducer, I found:

	5.10:
	  100000000 calls in 24345994193 ns => 243.460 ns per call
	  100000000 calls in 24288172050 ns => 242.882 ns per call
	  100000000 calls in 24289135225 ns => 242.891 ns per call

	6.1:
	  100000000 calls in 28248646742 ns => 282.486 ns per call
	  100000000 calls in 28227055067 ns => 282.271 ns per call
	  100000000 calls in 28177471287 ns => 281.775 ns per call
The cause of this was finally narrowed down to the addition of psi_account_irqtime() in update_rq_clock_task(), in commit 52b1364ba0b1 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure").
In my initial attempt to resolve this, I leaned towards moving all accounting work out of the clock_gettime() call path, but it wasn't very pretty, so it will have to wait for a later deeper rework. Instead, Peter shared this approach:
Rework psi_account_irqtime() to use its own psi_irq_time base for accounting, and move it out of the hotpath, calling it instead from sched_tick() and __schedule().
In testing this, we found the importance of ensuring psi_account_irqtime() is run under the rq_lock, which Johannes Weiner helpfully explained, so also add some lockdep annotations to make that requirement clear.
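A sketch of the reworked accounting under those constraints; psi_group_change_irq() is a hypothetical helper, while irq_time_read(), cpu_of() and lockdep_assert_rq_held() are existing kernel primitives, and rq->psi_irq_time is the private base the commit describes:

	static void psi_account_irqtime_sketch(struct rq *rq)
	{
		u64 now, delta;

		lockdep_assert_rq_held(rq);		/* must run under the rq lock */

		now = irq_time_read(cpu_of(rq));	/* cumulative IRQ time on this CPU */
		delta = now - rq->psi_irq_time;		/* own base, off the gettime path */
		rq->psi_irq_time = now;
		if (delta)
			psi_group_change_irq(rq, delta);	/* hypothetical helper */
	}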
With this change the performance is back in line with 5.10:

	6.1+fix:
	  100000000 calls in 24297324597 ns => 242.973 ns per call
	  100000000 calls in 24318869234 ns => 243.189 ns per call
	  100000000 calls in 24291564588 ns => 242.916 ns per call
Reported-by: Jimmy Shiu <[email protected]>
Originally-by: Peter Zijlstra <[email protected]>
Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Reviewed-by: Qais Yousef <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Revision tags: v6.10-rc4, v6.10-rc3, v6.10-rc2
402de7fc | 27-May-2024 | Ingo Molnar <[email protected]>
sched: Fix spelling in comments
Do a spell-checking pass.
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Revision tags: v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0
52b33d87 | 26-Sep-2022 | Chengming Zhou <[email protected]>
sched/psi: Use task->psi_flags to clear in CPU migration
The commit d583d360a620 ("psi: Fix psi state corruption when schedule() races with cgroup move") fixed a race problem by making cgroup_move_task() use task->psi_flags instead of looking at the scheduler state.
We can extend task->psi_flags usage to CPU migration, which should be a minor optimization for performance and code simplicity.
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Revision tags: v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3
52b1364b | 25-Aug-2022 | Chengming Zhou <[email protected]>
sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
PSI already tracks workload pressure stall information for CPU, memory and IO. Apart from these, IRQ/SOFTIRQ can have an obvious impact on some workloads' productivity, such as web service workloads.
With CONFIG_IRQ_TIME_ACCOUNTING, we can get the IRQ/SOFTIRQ delta time from update_rq_clock_task(), in which we can record that delta to the CPU's current task's cgroups as PSI_IRQ_FULL status.
Note that we don't use PSI_IRQ_SOME, since IRQ/SOFTIRQ always executes in the context of the current task on the CPU, so nothing productive can run even if it were runnable; hence we only use PSI_IRQ_FULL.
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
d79ddb06 | 25-Aug-2022 | Chengming Zhou <[email protected]>
sched/psi: Move private helpers to sched/stats.h
This patch moves the psi_task_change/psi_task_switch declarations out of the PSI public header, since they are only needed to implement the PSI stats tracking in sched/stats.h.
psi_task_switch is obviously private; psi_task_change can't be a public helper since it doesn't check the psi_disabled static key. And there are no external users now, so put it in sched/stats.h too.
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Revision tags: v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6
4ff8f2ca | 22-Feb-2022 | Ingo Molnar <[email protected]>
sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies
Remove all headers, except the ones required to make this header build standalone.
Also include stats.h in sched.h explicitly - dependencies already require this.
Summary of the build speedup gained through the last ~15 scheduler build & header dependency patches:
Cumulative scheduler (kernel/sched/) build time speedup on a Linux distribution's config, which enables all scheduler features, compared to the vanilla kernel:
 _____________________________________________________________________________
|
|  Vanilla kernel (v5.13-rc7):
|_____________________________________________________________________________
|
|  Performance counter stats for 'make -j96 kernel/sched/' (3 runs):
|
|    126,975,564,374 instructions      # 1.45  insn per cycle     ( +- 0.00% )
|     87,637,847,671 cycles            # 3.959 GHz                ( +- 0.30% )
|          22,136.96 msec cpu-clock    # 7.499 CPUs utilized      ( +- 0.29% )
|
|             2.9520 +- 0.0169 seconds time elapsed  ( +- 0.57% )
|_____________________________________________________________________________
|
|  Patched kernel:
|_____________________________________________________________________________
|
|  Performance counter stats for 'make -j96 kernel/sched/' (3 runs):
|
|     50,420,496,914 instructions      # 1.47  insn per cycle     ( +- 0.00% )
|     34,234,322,038 cycles            # 3.946 GHz                ( +- 0.31% )
|           8,675.81 msec cpu-clock    # 3.053 CPUs utilized      ( +- 0.45% )
|
|             2.8420 +- 0.0181 seconds time elapsed  ( +- 0.64% )
|_____________________________________________________________________________
Summary:
- CPU time used to build the scheduler dropped by -60.9%, a reduction from 22.1 clock-seconds to 8.7 clock-seconds.
- Wall-clock time to build the scheduler dropped by -3.9%, a reduction from 2.95 seconds to 2.84 seconds.
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Peter Zijlstra <[email protected]>
Revision tags: v5.17-rc5, v5.17-rc4
b9e9c6ca | 13-Feb-2022 | Ingo Molnar <[email protected]>
sched/headers: Standardize kernel/sched/sched.h header dependencies
kernel/sched/sched.h is a weird mix of ad-hoc headers included in the middle of the header.
Two of them rely on being included in the middle of kernel/sched/sched.h, due to definitions they require:
- "stat.h" needs the rq definitions. - "autogroup.h" needs the task_group definition.
Move the inclusion of these two files out of kernel/sched/sched.h, and include them in all files that require them.
Move the rest of the header dependencies to the top of the kernel/sched/sched.h file.
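Illustratively, a compilation unit that needs those definitions now pulls them in directly after sched.h (the file chosen here is an example; the commit touches all users):

	/* e.g. kernel/sched/fair.c */
	#include "sched.h"	/* provides rq and task_group */
	#include "stats.h"	/* relies on the rq definitions */
	#include "autogroup.h"	/* relies on the task_group definition */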
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Peter Zijlstra <[email protected]>
Revision tags: v5.17-rc3, v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2
d90a2f16 | 20-Nov-2021 | Ingo Molnar <[email protected]>
sched/headers: Add header guard to kernel/sched/stats.h and kernel/sched/autogroup.h
Protect against multiple inclusion.
Also include "sched.h" in "stat.h", as it relies on it.
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Peter Zijlstra <[email protected]>
Revision tags: v5.16-rc1
cb0e52b7 | 10-Nov-2021 | Brian Chen <[email protected]>
psi: Fix PSI_MEM_FULL state when tasks are in memstall and doing reclaim
We've noticed cases where tasks in a cgroup are stalled on memory but there is little memory FULL pressure since tasks stay on the runqueue in reclaim.
A simple example involves a single threaded program that keeps leaking and touching large amounts of memory. It runs in a cgroup with swap enabled, memory.high set at 10M and cpu.max ratio set at 5%. Though there is significant CPU pressure and memory SOME, there is barely any memory FULL since the task enters reclaim and stays on the runqueue. However, this memory-bound task is effectively stalled on memory and we expect memory FULL to match memory SOME in this scenario.
The code is confused about memstall && running, thinking there is a stalled task and a productive task when there's only one task: a reclaimer that's counted as both. To fix this, we redefine the condition for PSI_MEM_FULL to check that all running tasks are in an active memstall instead of checking that there are no running tasks.
	case PSI_MEM_FULL:
-		return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
+		return unlikely(tasks[NR_MEMSTALL] &&
+				tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
This will capture reclaimers. It will also capture tasks that called psi_memstall_enter() and are about to sleep, but this should be negligible noise.
Signed-off-by: Brian Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Revision tags: v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1
60f2415e | 05-Sep-2021 | Yafang Shao <[email protected]>
sched: Make schedstats helpers independent of fair sched class
The original prototypes of the schedstats helpers are
update_stats_wait_*(struct cfs_rq *cfs_rq, struct sched_entity *se)
The cfs_rq in these helpers is used to get the rq_clock, and the se is used to get the struct sched_statistics and the struct task_struct. In order to make these helpers available by all sched classes, we can pass the rq, sched_statistics and task_struct directly.
Then the new helpers are
update_stats_wait_*(struct rq *rq, struct task_struct *p, struct sched_statistics *stats)
which are independent of fair sched class.
To avoid vmlinux growing too large or introducing overhead when !schedstat_enabled(), some new helpers gated behind schedstat_enabled() are also introduced, as suggested by Mel. These helpers are in sched/stats.c:
__update_stats_wait_*(struct rq *rq, struct task_struct *p, struct sched_statistics *stats)
The size of vmlinux is as follows:

	                     Before       After
	  Size of vmlinux    826308552    826304640

The size is a little smaller, as some functions are no longer inlined after the change.
I also compared sched performance with 'perf bench sched pipe', as suggested by Mel. The results are as follows (in usecs/op):

	                               Before     After
	  kernel.sched_schedstats=0    5.2~5.4    5.2~5.4
	  kernel.sched_schedstats=1    5.3~5.5    5.3~5.5

[These data differ a little from the previous version because my old test machine was destroyed, so I had to use a different test machine.] Almost no difference.
No functional change.
[[email protected]: reported build failure in prev version]
Signed-off-by: Yafang Shao <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
ceeadb83 | 05-Sep-2021 | Yafang Shao <[email protected]>
sched: Make struct sched_statistics independent of fair sched class
If we want to use the schedstats facility to trace other sched classes, we should make it independent of the fair sched class. The struct sched_statistics is the scheduler statistics of a task_struct or a task_group, so we can move it into struct task_struct and struct task_group to achieve that goal.
After the patch, schedstats are organized as follows:

	struct task_struct {
		...
		struct sched_entity		se;
		struct sched_rt_entity		rt;
		struct sched_dl_entity		dl;
		...
		struct sched_statistics		stats;
		...
	};

Regarding the task group, schedstats is only supported for fair group sched, and a new struct sched_entity_stats is introduced, as suggested by Peter:

	struct sched_entity_stats {
		struct sched_entity		se;
		struct sched_statistics		stats;
	} __no_randomize_layout;
Then with the se in a task_group, we can easily get the stats.
The sched_statistics members may be frequently modified when schedstats is enabled; in order to avoid impacting random data which may be in the same cacheline, the struct sched_statistics is defined as cacheline aligned.
As this patch changes a core struct of the scheduler, I verified the performance impact with 'perf bench sched pipe', as suggested by Mel. Below are the results, in which all values are in usecs/op:

	                               Before     After
	  kernel.sched_schedstats=0    5.2~5.4    5.2~5.4
	  kernel.sched_schedstats=1    5.3~5.5    5.3~5.5

[These data differ a little from the earlier version because my old test machine was destroyed, so I had to use a different test machine.]
Almost no impact on the sched performance.
No functional change.
[[email protected]: reported build failure in earlier version]
Signed-off-by: Yafang Shao <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Revision tags: v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6
b03fbd4f | 11-Jun-2021 | Peter Zijlstra <[email protected]>
sched: Introduce task_is_running()
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper: task_is_running(p).
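A minimal sketch of the helper as described (the exact kernel definition may differ in detail):

	#define task_is_running(p)	(READ_ONCE((p)->state) == TASK_RUNNING)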
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Davidlohr Bueso <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
Acked-by: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Revision tags: v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2
90a0ff4e | 12-May-2021 | Peter Zijlstra <[email protected]>
sched,stats: Further simplify sched_info
There's no point doing delta==0 updates.
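A sketch of the guard this implies (function context is illustrative; sched_info.run_delay is the real field):

	static inline void sched_info_update_sketch(struct task_struct *t,
						    unsigned long long delta)
	{
		if (!delta)	/* no point doing delta==0 updates */
			return;
		t->sched_info.run_delay += delta;
	}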
Suggested-by: Mel Gorman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Revision tags: v5.13-rc1
c5895d3f | 04-May-2021 | Peter Zijlstra <[email protected]>
sched: Simplify sched_info_on()
The situation around sched_info is somewhat complicated: it is used by sched_stats and delayacct and, indirectly, kvm.
If SCHEDSTATS=Y (but disabled by default) sched_info_on() is unconditionally true -- this is the case for all distro kernel configs I checked.
If for some reason SCHEDSTATS=N, but TASK_DELAY_ACCT=Y, then sched_info_on() can return false when delayacct is disabled, presumably because there would be no other users left; except kvm is.
Instead of complicating matters further by accurately accounting sched_stat and kvm state, simply unconditionally enable when SCHED_INFO=Y, matching the common distro case.
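The simplification amounts to something like this sketch:

	/* Sketch: unconditionally enabled whenever CONFIG_SCHED_INFO=y,
	 * matching the common distro configuration. */
	#ifdef CONFIG_SCHED_INFO
	static inline bool sched_info_on(void)
	{
		return true;
	}
	#else
	static inline bool sched_info_on(void)
	{
		return false;
	}
	#endif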
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4e29fb70 | 04-May-2021 | Peter Zijlstra <[email protected]>
sched: Rename sched_info_{queued,dequeued}
For consistency, rename {queued,dequeued} to {enqueue,dequeue}.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Revision tags: v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2
4117cebf | 03-Mar-2021 | Chengming Zhou <[email protected]>
psi: Optimize task switch inside shared cgroups
The commit 36b238d57172 ("psi: Optimize switching tasks inside shared cgroups") only updates cgroups whose state actually changes during a task switch in the task preempt case, not in the task sleep case.
We don't actually need to clear and set the TSK_ONCPU state for the common cgroups of the next and prev tasks in the sleep case; skipping that saves many psi_group_change() calls, especially when most activity comes from one leaf cgroup.

sleep before:
	psi_dequeue()
		while ((group = iterate_groups(prev)))		# all ancestors
			psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
	psi_task_switch()
		while ((group = iterate_groups(next)))		# all ancestors
			psi_group_change(next, .set=TSK_ONCPU)

sleep after:
	psi_dequeue()
		nop
	psi_task_switch()
		while ((group = iterate_groups(next)))		# until (prev & next)
			psi_group_change(next, .set=TSK_ONCPU)
		while ((group = iterate_groups(prev)))		# all ancestors
			psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)
When a voluntary sleep switches to another task, we remove one call of psi_group_change() for every common cgroup ancestor of the two tasks.
Co-developed-by: Muchun Song <[email protected]>
Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
7fae6c81 | 03-Mar-2021 | Chengming Zhou <[email protected]>
psi: Use ONCPU state tracking machinery to detect reclaim
Move the reclaim detection from the timer tick to the task state tracking machinery using the recently added ONCPU state. We also add a check of task psi_flags changes in the psi_task_switch() optimization to update the parents properly.
In terms of performance and cost, this ONCPU task state tracking is not cheaper than the previous timer tick in aggregate. But the code is simpler and shorter this way, so it's a maintainability win. Johannes also did some testing with perf bench; the performance and cost changes should be acceptable for real workloads.
Thanks to Johannes Weiner for pointing out the psi_task_switch() optimization things and the clearer changelog.
Co-developed-by: Muchun Song <[email protected]>
Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Revision tags: v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1, v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1, v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5, v5.8-rc4, v5.8-rc3, v5.8-rc2, v5.8-rc1, v5.7, v5.7-rc7, v5.7-rc6, v5.7-rc5, v5.7-rc4, v5.7-rc3, v5.7-rc2, v5.7-rc1, v5.6, v5.6-rc7
1066d1b6 | 17-Mar-2020 | Yafang Shao <[email protected]>
psi: Move PF_MEMSTALL out of task->flags
task->flags is a 32-bit field, of which 31 bits have already been consumed, so it is hard to introduce another per-process flag. There is currently still enough space in the bit-field section of task_struct, so we can define the memstall state as a single bit there instead. This patch also removes an out-of-date comment pointed out by Matthew.
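A sketch of the resulting layout (placement within task_struct is illustrative; in_memstall is the field name the change introduces):

	struct task_struct {
		/* ... */
		/* memstall state, moved out of the exhausted task->flags: */
		unsigned			in_memstall:1;
		/* ... */
	};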
Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Yafang Shao <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
36b238d5 | 16-Mar-2020 | Johannes Weiner <[email protected]>
psi: Optimize switching tasks inside shared cgroups
When switching tasks running on a CPU, the psi state of a cgroup containing both of these tasks does not change. Right now, we don't exploit that, and can perform many unnecessary state changes in nested hierarchies, especially when most activity comes from one leaf cgroup.
This patch implements an optimization where we only update cgroups whose state actually changes during a task switch. These are all cgroups that contain one task but not the other, up to the first shared ancestor. When both tasks are in the same group, we don't need to update anything at all.
We can identify the first shared ancestor by walking the groups of the incoming task until we see TSK_ONCPU set on the local CPU; that's the first group that also contains the outgoing task.
The new psi_task_switch() is similar to psi_task_change(). To allow code reuse, move the task flag maintenance code into a new function and the poll/avg worker wakeups into the shared psi_group_change().
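A simplified sketch of that walk (iterator and per-CPU details abbreviated; not the literal implementation):

	static void psi_task_switch_sketch(struct task_struct *prev,
					   struct task_struct *next, int cpu)
	{
		struct psi_group *group;
		void *iter = NULL;

		/* Set TSK_ONCPU on next's groups, walking upward; the first
		 * group that already has TSK_ONCPU set on this CPU also
		 * contains prev, so everything from there up is shared. */
		while ((group = iterate_groups(next, &iter))) {
			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU])
				break;
			psi_group_change(group, cpu, 0, TSK_ONCPU);
		}

		/* ...then clear TSK_ONCPU on prev's groups up to, but not
		 * including, the first shared ancestor found above. */
	}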
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]