Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4

bbce3de7 | 25-Apr-2025 | Omar Sandoval <[email protected]>

sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
There is a code path in dequeue_entities() that can set the slice of a sched_entity to U64_MAX, which sometimes results in a crash.
The offending case is when dequeue_entities() is called to dequeue a delayed group entity, and then the entity's parent's dequeue is delayed. In that case:
1. In the if (entity_is_task(se)) else block at the beginning of dequeue_entities(), slice is set to cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
2. The first for_each_sched_entity() loop dequeues the entity.
3. If the entity was its parent's only child, then the next iteration tries to dequeue the parent.
4. If the parent's dequeue needs to be delayed, then it breaks from the first for_each_sched_entity() loop _without updating slice_.
5. The second for_each_sched_entity() loop sets the parent's ->slice to the saved slice, which is still U64_MAX.
This throws off subsequent calculations with potentially catastrophic results. A manifestation we saw in production was:
6. In update_entity_lag(), se->slice is used to calculate limit, which ends up as a huge negative number.
7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit is negative, vlag > limit, so se->vlag is set to the same huge negative number.
8. In place_entity(), se->vlag is scaled, which overflows and results in another huge (positive or negative) number.
9. The adjusted lag is subtracted from se->vruntime, which increases or decreases se->vruntime by a huge number.
10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which incorrectly returns false because the vruntime is so far from the other vruntimes on the queue, causing the (vruntime - cfs_rq->min_vruntime) * load calculation to overflow.
11. Nothing appears to be eligible, so pick_eevdf() returns NULL.
12. pick_next_entity() tries to dereference the return value of pick_eevdf() and crashes.
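To make steps 6-7 concrete, here is a minimal user-space model (not the kernel code; clamp() is reimplemented with the kernel's min(max()) evaluation order, and the limit value is taken from the commit's numbers) showing how a negative limit makes clamp() return the bogus bound:

  #include <stdio.h>
  #include <inttypes.h>

  /* same evaluation order as the kernel's clamp(): min(max(val, lo), hi) */
  static int64_t clamp64(int64_t val, int64_t lo, int64_t hi)
  {
          if (val < lo)
                  val = lo;
          if (val > hi)
                  val = hi;
          return val;
  }

  int main(void)
  {
          int64_t vlag  = -1123;        /* a perfectly ordinary lag */
          int64_t limit = -588775424;   /* huge negative limit from a bogus slice */

          /* se->vlag = clamp(vlag, -limit, limit) with limit < 0:
           * lo > hi, so the bogus negative bound wins */
          printf("se->vlag = %" PRId64 "\n", clamp64(vlag, -limit, limit));
          return 0;
  }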
Dumping the cfs_rq states from the core dumps with drgn showed tell-tale huge vruntime ranges and bogus vlag values, and I also traced se->slice being set to U64_MAX on live systems (which was usually "benign" since the rest of the runqueue needed to be in a particular state to crash).
Fix it in dequeue_entities() by always setting slice from the first non-empty cfs_rq.
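A toy model of the fixed flow (hypothetical names and a flattened two-level hierarchy; the real code walks the se hierarchy in dequeue_entities()): slice is only latched from a non-empty cfs_rq, so breaking out early can no longer leave the U64_MAX sentinel behind:

  #include <stdint.h>
  #include <stdio.h>

  /* stand-in for a cfs_rq level; min_slice mimics cfs_rq_min_slice(),
   * which returns U64_MAX for an empty runqueue */
  struct level { int h_nr_queued; uint64_t min_slice; };

  static uint64_t min_slice(const struct level *l)
  {
          return l->h_nr_queued ? l->min_slice : UINT64_MAX;
  }

  int main(void)
  {
          /* delayed (empty) child entity, then its populated parent */
          struct level hierarchy[] = { { 0, 0 }, { 2, 700000 } };
          uint64_t slice = UINT64_MAX;

          for (int i = 0; i < 2; i++) {
                  /* the fix: refresh slice only from a non-empty cfs_rq */
                  if (hierarchy[i].h_nr_queued)
                          slice = min_slice(&hierarchy[i]);
                  /* ...a delayed parent dequeue may break out here... */
          }
          printf("slice = %llu\n", (unsigned long long)slice); /* 700000 */
          return 0;
  }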
Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com

Revision tags: v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14

dd5bdaf2 | 17-Mar-2025 | Ingo Molnar <[email protected]>

sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional
All the big Linux distros enable CONFIG_SCHED_DEBUG, because the various features it provides help not just with kernel development, but with system administration and user-space software development as well.
Reflect this reality and enable this functionality unconditionally.
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Shrikanth Hegde <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

57903f72 | 17-Mar-2025 | Ingo Molnar <[email protected]>

sched/debug: Make 'const_debug' tunables unconditional __read_mostly
With CONFIG_SCHED_DEBUG becoming unconditional, remove the extra 'const_debug' indirection towards __read_mostly.
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Shrikanth Hegde <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

f7d2728c | 17-Mar-2025 | Ingo Molnar <[email protected]>

sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
The scheduler has this special SCHED_WARN_ON() facility that depends on CONFIG_SCHED_DEBUG.
Since CONFIG_SCHED_DEBUG is getting removed, convert SCHED_WARN_ON() to WARN_ON_ONCE().
Note that the warning output isn't 100% equivalent:
#define SCHED_WARN_ON(x) WARN_ONCE(x, #x)
Because SCHED_WARN_ON() would output the 'x' condition as well, while WARN_ON_ONCE() will only show a backtrace.
Hopefully these are rare enough to not really matter.
If it does, we should probably introduce a new WARN_ON() variant that outputs the condition in stringified form, or improve WARN_ON() itself.
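As a sketch of that suggested variant (purely hypothetical, not an existing kernel macro; user-space C with a GCC-style statement expression), a warn-once helper that keeps the stringified condition could look like:

  #include <stdio.h>

  /* hypothetical WARN_ON() variant: warns once and prints #cond,
   * approximating what SCHED_WARN_ON() used to show */
  #define WARN_ON_ONCE_COND(cond)                                        \
          ({                                                             \
                  static int __warned;                                   \
                  int __ret = !!(cond);                                  \
                  if (__ret && !__warned) {                              \
                          __warned = 1;                                  \
                          fprintf(stderr, "WARNING: %s:%d: %s\n",        \
                                  __FILE__, __LINE__, #cond);            \
                  }                                                      \
                  __ret;                                                 \
          })

  int main(void)
  {
          for (int i = 0; i < 3; i++)
                  WARN_ON_ONCE_COND(i > 1);  /* prints "i > 1" exactly once */
          return 0;
  }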
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Shrikanth Hegde <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Revision tags: v6.14-rc7, v6.14-rc6

3b4035dd | 04-Mar-2025 | Zecheng Li <[email protected]>

sched/fair: Fix potential memory corruption in child_cfs_rq_on_list
child_cfs_rq_on_list attempts to convert a 'prev' pointer to a cfs_rq. This 'prev' pointer can originate from struct rq's leaf_cfs_rq_list, making the conversion invalid and potentially leading to memory corruption. Depending on the relative positions of leaf_cfs_rq_list and the task group (tg) pointer within the struct, this can cause a memory fault or access garbage data.
The issue arises in list_add_leaf_cfs_rq, where both cfs_rq->leaf_cfs_rq_list and rq->leaf_cfs_rq_list are added to the same leaf list. Also, rq->tmp_alone_branch can be set to rq->leaf_cfs_rq_list.
This adds a check `if (prev == &rq->leaf_cfs_rq_list)` after the main conditional in child_cfs_rq_on_list. This ensures that the container_of operation will convert a correct cfs_rq struct.
This check is sufficient because only cfs_rqs on the same CPU are added to the list, so verifying the 'prev' pointer against the current rq's list head is enough.
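A user-space sketch of why the guard matters (simplified stand-in structs, not the scheduler's real types): container_of() on the rq's bare list head computes a pointer into memory that is not a cfs_rq:

  #include <stddef.h>
  #include <stdio.h>

  struct list_head { struct list_head *next, *prev; };

  #define container_of(ptr, type, member) \
          ((type *)((char *)(ptr) - offsetof(type, member)))

  /* simplified stand-ins for the scheduler structures */
  struct cfs_rq { int tg_id; struct list_head leaf_cfs_rq_list; };
  struct rq     { struct list_head leaf_cfs_rq_list; };

  int main(void)
  {
          struct rq rq = { { &rq.leaf_cfs_rq_list, &rq.leaf_cfs_rq_list } };
          struct list_head *prev = &rq.leaf_cfs_rq_list;

          /* the check the patch adds: 'prev' may be the rq's own list
           * head, which is not embedded in any cfs_rq */
          if (prev == &rq.leaf_cfs_rq_list) {
                  puts("prev is the rq list head, not a cfs_rq; bail out");
                  return 0;
          }

          /* without the guard, this conversion points at garbage */
          struct cfs_rq *bogus =
                  container_of(prev, struct cfs_rq, leaf_cfs_rq_list);
          printf("bogus tg_id = %d\n", bogus->tg_id);
          return 0;
  }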
This fixes a potential memory corruption issue that, due to the current struct layout, might not manifest as a crash, but could lead to unpredictable behavior when the layout changes.
Fixes: fdaba61ef8a2 ("sched/fair: Ensure that the CFS parent is added after unthrottling")
Signed-off-by: Zecheng Li <[email protected]>
Reviewed-and-tested-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2

ee13da87 | 05-Feb-2025 | Nam Cao <[email protected]>

sched: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism.
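The conversion pattern looks roughly like this kernel-style fragment (illustrative only, not a standalone program; the scheduler's hrtick timer is an assumed example of an affected call site):

  /* before: two-step init, callback assigned by hand */
  hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
  rq->hrtick_timer.function = hrtick;

  /* after: one call sets clock, mode and callback together */
  hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);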
Signed-off-by: Nam Cao <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/a55e849cba3c41b4c5708be6ea6be6f337d1a8fb.1738746821.git.namcao@linutronix.de

d34e7980 | 10-Feb-2025 | I Hsin Cheng <[email protected]>

sched/fair: Refactor can_migrate_task() to eliminate looping
The function can_migrate_task() utilizes for_each_cpu_and() with an "if" statement inside to find the destination cpu. It is the same logic as finding the first set bit of the result of the bitwise-AND of "env->dst_grpmask", "env->cpus" and "p->cpus_ptr".
Refactor it by using cpumask_first_and_and() to perform the bitwise-AND of "env->dst_grpmask", "env->cpus" and "p->cpus_ptr" and pick the first cpu within the intersection as the destination cpu, so we can eliminate the need for looping and multiple branches.
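A user-space model of the semantics (plain unsigned longs standing in for struct cpumask; the mask values are made up): the three-way AND plus first-set-bit that cpumask_first_and_and() performs in a single call:

  #include <stdio.h>

  int main(void)
  {
          unsigned long dst_grpmask = 0xF0;  /* CPUs 4-7 in the dest group */
          unsigned long env_cpus    = 0xFC;  /* CPUs 2-7 usable for balancing */
          unsigned long cpus_ptr    = 0x30;  /* task allowed on CPUs 4-5 */

          /* equivalent of cpumask_first_and_and(): AND all three masks,
           * then take the lowest set bit */
          unsigned long allowed = dst_grpmask & env_cpus & cpus_ptr;
          int cpu = allowed ? __builtin_ctzl(allowed) : -1;

          printf("new_dst_cpu = %d\n", cpu);  /* -> 4 */
          return 0;
  }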
After the refactoring this part of the code can speed up from ~115ns to ~54ns, according to the test below.
The test was run 5 times; the results are shown in the following table, and the test script is pasted in the next section.
  -------------------------------------------------------
  |Old method| 130| 118| 115| 109| 106| avg ~115ns|
  -------------------------------------------------------
  |New method|  58|  55|  54|  48|  55| avg  ~54ns|
  -------------------------------------------------------
Signed-off-by: I Hsin Cheng <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

563bc216 | 11-Feb-2025 | Tianchen Ding <[email protected]>

sched/eevdf: Force propagating min_slice of cfs_rq when {en,de}queue tasks
When a task is enqueued and its parent cgroup se is already on_rq, this parent cgroup se will not be enqueued again, and hence root->min_slice remains unchanged. The same issue happens when a task is dequeued and its parent cgroup se has other runnable entities, in which case the parent cgroup se will not be dequeued.
Force propagating min_slice when se doesn't need to be enqueued or dequeued. This ensures the se hierarchy always gets the latest min_slice.
Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Tianchen Ding <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2ae891b8 | 08-Feb-2025 | zihan zhou <[email protected]>

sched: Reduce the default slice to avoid tasks getting an extra tick
The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which means that we have a default slice of:
  0.75 for 1 cpu
  1.50 up to 3 cpus
  2.25 up to 7 cpus
  3.00 for 8 cpus and above.
For HZ=250 and HZ=100, because of the tick accuracy, the runtime of tasks is far higher than their slice.
For HZ=1000 with 8 cpus or more, the accuracy of tick is already satisfactory, but there is still an issue that tasks will get an extra tick because the tick often arrives a little faster than expected. In this case, the task can only wait until the next tick to consider that it has reached its deadline, and will run 1ms longer.
  vruntime + sysctl_sched_base_slice = deadline
          |-----------|-----------|-----------|-----------|
               1ms         1ms         1ms         1ms
                ^           ^           ^           ^
              tick1       tick2       tick3       tick4 (nearly 4ms)
There are two reasons for tick error: clockevent precision and CONFIG_IRQ_TIME_ACCOUNTING/CONFIG_PARAVIRT_TIME_ACCOUNTING. With CONFIG_IRQ_TIME_ACCOUNTING every tick will be less than 1ms, but even without it, because of clockevent precision, the tick is still often less than 1ms.
In order to make scheduling more precise, we changed 0.75 to 0.70. Using 0.70 instead of 0.75 should not change much for other configs and would fix this issue:
  0.70 for 1 cpu
  1.40 up to 3 cpus
  2.10 up to 7 cpus
  2.80 for 8 cpus and above.
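Both tables can be reproduced with a quick user-space check (assuming ilog(ncpus) is floor(log2) with the cpu count capped at 8, matching the plateau in the tables above; compile with -lm):

  #include <stdio.h>
  #include <math.h>

  /* base slice in msec: multiplier * (1 + ilog2(min(ncpus, 8))) */
  static double slice_ms(double mult, int ncpus)
  {
          if (ncpus > 8)
                  ncpus = 8;
          return mult * (1 + (int)log2(ncpus));
  }

  int main(void)
  {
          int cpus[] = { 1, 2, 4, 8, 16 };
          for (int i = 0; i < 5; i++)
                  printf("%2d cpus: old %.2f ms, new %.2f ms\n", cpus[i],
                         slice_ms(0.75, cpus[i]), slice_ms(0.70, cpus[i]));
          return 0;
  }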
This does not guarantee that tasks can run the slice time accurately every time, but occasionally running an extra tick has little impact.
Signed-off-by: zihan zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

f553741a | 08-Feb-2025 | zihan zhou <[email protected]>

sched: Cancel the slice protection of the idle entity
A wakeup non-idle entity should preempt an idle entity at any time, but because of the idle entity's slice protection, the non-idle entity has to wait, so just cancel that protection.
This patch is aimed at minimizing the impact of SCHED_IDLE on SCHED_NORMAL. For example, a task with SCHED_IDLE policy that sleeps for 1s and then runs for 3 ms, running cyclictest on the same cpu, has a maximum latency of 3 ms, which is caused by the slice protection of the idle entity. That is unreasonable. With this patch, the cyclictest latency under the same conditions is basically the same on a cpu with idle processes as on an empty cpu.
[peterz: add helpers]
Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
Signed-off-by: zihan zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

Revision tags: v6.14-rc1

1751f872 | 28-Jan-2025 | Joel Granados <[email protected]>

treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/infiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers.
Created this by running an spatch followed by a sed command:

Spatch:

    virtual patch

    @ depends on !(file in "net") disable optional_qualifier @
    identifier table_name != {
            watchdog_hardlockup_sysctl,
            iwcm_ctl_table,
            ucma_ctl_table,
            memory_allocation_profiling_sysctls,
            loadpin_sysctl_table
    };
    @@

    + const struct ctl_table table_name [] = { ... };

sed:

    sed --in-place \
        -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
        kernel/utsname_sysctl.c
Reviewed-by: Song Liu <[email protected]>
Acked-by: Steven Rostedt (Google) <[email protected]> # for kernel/trace/
Reviewed-by: Martin K. Petersen <[email protected]> # SCSI
Reviewed-by: Darrick J. Wong <[email protected]> # xfs
Acked-by: Jani Nikula <[email protected]>
Acked-by: Corey Minyard <[email protected]>
Acked-by: Wei Liu <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Reviewed-by: Bill O'Donnell <[email protected]>
Acked-by: Baoquan He <[email protected]>
Acked-by: Ashutosh Dixit <[email protected]>
Acked-by: Anna Schumaker <[email protected]>
Signed-off-by: Joel Granados <[email protected]>

Revision tags: v6.13

3429dd57 | 17-Jan-2025 | K Prateek Nayak <[email protected]>

sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an entity is delayed irrespective of whether the entity corresponds to a task or a cfs_rq.
Consider the following scenario:
                        root
                       /    \
                      A      B (*) delayed since B is no longer eligible on root
                      |      |
                    Task0  Task1 <--- dequeue_task_fair() - task blocks
When Task1 blocks (dequeue_entity() for task's se returns true), dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the hierarchy of Task1. However, when the sched_entity corresponding to cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the hierarchy too leading to both dequeue_entity() and set_delayed() decrementing h_nr_runnable for the dequeue of the same task.
A SCHED_WARN_ON() to inspect h_nr_runnable post its update in dequeue_entities() like below:
    cfs_rq->h_nr_runnable -= h_nr_runnable;
    SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
is consistently tripped when running wakeup intensive workloads like hackbench in a cgroup.
This error is self-correcting since cfs_rqs are per-cpu and cannot migrate. The entity is either picked for full dequeue or is requeued when a task wakes up below it. Both those paths call clear_delayed(), which again increments h_nr_runnable of the hierarchy without considering whether the entity corresponds to a task or not.
h_nr_runnable will eventually reflect the correct value however in the interim, the incorrect values can still influence PELT calculation which uses se->runnable_weight or cfs_rq->h_nr_runnable.
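A toy model of the double decrement described in the scenario above (plain C; one task, the per-cpu counter reduced to a single int):

  #include <stdio.h>

  int main(void)
  {
          int h_nr_runnable = 1; /* Task1 is the hierarchy's only runnable task */

          h_nr_runnable -= 1;    /* dequeue_entities(): Task1 blocks */
          h_nr_runnable -= 1;    /* set_delayed() on cfs_rq B counts it again */

          /* mirrors SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0) */
          if (h_nr_runnable < 0)
                  printf("h_nr_runnable = %d, the warning trips\n",
                         h_nr_runnable);
          return 0;
  }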
Since only delayed tasks take the early return path in dequeue_entities() and enqueue_task_fair(), adjust the h_nr_runnable in {set,clear}_delayed() only when a task is delayed as this path skips the h_nr_* update loops and returns early.
For entities corresponding to cfs_rq, the h_nr_* update loop in the caller will do the right thing.
Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Tested-by: Swapnil Sapkal <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

Revision tags: v6.13-rc7, v6.13-rc6, v6.13-rc5

3229adbe | 23-Dec-2024 | K Prateek Nayak <[email protected]>

sched/fair: Do not compute overloaded status unnecessarily during lb
Only set sg_overloaded when computing sg_lb_stats() at the highest sched domain since rd->overloaded status is updated only when load balancing at the highest domain. While at it, move setting of sg_overloaded below idle_cpu() check since an idle CPU can never be overloaded.
Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Reviewed-by: Shrikanth Hegde <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

0ac1ee9e | 23-Dec-2024 | K Prateek Nayak <[email protected]>

sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
Aggregate nr_numa_running and nr_preferred_running when load balancing at NUMA domains only. While at it, also move the aggregation below the idle_cpu() check since an idle CPU cannot have any preferred tasks.
Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Shrikanth Hegde <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

873199d2 | 23-Dec-2024 | Hao Jia <[email protected]>

sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
When the PLACE_LAG scheduling feature is enabled and dst_cfs_rq->nr_queued is greater than 1, if a task is ineligible (lag < 0) on the source cpu runqueue, it will also be ineligible when it is migrated to the destination cpu runqueue, because we keep the task's original equivalent lag in place_entity(). So if the task was ineligible before, it will still be ineligible after migration.
So in sched_balance_rq(), we prioritize migrating eligible tasks, and we soft-limit ineligible tasks, allowing them to migrate only when nr_balance_failed is non-zero to avoid load-balancing trying very hard to balance the load.
Below are some benchmark test results. From my test results, this patch shows a slight improvement on hackbench.
Benchmark
=========
All of the benchmarks are done inside a normal cpu cgroup in a clean environment with cpu turbo disabled, and test machine is:
Single NUMA machine model is 13th Gen Intel(R) Core(TM) i7-13700, 12 Core/24 HT.
Based on master b86545e02e8c.
Results
=======
hackbench-process-pipes
                     vanilla            patched
Amean     1        0.5837 (  0.00%)   0.5733 (  1.77%)
Amean     4        1.4423 (  0.00%)   1.4503 ( -0.55%)
Amean     7        2.5147 (  0.00%)   2.4773 (  1.48%)
Amean     12       3.9347 (  0.00%)   3.8880 (  1.19%)
Amean     21       5.3943 (  0.00%)   5.3873 (  0.13%)
Amean     30       6.7840 (  0.00%)   6.6660 (  1.74%)
Amean     48       9.8313 (  0.00%)   9.6100 (  2.25%)
Amean     79      15.4403 (  0.00%)  14.9580 (  3.12%)
Amean     96      18.4970 (  0.00%)  17.9533 (  2.94%)
hackbench-process-sockets
                     vanilla            patched
Amean     1        0.6297 (  0.00%)   0.6223 (  1.16%)
Amean     4        2.1517 (  0.00%)   2.0887 (  2.93%)
Amean     7        3.6377 (  0.00%)   3.5670 (  1.94%)
Amean     12       6.1277 (  0.00%)   5.9290 (  3.24%)
Amean     21      10.0380 (  0.00%)   9.7623 (  2.75%)
Amean     30      14.1517 (  0.00%)  13.7513 (  2.83%)
Amean     48      24.7253 (  0.00%)  24.2287 (  2.01%)
Amean     79      43.9523 (  0.00%)  43.2330 (  1.64%)
Amean     96      54.5310 (  0.00%)  53.7650 (  1.40%)
tbench4 Throughput
                     vanilla            patched
Hmean     1       255.97 (  0.00%)   275.01 (  7.44%)
Hmean     2       511.60 (  0.00%)   544.27 (  6.39%)
Hmean     4       996.70 (  0.00%)  1006.57 (  0.99%)
Hmean     8      1646.46 (  0.00%)  1649.15 (  0.16%)
Hmean     16     2259.42 (  0.00%)  2274.35 (  0.66%)
Hmean     32     4725.48 (  0.00%)  4735.57 (  0.21%)
Hmean     64     4411.47 (  0.00%)  4400.05 ( -0.26%)
Hmean     96     4284.31 (  0.00%)  4267.39 ( -0.39%)
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Hao Jia <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2cf9ac40 | 11-Jan-2025 | Vincent Guittot <[email protected]>

sched/fair: Encapsulate set custom slice in a __setparam_fair() function
Similarly to dl, create a __setparam_fair() function to set parameters related to fair class and move it in the fair.c file.
No functional changes expected
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Phil Auld <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

66951e48 | 13-Jan-2025 | Peter Zijlstra <[email protected]>

sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE
Normally dequeue_entities() will continue to dequeue an empty group entity; except DELAY_DEQUEUE changes things -- it retains empty entities such that they might continue to compete and burn off some lag.
However, doing this results in update_cfs_group() re-computing the cgroup weight 'slice' for an empty group, which it (rightly) figures isn't much at all. This in turn means that the delayed entity is not competing at the expected weight. Worse, the very low weight causes its lag to be inflated, which combined with avg_vruntime() using scale_load_down(), leads to artifacts.
As such, don't adjust the weight for empty group entities and let them compete at their original weight.
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

6d71a9c6 | 09-Jan-2025 | Peter Zijlstra <[email protected]>

sched/fair: Fix EEVDF entity placement bug causing scheduling lag
I noticed this in my traces today:
  turbostat-1222 [006] d..2. 311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
      { weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } ->
      { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 }
  turbostat-1222 [006] d..2. 311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
      { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } ->
      { weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 }
Which is a weight transition: 1048576 -> 2 -> 1048576.
One would expect the lag to shoot out *AND* come back, notably:
  -1123 * 1048576 / 2      = -588775424
  -588775424 * 2 / 1048576 = -1123
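That expected round trip is easy to verify in user space:

  #include <stdio.h>
  #include <inttypes.h>

  int main(void)
  {
          int64_t lag = -1123;
          int64_t w_old = 1048576, w_new = 2;

          int64_t out  = lag * w_old / w_new;  /* lag shoots out... */
          int64_t back = out * w_new / w_old;  /* ...and must come back */

          printf("out = %" PRId64 ", back = %" PRId64 "\n", out, back);
          return 0;
  }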
Except the trace shows it is all off. Worse, subsequent cycles shoot it out further and further.
This made me have a very hard look at reweight_entity(), and specifically the ->on_rq case, which is more prominent with DELAY_DEQUEUE.
And indeed, it is all sorts of broken. While the computation of the new lag is correct, the computation for the new vruntime, using the new lag is broken for it does not consider the logic set out in place_entity().
With the below patch, I now see things like:
  migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
      { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
      { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
  migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
      { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
      { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }
Which isn't perfect yet, but much closer.
Reported-by: Doug Smythies <[email protected]>
Reported-by: Ingo Molnar <[email protected]>
Tested-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Link: https://lore.kernel.org/r/[email protected]

Revision tags: v6.13-rc4

3b2a793e | 20-Dec-2024 | Swapnil Sapkal <[email protected]>

sched: Report the different kinds of imbalances in /proc/schedstat
In /proc/schedstat, lb_imbalance reports the sum of imbalances discovered in sched domains with each call to sched_balance_rq(), which is not very useful because lb_imbalance does not mention whether the imbalance is due to load, utilization, nr_tasks or misfit_tasks. Remove this field from /proc/schedstat.
Currently there is no field in /proc/schedstat to report different types of imbalances. Introduce new fields in /proc/schedstat to report the total imbalances in load, utilization, nr_tasks or misfit_tasks.
Added fields to /proc/schedstat:
  - lb_imbalance_load:   Total imbalance due to load.
  - lb_imbalance_util:   Total imbalance due to utilization.
  - lb_imbalance_task:   Total imbalance due to number of tasks.
  - lb_imbalance_misfit: Total imbalance due to misfit tasks.
Signed-off-by: Swapnil Sapkal <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Shrikanth Hegde <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

c3856c9c | 20-Dec-2024 | Peter Zijlstra <[email protected]>

sched/fair: Cleanup in migrate_degrades_locality() to improve readability
migrate_degrades_locality() would return {1, 0, -1} to indicate, respectively, that migration would degrade locality, would improve locality, or would be ambivalent to locality improvements.
This patch improves readability by changing the return value to mean:
  * Any positive value - degrades locality
  * 0                  - migration doesn't affect locality
  * Any negative value - improves locality
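A hedged sketch of consuming the new convention (a hypothetical stand-in for the real function, which compares NUMA fault statistics):

  #include <stdio.h>

  /* stand-in: positive degrades locality, negative improves it, 0 is neutral */
  static long migrate_degrades_locality_model(long src_faults, long dst_faults)
  {
          return src_faults - dst_faults;
  }

  int main(void)
  {
          long delta = migrate_degrades_locality_model(10, 4);

          if (delta > 0)
                  puts("migration would degrade locality");
          else if (delta < 0)
                  puts("migration would improve locality");
          else
                  puts("migration is locality-neutral");
          return 0;
  }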
[Swapnil: Fixed comments around code and wrote commit log]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Not-yet-signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Swapnil Sapkal <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

a430d99e | 20-Dec-2024 | Peter Zijlstra <[email protected]>

sched/fair: Fix value reported by hot tasks pulled in /proc/schedstat
In /proc/schedstat, lb_hot_gained reports the number of hot tasks pulled during load balance. This value is incremented in can_migrate_task() if the task is migratable and hot. After incrementing the value, the load balancer can still decide not to migrate this task, leading to wrong accounting. Fix this by incrementing stats when hot tasks are detached. This issue only exists in detach_tasks() where we can decide not to migrate a hot task even if it is migratable. However, in detach_one_task(), we migrate it unconditionally.
[Swapnil: Handled the case where nr_failed_migrations_hot was not accounted properly and wrote commit log]
Fixes: d31980846f96 ("sched: Move up affinity check to mitigate useless redoing overhead")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reported-by: "Gautham R. Shenoy" <[email protected]>
Not-yet-signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Swapnil Sapkal <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

ee8118c1 | 19-Dec-2024 | Sebastian Andrzej Siewior <[email protected]>

sched/fair: Update comments after sched_tick() rename.
scheduler_tick() was renamed to sched_tick() in 86dd6c04ef9f2 ("sched/balancing: Rename scheduler_tick() => sched_tick()").
Update comments still referring to scheduler_tick.
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

Revision tags: v6.13-rc3

af98d8a3 | 12-Dec-2024 | Vishal Chourasia <[email protected]>

sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug
CPU controller limits are not properly enforced during CPU hotplug operations, particularly during CPU offline. When a CPU goes offline, throttled processes are unintentionally being unthrottled across all CPUs in the system, allowing them to exceed their assigned quota limits.
Consider the example below:
Assign a 6.25% bandwidth limit to a cgroup in an 8-CPU system, where the workload runs 8 threads for 20 seconds at 100% CPU utilization; the expected (user+sys) time = 10 seconds.
  $ cat /sys/fs/cgroup/test/cpu.max
  50000 100000

  $ ./ebizzy -t 8 -S 20          // non-hotplug case
  real 20.00 s
  user 10.81 s                   // intended behaviour
  sys   0.00 s

  $ ./ebizzy -t 8 -S 20          // hotplug case
  real 20.00 s
  user 14.43 s                   // Workload is able to run for 14 secs
  sys   0.00 s                   // when it should have only run for 10 secs
During CPU hotplug, scheduler domains are rebuilt and cpu_attach_domain is called for every active CPU to update the root domain. That ends up calling rq_offline_fair which un-throttles any throttled hierarchies.
Unthrottling should only occur for the CPU being hotplugged to allow its throttled processes to become runnable and get migrated to other CPUs.
With the current patch applied:

  $ ./ebizzy -t 8 -S 20          // hotplug case
  real 21.00 s
  user 10.16 s                   // intended behaviour
  sys   0.00 s
This also has another symptom: when a CPU goes offline, if the cfs_rq is not in a throttled state and the runtime_remaining still has plenty remaining, it gets reset to 1 here, causing the runtime_remaining of the cfs_rq to be quickly depleted.
Note: hotplug operation (online, offline) was performed in while(1) loop
v3: https://lore.kernel.org/all/[email protected]
v2: https://lore.kernel.org/all/[email protected]
v1: https://lore.kernel.org/all/[email protected]
Suggested-by: Zhang Qiao <[email protected]>
Signed-off-by: Vishal Chourasia <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
Tested-by: Madadi Vineeth Reddy <[email protected]>
Tested-by: Samir Mulani <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

c7f7e9c7 | 13-Dec-2024 | Vineeth Pillai (Google) <[email protected]>

sched/dlserver: Fix dlserver time accounting
dlserver time is accounted when:
  - dlserver is active and the dlserver proxies the cfs task.
  - dlserver is active but deferred and cfs task runs after being picked through the normal fair class pick.
dl_server_update() is called in two places to make sure that both of the above times are accounted for. But it doesn't check if dlserver is active or not. Now that we have the dl_server_active flag, we can consolidate dl_server_update() into one place; all we need to check is whether dlserver is active or not. When dlserver is active there are only two possible conditions:
  - dlserver is deferred.
  - cfs task is running on behalf of dlserver.
Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Signed-off-by: "Vineeth Pillai (Google)" <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marcel Ziswiler <[email protected]> # ROCK 5B
Link: https://lore.kernel.org/r/[email protected]

Revision tags: v6.13-rc2, v6.13-rc1

2a77e4be | 29-Nov-2024 | Peter Zijlstra <[email protected]>

sched/fair: Untangle NEXT_BUDDY and pick_next_task()
There are 3 sites using set_next_buddy() and only one is conditional on NEXT_BUDDY, the other two sites are unconditional; to note:
  - yield_to_task()
  - cgroup dequeue / pick optimization
However, having NEXT_BUDDY control both the wakeup-preemption and the picking side of things means it's nearly useless.
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]