History log of /linux-6.15/kernel/sched/fair.c (Results 1 – 25 of 1404)
Revision    Date    Author    Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4
# bbce3de7 25-Apr-2025 Omar Sandoval <[email protected]>

sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash

There is a code path in dequeue_entities() that can set the slice of a
sched_entity to U64_MAX, which sometimes results in a crash.

The offending case is when dequeue_entities() is called to dequeue a
delayed group entity, and then the entity's parent's dequeue is delayed.
In that case:

1. In the if (entity_is_task(se)) else block at the beginning of
dequeue_entities(), slice is set to
cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then
it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
2. The first for_each_sched_entity() loop dequeues the entity.
3. If the entity was its parent's only child, then the next iteration
tries to dequeue the parent.
4. If the parent's dequeue needs to be delayed, then it breaks from the
first for_each_sched_entity() loop _without updating slice_.
5. The second for_each_sched_entity() loop sets the parent's ->slice to
the saved slice, which is still U64_MAX.

This throws off subsequent calculations with potentially catastrophic
results. A manifestation we saw in production was:

6. In update_entity_lag(), se->slice is used to calculate limit, which
ends up as a huge negative number.
7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit
is negative, vlag > limit, so se->vlag is set to the same huge
negative number.
8. In place_entity(), se->vlag is scaled, which overflows and results in
another huge (positive or negative) number.
9. The adjusted lag is subtracted from se->vruntime, which increases or
decreases se->vruntime by a huge number.
10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which
incorrectly returns false because the vruntime is so far from the
other vruntimes on the queue, causing the
(vruntime - cfs_rq->min_vruntime) * load calculation to overflow.
11. Nothing appears to be eligible, so pick_eevdf() returns NULL.
12. pick_next_entity() tries to dereference the return value of
pick_eevdf() and crashes.

Dumping the cfs_rq states from the core dumps with drgn showed tell-tale
huge vruntime ranges and bogus vlag values, and I also traced se->slice
being set to U64_MAX on live systems (which was usually "benign" since
the rest of the runqueue needed to be in a particular state to crash).

Fix it in dequeue_entities() by always setting slice from the first
non-empty cfs_rq.

Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com
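
To make step 7 concrete, here is a minimal standalone sketch (an editor's
illustration, not kernel code; the numbers are hypothetical) of how a
kernel-style clamp() behaves once the limit goes negative: the lag gets
"clamped" to the bogus negative limit instead of being bounded by it.

#include <stdio.h>
#include <stdint.h>

#define min(a, b)           ((a) < (b) ? (a) : (b))
#define max(a, b)           ((a) > (b) ? (a) : (b))
#define clamp(val, lo, hi)  min(max(val, lo), hi)

int main(void)
{
        int64_t limit = -588775424;  /* hypothetical huge negative limit */
        int64_t vlag  = -1123;       /* an ordinary lag value            */

        /*
         * With limit < 0 the "upper" bound sits below the "lower" bound,
         * so clamp() hands back the huge negative limit itself.
         */
        printf("clamped vlag = %lld\n",
               (long long)clamp(vlag, -limit, limit));
        return 0;
}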



Revision tags: v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14
# dd5bdaf2 17-Mar-2025 Ingo Molnar <[email protected]>

sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional

All the big Linux distros enable CONFIG_SCHED_DEBUG, because
the various features it provides help not just with kernel
development, but with system administration and user-space
software development as well.

Reflect this reality and enable this functionality
unconditionally.

Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Shrikanth Hegde <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]



# 57903f72 17-Mar-2025 Ingo Molnar <[email protected]>

sched/debug: Make 'const_debug' tunables unconditional __read_mostly

With CONFIG_SCHED_DEBUG becoming unconditional, remove the
extra 'const_debug' indirection towards __read_mostly.

Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Shrikanth Hegde <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]



# f7d2728c 17-Mar-2025 Ingo Molnar <[email protected]>

sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()

The scheduler has this special SCHED_WARN_ON() facility that
depends on CONFIG_SCHED_DEBUG.

Since CONFIG_SCHED_DEBUG is getting removed, convert
SCHED_WARN_ON() to WARN_ON_ONCE().

Note that the warning output isn't 100% equivalent:

#define SCHED_WARN_ON(x) WARN_ONCE(x, #x)

Because SCHED_WARN_ON() would output the 'x' condition
as well, while WARN_ON_ONCE() will only show a backtrace.

Hopefully these are rare enough to not really matter.

If it does, we should probably introduce a new WARN_ON()
variant that outputs the condition in stringified form,
or improve WARN_ON() itself.

Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Shrikanth Hegde <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
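
As a hedged illustration of the "new WARN_ON() variant" idea mentioned
above (this macro does not exist in the kernel; it is only a sketch built
on the real WARN_ONCE(), which does accept a message):

/*
 * Hypothetical helper: warn once and also print the stringified
 * condition, the way SCHED_WARN_ON() used to.
 */
#define WARN_ON_ONCE_STR(cond)  WARN_ONCE(cond, "%s\n", #cond)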



Revision tags: v6.14-rc7, v6.14-rc6
# 3b4035dd 04-Mar-2025 Zecheng Li <[email protected]>

sched/fair: Fix potential memory corruption in child_cfs_rq_on_list

child_cfs_rq_on_list attempts to convert a 'prev' pointer to a cfs_rq.
This 'prev' pointer can originate from struct rq's leaf_cfs_rq_list,
making the conversion invalid and potentially leading to memory
corruption. Depending on the relative positions of leaf_cfs_rq_list and
the task group (tg) pointer within the struct, this can cause a memory
fault or access garbage data.

The issue arises in list_add_leaf_cfs_rq, where both
cfs_rq->leaf_cfs_rq_list and rq->leaf_cfs_rq_list are added to the same
leaf list. Also, rq->tmp_alone_branch can be set to rq->leaf_cfs_rq_list.

This adds a check `if (prev == &rq->leaf_cfs_rq_list)` after the main
conditional in child_cfs_rq_on_list. This ensures that the container_of
operation will convert a correct cfs_rq struct.

This check is sufficient because only cfs_rqs on the same CPU are added
to the list, so verifying the 'prev' pointer against the current rq's list
head is enough.

This fixes a potential memory corruption issue that, due to the current
struct layout, might not manifest as a crash but could lead to
unpredictable behavior when the layout changes.

Fixes: fdaba61ef8a2 ("sched/fair: Ensure that the CFS parent is added after unthrottling")
Signed-off-by: Zecheng Li <[email protected]>
Reviewed-and-tested-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
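
A sketch of the shape of the guarded helper, reconstructed from the
description above (field and helper names are taken from the commit text
and surrounding code, not verified against the tree):

static inline bool child_cfs_rq_on_list(struct cfs_rq *cfs_rq)
{
        struct cfs_rq *prev_cfs_rq;
        struct list_head *prev;
        struct rq *rq = rq_of(cfs_rq);

        if (cfs_rq->on_list)
                prev = cfs_rq->leaf_cfs_rq_list.prev;
        else
                prev = rq->tmp_alone_branch;

        /*
         * New check: 'prev' may be rq->leaf_cfs_rq_list itself, which is
         * not embedded in a cfs_rq, so container_of() must not be used.
         */
        if (prev == &rq->leaf_cfs_rq_list)
                return false;

        prev_cfs_rq = container_of(prev, struct cfs_rq, leaf_cfs_rq_list);

        return (prev_cfs_rq->tg->parent == cfs_rq->tg);
}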



Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2
# ee13da87 05-Feb-2025 Nam Cao <[email protected]>

sched: Switch to use hrtimer_setup()

hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.

Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.

Signed-off-by: Nam Cao <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/a55e849cba3c41b4c5708be6ea6be6f337d1a8fb.1738746821.git.namcao@linutronix.de
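
A before/after sketch of the conversion (the timer and callback names
here are placeholders, not the actual fields touched by this patch):

/* before: two-step init with an open coded callback assignment */
hrtimer_init(&my_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
my_timer.function = my_timer_fn;

/* after: hrtimer_setup() takes the callback and initializes everything */
hrtimer_setup(&my_timer, my_timer_fn, CLOCK_MONOTONIC, HRTIMER_MODE_REL);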



# d34e7980 10-Feb-2025 I Hsin Cheng <[email protected]>

sched/fair: Refactor can_migrate_task() to eliminate looping

The function "can_migrate_task()" uses "for_each_cpu_and" with an "if"
statement inside to find the destination cpu. This is the same logic as
finding the first set bit of the result of the bitwise-AND of
"env->dst_grpmask", "env->cpus" and "p->cpus_ptr".

Refactor it by using "cpumask_first_and_and()" to perform the
bitwise-AND of "env->dst_grpmask", "env->cpus" and "p->cpus_ptr" and
pick the first cpu within the intersection as the destination cpu, so
we can eliminate the need for looping and repeated branching.

After the refactoring this part of the code speeds up from ~115ns
to ~54ns, according to the test below.

The test was run 5 times and the results are shown in the following
table; the test script is pasted in the next section.

-------------------------------------------------------
|Old method| 130| 118| 115| 109| 106| avg ~115ns|
-------------------------------------------------------
|New method| 58| 55| 54| 48| 55| avg ~54ns|
-------------------------------------------------------

Signed-off-by: I Hsin Cheng <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
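
A hedged before/after sketch of the refactoring (the exact surrounding
code in can_migrate_task() may differ; the destination variable is a
placeholder):

/* before: walk the AND of two masks and test the third per cpu */
for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
        if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
                new_dst_cpu = cpu;
                break;
        }
}

/* after: AND all three masks at once and take the first set bit */
cpu = cpumask_first_and_and(env->dst_grpmask, env->cpus, p->cpus_ptr);
if (cpu < nr_cpu_ids)
        new_dst_cpu = cpu;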



# 563bc216 11-Feb-2025 Tianchen Ding <[email protected]>

sched/eevdf: Force propagating min_slice of cfs_rq when {en,de}queue tasks

When a task is enqueued and its parent cgroup se is already on_rq, this
parent cgroup se will not be enqueued again, and hence the root->min_slice
remains unchanged. The same issue happens when a task is dequeued and its
parent cgroup se has other runnable entities, and the parent cgroup se
will not be dequeued.

Force propagation of min_slice when the se doesn't need to be enqueued or
dequeued, ensuring the se hierarchy always gets the latest min_slice.

Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Tianchen Ding <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]



# 2ae891b8 08-Feb-2025 zihan zhou <[email protected]>

sched: Reduce the default slice to avoid tasks getting an extra tick

The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which
means that we have a default slice of:

0.75 for 1 cpu
1.50 up to 3 cpus
2.25 up to 7 cpus
3.00 for 8 cpus and above.

For HZ=250 and HZ=100, because of the tick accuracy, the runtime of
tasks is far higher than their slice.

For HZ=1000 with 8 cpus or more, the tick accuracy is already
satisfactory, but there is still an issue: tasks will get an extra
tick because the tick often arrives a little earlier than expected. In
this case, the task can only conclude at the next tick that it has
reached its deadline, and will run 1ms longer.

vruntime + sysctl_sched_base_slice = deadline
     |-----------|-----------|-----------|-----------|
          1ms         1ms         1ms         1ms
           ^           ^           ^           ^
         tick1       tick2       tick3       tick4 (nearly 4ms)

There are two reasons for tick error: clockevent precision and
CONFIG_IRQ_TIME_ACCOUNTING/CONFIG_PARAVIRT_TIME_ACCOUNTING. With
CONFIG_IRQ_TIME_ACCOUNTING every tick will be less than 1ms, but even
without it, because of clockevent precision, the tick is still often
less than 1ms.

In order to make scheduling more precise, we changed 0.75 to 0.70.
Using 0.70 instead of 0.75 should not change much for other configs
and would fix this issue:

0.70 for 1 cpu
1.40 up to 3 cpus
2.10 up to 7 cpus
2.80 for 8 cpus and above.

This does not guarantee that tasks can run the slice time accurately
every time, but occasionally running an extra tick has little impact.

Signed-off-by: zihan zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
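
The scaling above can be reproduced with a small standalone calculation
(an illustrative sketch, assuming the factor is 1 + ilog2(ncpus) with the
cpu count capped at 8, which is what the "8 cpus and above" rows imply):

#include <stdio.h>

static int ilog2_u(unsigned int n)      /* floor(log2(n)) for n >= 1 */
{
        int l = -1;
        while (n) { n >>= 1; l++; }
        return l;
}

int main(void)
{
        const double base_old = 0.75, base_new = 0.70;  /* msec */
        const unsigned int ncpus[] = { 1, 2, 4, 8, 64 };

        for (int i = 0; i < 5; i++) {
                unsigned int cpus = ncpus[i] > 8 ? 8 : ncpus[i];
                int factor = 1 + ilog2_u(cpus);

                printf("%2u cpus: old %.2f ms, new %.2f ms\n",
                       ncpus[i], base_old * factor, base_new * factor);
        }
        return 0;
}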



# f553741a 08-Feb-2025 zihan zhou <[email protected]>

sched: Cancel the slice protection of the idle entity

A woken non-idle entity should preempt an idle entity at any time,
but because of the slice protection of the idle entity, the non-idle
entity has to wait, so just cancel that protection.

This patch is aimed at minimizing the impact of SCHED_IDLE on
SCHED_NORMAL. For example, with a SCHED_IDLE task that sleeps for 1s
and then runs for 3ms, cyclictest running on the same cpu sees a
maximum latency of 3ms, which is caused by the slice protection of the
idle entity. That is unreasonable. With this patch, the cyclictest
latency under the same conditions is basically the same on a cpu with
idle processes as on an empty cpu.

[peterz: add helpers]
Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
Signed-off-by: zihan zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]



Revision tags: v6.14-rc1
# 1751f872 28-Jan-2025 Joel Granados <[email protected]>

treewide: const qualify ctl_tables where applicable

Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/infiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.

Created this by running an spatch followed by a sed command:
Spatch:
virtual patch

@
depends on !(file in "net")
disable optional_qualifier
@

identifier table_name != {
watchdog_hardlockup_sysctl,
iwcm_ctl_table,
ucma_ctl_table,
memory_allocation_profiling_sysctls,
loadpin_sysctl_table
};
@@

+ const
struct ctl_table table_name [] = { ... };

sed:
sed --in-place \
-e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
kernel/utsname_sysctl.c

Reviewed-by: Song Liu <[email protected]>
Acked-by: Steven Rostedt (Google) <[email protected]> # for kernel/trace/
Reviewed-by: Martin K. Petersen <[email protected]> # SCSI
Reviewed-by: Darrick J. Wong <[email protected]> # xfs
Acked-by: Jani Nikula <[email protected]>
Acked-by: Corey Minyard <[email protected]>
Acked-by: Wei Liu <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Reviewed-by: Bill O'Donnell <[email protected]>
Acked-by: Baoquan He <[email protected]>
Acked-by: Ashutosh Dixit <[email protected]>
Acked-by: Anna Schumaker <[email protected]>
Signed-off-by: Joel Granados <[email protected]>



Revision tags: v6.13
# 3429dd57 17-Jan-2025 K Prateek Nayak <[email protected]>

sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue

set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
entity is delayed irrespective of whether the entity corresponds to a
task or a cfs_rq.

Consider the following scenario:

        root
       /    \
      A      B (*) delayed since B is no longer eligible on root
      |      |
    Task0  Task1 <--- dequeue_task_fair() - task blocks

When Task1 blocks (dequeue_entity() for task's se returns true),
dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
hierarchy of Task1. However, when the sched_entity corresponding to
cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the
hierarchy too leading to both dequeue_entity() and set_delayed()
decrementing h_nr_runnable for the dequeue of the same task.

A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
dequeue_entities() like below:

cfs_rq->h_nr_runnable -= h_nr_runnable;
SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);

is consistently tripped when running wakeup intensive workloads like
hackbench in a cgroup.

This error is self-correcting since cfs_rqs are per-cpu and cannot
migrate. The entity is either picked for a full dequeue or is requeued
when a task wakes up below it. Both those paths call clear_delayed()
which again increments h_nr_runnable of the hierarchy without
considering if the entity corresponds to a task or not.

h_nr_runnable will eventually reflect the correct value however in the
interim, the incorrect values can still influence PELT calculation which
uses se->runnable_weight or cfs_rq->h_nr_runnable.

Since only delayed tasks take the early return path in
dequeue_entities() and enqueue_task_fair(), adjust the
h_nr_runnable in {set,clear}_delayed() only when a task is delayed as
this path skips the h_nr_* update loops and returns early.

For entities corresponding to cfs_rq, the h_nr_* update loop in the
caller will do the right thing.

Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Tested-by: Swapnil Sapkal <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
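
A sketch of the fix's shape in set_delayed(), reconstructed from the
description above (clear_delayed() gets the mirror-image change; not
verified against the tree):

static void set_delayed(struct sched_entity *se)
{
        se->sched_delayed = 1;

        /*
         * A delayed group entity carries no queued tasks of its own;
         * the caller's h_nr_* update loop already accounts for it.
         * Only walk the hierarchy when a *task* is delayed, since that
         * path returns early and skips those loops.
         */
        if (!entity_is_task(se))
                return;

        for_each_sched_entity(se) {
                struct cfs_rq *cfs_rq = cfs_rq_of(se);

                cfs_rq->h_nr_runnable--;
                if (cfs_rq_throttled(cfs_rq))
                        break;
        }
}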



Revision tags: v6.13-rc7, v6.13-rc6, v6.13-rc5
# 3229adbe 23-Dec-2024 K Prateek Nayak <[email protected]>

sched/fair: Do not compute overloaded status unnecessarily during lb

Only set sg_overloaded when computing sg_lb_stats() at the highest sched
domain since rd->overloaded status is updated only when load balancing
at the highest domain. While at it, move setting of sg_overloaded below
idle_cpu() check since an idle CPU can never be overloaded.

Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Reviewed-by: Shrikanth Hegde <[email protected]>
Link: https://lore.kernel.org/r/[email protected]



# 0ac1ee9e 23-Dec-2024 K Prateek Nayak <[email protected]>

sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb

Aggregate nr_numa_running and nr_preferred_running when load balancing
at NUMA domains only. While at it, also move the aggregation below the
idle_cpu() check since an idle CPU cannot have any preferred tasks.

Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Shrikanth Hegde <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
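
A hedged sketch of the idea inside the group-stats aggregation (guard on
the domain's SD_NUMA flag, below the idle_cpu() check; field names follow
the existing NUMA-balancing counters and are not verified against the
final patch):

#ifdef CONFIG_NUMA_BALANCING
        /* Only NUMA domains consume these counters. */
        if (env->sd->flags & SD_NUMA) {
                sgs->nr_numa_running      += rq->nr_numa_running;
                sgs->nr_preferred_running += rq->nr_preferred_running;
        }
#endif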



# 873199d2 23-Dec-2024 Hao Jia <[email protected]>

sched/core: Prioritize migrating eligible tasks in sched_balance_rq()

When the PLACE_LAG scheduling feature is enabled and
dst_cfs_rq->nr_queued is greater than 1, if a task is
ineligible (lag < 0) on the source cpu runqueue, it will
also be ineligible when it is migrated to the destination
cpu runqueue, because place_entity() keeps the task's
original equivalent lag. So if the task was ineligible
before, it will still be ineligible after migration.

So in sched_balance_rq(), we prioritize migrating eligible
tasks, and we soft-limit ineligible tasks, allowing them
to migrate only when nr_balance_failed is non-zero to
avoid load-balancing trying very hard to balance the load.

Below are some benchmark test results. From my test results,
this patch shows a slight improvement on hackbench.

Benchmark
=========

All of the benchmarks are done inside a normal cpu cgroup in a
clean environment with cpu turbo disabled, and test machine is:

Single NUMA machine model is 13th Gen Intel(R) Core(TM)
i7-13700, 12 Core/24 HT.

Based on master b86545e02e8c.

Results
=======

hackbench-process-pipes
vanilla patched
Amean 1 0.5837 ( 0.00%) 0.5733 ( 1.77%)
Amean 4 1.4423 ( 0.00%) 1.4503 ( -0.55%)
Amean 7 2.5147 ( 0.00%) 2.4773 ( 1.48%)
Amean 12 3.9347 ( 0.00%) 3.8880 ( 1.19%)
Amean 21 5.3943 ( 0.00%) 5.3873 ( 0.13%)
Amean 30 6.7840 ( 0.00%) 6.6660 ( 1.74%)
Amean 48 9.8313 ( 0.00%) 9.6100 ( 2.25%)
Amean 79 15.4403 ( 0.00%) 14.9580 ( 3.12%)
Amean 96 18.4970 ( 0.00%) 17.9533 ( 2.94%)

hackbench-process-sockets
vanilla patched
Amean 1 0.6297 ( 0.00%) 0.6223 ( 1.16%)
Amean 4 2.1517 ( 0.00%) 2.0887 ( 2.93%)
Amean 7 3.6377 ( 0.00%) 3.5670 ( 1.94%)
Amean 12 6.1277 ( 0.00%) 5.9290 ( 3.24%)
Amean 21 10.0380 ( 0.00%) 9.7623 ( 2.75%)
Amean 30 14.1517 ( 0.00%) 13.7513 ( 2.83%)
Amean 48 24.7253 ( 0.00%) 24.2287 ( 2.01%)
Amean 79 43.9523 ( 0.00%) 43.2330 ( 1.64%)
Amean 96 54.5310 ( 0.00%) 53.7650 ( 1.40%)

tbench4 Throughput
vanilla patched
Hmean 1 255.97 ( 0.00%) 275.01 ( 7.44%)
Hmean 2 511.60 ( 0.00%) 544.27 ( 6.39%)
Hmean 4 996.70 ( 0.00%) 1006.57 ( 0.99%)
Hmean 8 1646.46 ( 0.00%) 1649.15 ( 0.16%)
Hmean 16 2259.42 ( 0.00%) 2274.35 ( 0.66%)
Hmean 32 4725.48 ( 0.00%) 4735.57 ( 0.21%)
Hmean 64 4411.47 ( 0.00%) 4400.05 ( -0.26%)
Hmean 96 4284.31 ( 0.00%) 4267.39 ( -0.39%)

Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Hao Jia <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
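
A sketch of the soft limit as it might look in the loop that detaches
tasks during sched_balance_rq() (illustrative only; task_is_ineligible()
is a placeholder name, not the helper added by the patch):

        /*
         * Prefer eligible tasks: an ineligible task (lag < 0) stays
         * ineligible after migration, so only take it once regular
         * balancing has already failed.
         */
        if (task_is_ineligible(p) && !env->sd->nr_balance_failed)
                goto next;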



# 2cf9ac40 11-Jan-2025 Vincent Guittot <[email protected]>

sched/fair: Encapsulate set custom slice in a __setparam_fair() function

Similarly to dl, create a __setparam_fair() function to set parameters
related to the fair class and move it into fair.c.

No functional changes expected

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Phil Auld <[email protected]>
Link: https://lore.kernel.org/r/[email protected]



# 66951e48 13-Jan-2025 Peter Zijlstra <[email protected]>

sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE

Normally dequeue_entities() will continue to dequeue an empty group entity;
except DELAY_DEQUEUE changes things -- it retains empty entities such that they
might continue to compete and burn off some lag.

However, doing this results in update_cfs_group() re-computing the cgroup
weight 'slice' for an empty group, which it (rightly) figures isn't much at
all. This in turn means that the delayed entity is not competing at the
expected weight. Worse, the very low weight causes its lag to be inflated,
which combined with avg_vruntime() using scale_load_down(), leads to artifacts.

As such, don't adjust the weight for empty group entities and let them compete
at their original weight.

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
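
A sketch of the shape of the fix in update_cfs_group(), per the
description above (bail out for a group cfs_rq with no load so a delayed,
empty group entity keeps its original weight; not verified against the
tree):

static void update_cfs_group(struct sched_entity *se)
{
        struct cfs_rq *gcfs_rq = group_cfs_rq(se);

        /* Keep the original weight for (delayed) empty group entities. */
        if (!gcfs_rq || !gcfs_rq->load.weight)
                return;

        /* ... recompute and apply the group weight as before ... */
}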



# 6d71a9c6 09-Jan-2025 Peter Zijlstra <[email protected]>

sched/fair: Fix EEVDF entity placement bug causing scheduling lag

I noticed this in my traces today:

turbostat-1222 [006] d..2. 311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
{ weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } ->
{ weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 }
turbostat-1222 [006] d..2. 311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
{ weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } ->
{ weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 }

Which is a weight transition: 1048576 -> 2 -> 1048576.

One would expect the lag to shoot out *AND* come back, notably:

-1123*1048576/2 = -588775424
-588775424*2/1048576 = -1123

Except the trace shows it is all off. Worse, subsequent cycles shoot it
out further and further.

This made me have a very hard look at reweight_entity(), and
specifically the ->on_rq case, which is more prominent with
DELAY_DEQUEUE.

And indeed, it is all sorts of broken. While the computation of the new
lag is correct, the computation for the new vruntime, using the new lag
is broken for it does not consider the logic set out in place_entity().

With the below patch, I now see things like:

migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
{ weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
{ weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
{ weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
{ weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }

Which isn't perfect yet, but much closer.

Reported-by: Doug Smythies <[email protected]>
Reported-by: Ingo Molnar <[email protected]>
Tested-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Link: https://lore.kernel.org/r/[email protected]
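
The expected round trip above can be checked with a trivial standalone
calculation (a sketch using the numbers quoted from the trace):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        int64_t lag = -1123, w_old = 1048576, w_new = 2;

        int64_t out  = lag * w_old / w_new;  /* lag scaled to the new weight */
        int64_t back = out * w_new / w_old;  /* ... and scaled back again    */

        /* prints out=-588775424 back=-1123: the lag should come back */
        printf("out=%lld back=%lld\n", (long long)out, (long long)back);
        return 0;
}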



Revision tags: v6.13-rc4
# 3b2a793e 20-Dec-2024 Swapnil Sapkal <[email protected]>

sched: Report the different kinds of imbalances in /proc/schedstat

In /proc/schedstat, lb_imbalance reports the sum of imbalances
discovered in sched domains with each call to sched_balance_rq(), which is
not very useful because lb_imbalance does not mention whether the imbalance
is due to load, utilization, nr_tasks or misfit_tasks. Remove this field
from /proc/schedstat.

Currently there is no field in /proc/schedstat to report different types
of imbalances. Introduce new fields in /proc/schedstat to report the
total imbalances in load, utilization, nr_tasks or misfit_tasks.

Added fields to /proc/schedstat:
- lb_imbalance_load: Total imbalance due to load.
- lb_imbalance_util: Total imbalance due to utilization.
- lb_imbalance_task: Total imbalance due to number of tasks.
- lb_imbalance_misfit: Total imbalance due to misfit tasks.

Signed-off-by: Swapnil Sapkal <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Shrikanth Hegde <[email protected]>
Link: https://lore.kernel.org/r/[email protected]



# c3856c9c 20-Dec-2024 Peter Zijlstra <[email protected]>

sched/fair: Cleanup in migrate_degrades_locality() to improve readability

migrate_degrades_locality() would return {1, 0, -1} to indicate,
respectively, that migration would degrade locality, would improve
locality, or would be ambivalent to locality improvements.

This patch improves readability by changing the return value to mean:
* Any positive value degrades locality
* 0 migration doesn't affect locality
* Any negative value improves locality

[Swapnil: Fixed comments around code and wrote commit log]

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Not-yet-signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Swapnil Sapkal <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
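
A short caller-side sketch of the new convention (illustrative; the
actual call site around task_hot() may read differently):

        int hot;
        long degrades = migrate_degrades_locality(p, env);

        if (degrades > 0)               /* migration hurts NUMA locality */
                hot = 1;
        else if (degrades < 0)          /* migration improves locality   */
                hot = 0;
        else                            /* locality unaffected           */
                hot = task_hot(p, env); /* fall back to cache hotness    */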



# a430d99e 20-Dec-2024 Peter Zijlstra <[email protected]>

sched/fair: Fix value reported by hot tasks pulled in /proc/schedstat

In /proc/schedstat, lb_hot_gained reports the number of hot tasks pulled
during load balance. This value is incremented in can_migrate_task()
if the task is migratable and hot. After incrementing the value, the
load balancer can still decide not to migrate this task, leading to wrong
accounting. Fix this by incrementing the stat when hot tasks are detached.
This issue only exists in detach_tasks() where we can decide not to
migrate a hot task even if it is migratable. However, in detach_one_task(),
we migrate it unconditionally.

[Swapnil: Handled the case where nr_failed_migrations_hot was not accounted properly and wrote commit log]

Fixes: d31980846f96 ("sched: Move up affinity check to mitigate useless redoing overhead")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reported-by: "Gautham R. Shenoy" <[email protected]>
Not-yet-signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Swapnil Sapkal <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]



# ee8118c1 19-Dec-2024 Sebastian Andrzej Siewior <[email protected]>

sched/fair: Update comments after sched_tick() rename.

scheduler_tick() was renamed to sched_tick() in 86dd6c04ef9f2
("sched/balancing: Rename scheduler_tick() => sched_tick()").

Update comments still referring to scheduler_tick.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]



Revision tags: v6.13-rc3
# af98d8a3 12-Dec-2024 Vishal Chourasia <[email protected]>

sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug

CPU controller limits are not properly enforced during CPU hotplug
operations, particularly during CPU offline. When a CPU goes offline,
throttled processes are unintentionally being unthrottled across all CPUs
in the system, allowing them to exceed their assigned quota limits.

Consider the following example:

Assign a 6.25% bandwidth limit to a cgroup on an 8-CPU system, where the
workload runs 8 threads for 20 seconds at 100% CPU utilization; the
expected (user+sys) time is 10 seconds.

$ cat /sys/fs/cgroup/test/cpu.max
50000 100000

$ ./ebizzy -t 8 -S 20 // non-hotplug case
real 20.00 s
user 10.81 s // intended behaviour
sys 0.00 s

$ ./ebizzy -t 8 -S 20 // hotplug case
real 20.00 s
user 14.43 s // Workload is able to run for 14 secs
sys 0.00 s // when it should have only run for 10 secs

During CPU hotplug, scheduler domains are rebuilt and cpu_attach_domain
is called for every active CPU to update the root domain. That ends up
calling rq_offline_fair which un-throttles any throttled hierarchies.

Unthrottling should only occur for the CPU being hotplugged to allow its
throttled processes to become runnable and get migrated to other CPUs.

With the current patch applied,
$ ./ebizzy -t 8 -S 20 // hotplug case
real 21.00 s
user 10.16 s // intended behaviour
sys 0.00 s

There is also another symptom: when a CPU goes offline, if the cfs_rq
is not in a throttled state and its runtime_remaining still has plenty
left, it gets reset to 1, causing the cfs_rq's runtime_remaining to be
quickly depleted.

Note: hotplug operation (online, offline) was performed in while(1) loop

v3: https://lore.kernel.org/all/[email protected]
v2: https://lore.kernel.org/all/[email protected]
v1: https://lore.kernel.org/all/[email protected]
Suggested-by: Zhang Qiao <[email protected]>
Signed-off-by: Vishal Chourasia <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
Tested-by: Madadi Vineeth Reddy <[email protected]>
Tested-by: Samir Mulani <[email protected]>
Link: https://lore.kernel.org/r/[email protected]



# c7f7e9c7 13-Dec-2024 Vineeth Pillai (Google) <[email protected]>

sched/dlserver: Fix dlserver time accounting

dlserver time is accounted when:
- dlserver is active and the dlserver proxies the cfs task.
- dlserver is active but deferred and cfs task runs after being picked
through the normal fair class pick.

dl_server_update is called in two places to make sure that both the
above times are accounted for. But it doesn't check if dlserver is
active or not. Now that we have this dl_server_active flag, we can
consolidate dl_server_update into one place and all we need to check is
whether dlserver is active or not. When dlserver is active, there are only
two possible conditions:
- dlserver is deferred.
- cfs task is running on behalf of dlserver.

Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Signed-off-by: "Vineeth Pillai (Google)" <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marcel Ziswiler <[email protected]> # ROCK 5B
Link: https://lore.kernel.org/r/[email protected]



Revision tags: v6.13-rc2, v6.13-rc1
# 2a77e4be 29-Nov-2024 Peter Zijlstra <[email protected]>

sched/fair: Untangle NEXT_BUDDY and pick_next_task()

There are 3 sites using set_next_buddy() and only one is conditional
on NEXT_BUDDY, the other two sites are unconditional; to note:

- yield_to_task()
- cgroup dequeue / pick optimization

However, having NEXT_BUDDY control both the wakeup-preemption and the
picking side of things means it is nearly useless.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]


