History log of /linux-6.15/kernel/sched/ext_idle.c (Results 1 – 12 of 12)
Revision   Date   Author   Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4
# 18853ba7 22-Apr-2025 Andrea Righi <[email protected]>

sched_ext: Track currently locked rq

Some kfuncs provided by sched_ext may need to operate on a struct rq,
but they can be invoked from various contexts, specifically, different
scx callbacks.

While some of these callbacks are invoked with a particular rq already
locked, others are not. This makes it impossible for a kfunc to reliably
determine whether it's safe to access a given rq, triggering potential
bugs or unsafe behaviors; see [1] for an example.

To address this, track the currently locked rq whenever a sched_ext
callback is invoked via SCX_CALL_OP*().

This allows kfuncs that need to operate on an arbitrary rq to retrieve
the currently locked one and apply the appropriate action as needed.

[1] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Acked-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
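
A minimal sketch of the mechanism (helper and variable names here are
illustrative, not necessarily those used in the patch): a per-CPU
pointer records the rq held across a callback so kfuncs can query it.

  static DEFINE_PER_CPU(struct rq *, scx_locked_rq_state);

  /* Set around SCX_CALL_OP*(); NULL means no rq is held. */
  static inline void update_locked_rq(struct rq *rq)
  {
          __this_cpu_write(scx_locked_rq_state, rq);
  }

  /* Kfuncs call this to learn which rq, if any, is already locked. */
  static inline struct rq *scx_locked_rq(void)
  {
          return __this_cpu_read(scx_locked_rq_state);
  }

A kfunc handed an arbitrary rq can then compare it against
scx_locked_rq() and either proceed directly or take the lock itself.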


Revision tags: v6.15-rc3, v6.15-rc2, v6.15-rc1
# f0c6eab5 25-Mar-2025 Andrea Righi <[email protected]>

sched_ext: initialize built-in idle state before ops.init()

A BPF scheduler may want to use the built-in idle cpumasks in ops.init()
before the scheduler is fully initialized, either directly or, for
example, through a BPF timer.

However, this would result in an error, since the idle state has not
been properly initialized yet.

This can be easily verified by modifying scx_simple to call
scx_bpf_get_idle_cpumask() in ops.init():

$ sudo scx_simple

DEBUG DUMP
===========================================================================

scx_simple[121] triggered exit kind 1024:
runtime error (built-in idle tracking is disabled)
...

Fix this by properly initializing the idle state before ops.init() is
called. With this change applied:

$ sudo scx_simple
local=2 global=0
local=19 global=11
local=23 global=11
...

Fixes: d73249f88743d ("sched_ext: idle: Make idle static keys private")
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Joel Fernandes <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
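
A hedged sketch of the reproducer described above (the callback name
and the bpf_printk() payload are illustrative; the idle-cpumask kfuncs
are the ones named in the message):

  #include <scx/common.bpf.h>

  char _license[] SEC("license") = "GPL";

  s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
  {
          const struct cpumask *idle = scx_bpf_get_idle_cpumask();

          /* Exited with "built-in idle tracking is disabled" before
           * the fix; sees the freshly initialized mask after it. */
          bpf_printk("first idle CPU: %u", bpf_cpumask_first(idle));

          scx_bpf_put_idle_cpumask(idle);
          return 0;
  }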


# 883cc35e 26-Mar-2025 Tejun Heo <[email protected]>

sched_ext: Remove a meaningless conditional goto in scx_select_cpu_dfl()

scx_select_cpu_dfl() has a meaningless conditional goto at the end. Remove
it. No functional changes.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Andrea Righi <[email protected]>


# 37477d9e 26-Mar-2025 Andrea Righi <[email protected]>

sched_ext: idle: Fix return code of scx_select_cpu_dfl()

Return -EBUSY when using %SCX_PICK_IDLE_CORE with scx_select_cpu_dfl()
if a fully idle SMT core cannot be found, instead of falling back to
@prev_cpu, which is not a fully idle SMT core in this case.

Fixes: c414c2171cd9e ("sched_ext: idle: Honor idle flags in the built-in idle selection policy")
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>


Revision tags: v6.14, v6.14-rc7
# e4855fc9 14-Mar-2025 Andrea Righi <[email protected]>

sched_ext: idle: Refactor scx_select_cpu_dfl()

Make scx_select_cpu_dfl() more consistent with the other idle-related
APIs by returning a negative value when an idle CPU isn't found.

No functional changes, this is purely a refactoring.

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>


# c414c217 14-Mar-2025 Andrea Righi <[email protected]>

sched_ext: idle: Honor idle flags in the built-in idle selection policy

Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(),
to enforce strict selection criteria, such as selecting an idle CPU
strictly within @prev_cpu's node or choosing only a fully idle SMT core.

This functionality will be exposed through a dedicated kfunc in a
separate patch.

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
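
Combined with the two fixes above, the in-kernel contract after this
change looks roughly as follows (a sketch assuming the post-series
signature, not code copied from the patch):

  s32 cpu;

  /* Strict criteria: accept only a fully idle SMT core. */
  cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, SCX_PICK_IDLE_CORE);
  if (cpu < 0) {
          /* e.g. -EBUSY: no fully idle core was found; the caller
           * must not assume @prev_cpu meets the criteria. */
  }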


Revision tags: v6.14-rc6, v6.14-rc5
# fde7d647 25-Feb-2025 Andrea Righi <[email protected]>

sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behavior

When %SCX_PICK_IDLE_IN_NODE is specified, scx_bpf_pick_any_cpu_node()
should always return a CPU from the specified node, regardless of its
idle state.

Also clarify this logic in the function documentation.

Fixes: 01059219b0cfd ("sched_ext: idle: Introduce node-aware idle cpu kfunc helpers")
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
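
An illustrative call showing the clarified semantics (p and node are
assumed to be in scope):

  /* With %SCX_PICK_IDLE_IN_NODE, a valid result is always a CPU of
   * @node, idle or not; there is no fallback to other nodes. */
  s32 cpu = scx_bpf_pick_any_cpu_node(p->cpus_ptr, node,
                                      SCX_PICK_IDLE_IN_NODE);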


Revision tags: v6.14-rc4
# 01059219 18-Feb-2025 Andrea Righi <[email protected]>

sched_ext: idle: Introduce node-aware idle cpu kfunc helpers

Introduce a new kfunc to retrieve the node associated to a CPU:

int scx_bpf_cpu_node(s32 cpu)

Add the following kfuncs to provide BPF schedulers direct access to
per-node idle cpumasks information:

const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed,
                               int node, u64 flags)
s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed,
                              int node, u64 flags)

Moreover, trigger an scx error when any of the non-node aware idle CPU
kfuncs are used when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.

Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
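
A usage sketch from the BPF side (scheduler and callback names are
illustrative; the kfuncs and flags are the ones introduced here): keep
a waking task within its previous CPU's node.

  #include <scx/common.bpf.h>

  char _license[] SEC("license") = "GPL";

  s32 BPF_STRUCT_OPS(numa_select_cpu, struct task_struct *p,
                     s32 prev_cpu, u64 wake_flags)
  {
          int node = scx_bpf_cpu_node(prev_cpu);
          s32 cpu;

          /* Prefer an idle CPU within prev_cpu's node... */
          cpu = scx_bpf_pick_idle_cpu_node(p->cpus_ptr, node,
                                           SCX_PICK_IDLE_IN_NODE);
          if (cpu >= 0)
                  return cpu;

          /* ...otherwise any allowed CPU of that node, idle or not. */
          cpu = scx_bpf_pick_any_cpu_node(p->cpus_ptr, node,
                                          SCX_PICK_IDLE_IN_NODE);
          return cpu >= 0 ? cpu : prev_cpu;
  }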


Revision tags: v6.14-rc3
# 48849271 14-Feb-2025 Andrea Righi <[email protected]>

sched_ext: idle: Per-node idle cpumasks

Using a single global idle mask can lead to inefficiencies and a lot of
stress on the cache coherency protocol on large systems with multiple
NUMA nodes, since all CPUs can generate intense read/write activity on
the single global cpumask.

Therefore, split the global cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on large systems.

The concept is that each cpumask will track only the idle CPUs within
its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
In this way, concurrent access to the idle cpumasks is confined within
each NUMA node.

The split of multiple per-node idle cpumasks can be controlled using the
SCX_OPS_BUILTIN_IDLE_PER_NODE flag.

By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global
host-wide idle cpumask is used, maintaining the previous behavior.

NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via
SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will
trigger an scx error, since there are no system-wide cpumasks.

= Test =

Hardware:
- System: DGX B200
- CPUs: 224 SMT threads (112 physical cores)
- Processor: INTEL(R) XEON(R) PLATINUM 8570
- 2 NUMA nodes

Scheduler:
- scx_simple [1] (so that we can focus on the built-in idle selection
policy rather than on the scheduling policy itself)

Test:
- Run a parallel kernel build `make -j $(nproc)` and measure the average
elapsed time over 10 runs:

         avg time | stdev
         ---------+------
before:   52.431s | 2.895
after:    50.342s | 2.895

= Conclusion =

Splitting the global cpumask into multiple per-NUMA cpumasks helped to
achieve a speedup of approximately +4% with this particular architecture
and test case.

The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @
2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.

On smaller systems, I haven't noticed any measurable regressions or
improvements with the same test (parallel kernel build) and scheduler
(scx_simple).

Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs
I observed an additional +2-2.5% performance improvement with the same
test.

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c

Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
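
The shape of the split, roughly (type and variable names here are
illustrative): one pair of idle masks per NUMA node replaces the
single global pair.

  struct scx_idle_cpus {
          cpumask_var_t cpu;   /* idle CPUs in this node */
          cpumask_var_t smt;   /* fully idle SMT cores in this node */
  };

  /* One entry per node with SCX_OPS_BUILTIN_IDLE_PER_NODE; a single
   * global instance otherwise, preserving the old behavior. */
  static struct scx_idle_cpus **scx_idle_node_masks;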


# 0aaaf89d 14-Feb-2025 Andrea Righi <[email protected]>

sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE

Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows
BPF schedulers to select between using a global flat idle cpumask or
multiple per-node cpumasks.

This only introduces the flag and the mechanism to enable/disable this
feature without affecting any scheduling behavior.

Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
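
Opting in from a BPF scheduler might look like this (a sketch; the
scheduler name and the elided callbacks are illustrative):

  SEC(".struct_ops.link")
  struct sched_ext_ops numa_ops = {
          /* ... callbacks ... */
          .flags = SCX_OPS_BUILTIN_IDLE_PER_NODE,
          .name  = "numa",
  };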


# d73249f8 14-Feb-2025 Andrea Righi <[email protected]>

sched_ext: idle: Make idle static keys private

Make all the static keys used by the idle CPU selection policy private
to ext_idle.c. This avoids unnecessary exposure in headers and improves
code encapsulation.

Cc: Yury Norov <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>


Revision tags: v6.14-rc2, v6.14-rc1
# 337d1b35 27-Jan-2025 Andrea Righi <[email protected]>

sched_ext: Move built-in idle CPU selection policy to a separate file

As ext.c is becoming quite large, move the idle CPU selection policy to
separate files (ext_idle.c / ext_idle.h) for better code readability.

Moreover, group all the idle CPU selection kfuncs together in the same
btf_kfunc_id_set block.

No functional changes, this is purely code reorganization.

Suggested-by: Yury Norov <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>