Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4

18853ba7 | 22-Apr-2025 | Andrea Righi <[email protected]>
sched_ext: Track currently locked rq
Some kfuncs provided by sched_ext may need to operate on a struct rq, but they can be invoked from various contexts, specifically, different scx callbacks.
While some of these callbacks are invoked with a particular rq already locked, others are not. This makes it impossible for a kfunc to reliably determine whether it's safe to access a given rq, triggering potential bugs or unsafe behaviors; see for example [1].
To address this, track the currently locked rq whenever a sched_ext callback is invoked via SCX_CALL_OP*().
This allows kfuncs that need to operate on an arbitrary rq to retrieve the currently locked one and apply the appropriate action as needed.
[1] https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Acked-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
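A minimal sketch of the mechanism described above, assuming hypothetical helper and variable names (the actual SCX_CALL_OP*() plumbing in ext.c differs in detail): a per-CPU pointer records the rq locked around each callback invocation, and kfuncs consult it before touching an rq.

    /*
     * Sketch only: track which rq, if any, is locked while a sched_ext
     * callback runs. Names here are illustrative, not the kernel's.
     */
    static DEFINE_PER_CPU(struct rq *, scx_locked_rq_sketch);

    #define SCX_CALL_OP_LOCKED(rq, op, args...) do {            \
            this_cpu_write(scx_locked_rq_sketch, (rq));         \
            (op)(args);                                         \
            this_cpu_write(scx_locked_rq_sketch, NULL);         \
    } while (0)

    /* A kfunc can then test whether a given rq is safe to touch: */
    static bool scx_rq_is_locked_here(struct rq *rq)
    {
            return this_cpu_read(scx_locked_rq_sketch) == rq;
    }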
Revision tags: v6.15-rc3, v6.15-rc2, v6.15-rc1

f0c6eab5 | 25-Mar-2025 | Andrea Righi <[email protected]>
sched_ext: initialize built-in idle state before ops.init()
A BPF scheduler may want to use the built-in idle cpumasks in ops.init() before the scheduler is fully initialized, either directly or through a BPF timer for example.
However, this would result in an error, since the idle state has not been properly initialized yet.
This can be easily verified by modifying scx_simple to call scx_bpf_get_idle_cpumask() in ops.init():
$ sudo scx_simple
DEBUG DUMP ===========================================================================
scx_simple[121] triggered exit kind 1024: runtime error (built-in idle tracking is disabled) ...
Fix this by properly initializing the idle state before ops.init() is called. With this change applied:
$ sudo scx_simple
local=2  global=0
local=19 global=11
local=23 global=11
...
Fixes: d73249f88743d ("sched_ext: idle: Make idle static keys private")
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Joel Fernandes <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
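For reference, a sketch of the verification described above, modeled on scx_simple's init callback (the exact diff is not part of the commit message; SHARED_DSQ is scx_simple's dispatch queue id):

    s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
    {
            const struct cpumask *idle;

            /* Before this fix, calling the kfunc here failed with
             * "built-in idle tracking is disabled" because the idle
             * state was initialized only after ops.init(). */
            idle = scx_bpf_get_idle_cpumask();
            scx_bpf_put_idle_cpumask(idle);

            return scx_bpf_create_dsq(SHARED_DSQ, -1);
    }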
883cc35e | 26-Mar-2025 | Tejun Heo <[email protected]>
sched_ext: Remove a meaningless conditional goto in scx_select_cpu_dfl()
scx_select_cpu_dfl() has a meaningless conditional goto at the end. Remove it. No functional changes.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Andrea Righi <[email protected]>
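The commit message doesn't quote the code, but the removed pattern is of this general shape (illustrative only): a conditional jump to a label that is reached either way, so the branch changes nothing.

    if (cpu < 0)
            goto out;   /* meaningless: execution falls through to 'out' anyway */
    out:
            return cpu;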
37477d9e | 26-Mar-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Fix return code of scx_select_cpu_dfl()
Return -EBUSY when using %SCX_PICK_IDLE_CORE with scx_select_cpu_dfl() if a fully idle SMT core cannot be found, instead of falling back to @prev_cpu, which is not a fully idle SMT core in this case.
Fixes: c414c2171cd9e ("sched_ext: idle: Honor idle flags in the built-in idle selection policy")
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
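In outline, the fixed logic looks something like this (a sketch, not the actual function body; pick_idle_cpu() stands in for the internal idle-CPU search):

    cpu = pick_idle_cpu(p->cpus_ptr, node, flags);  /* hypothetical helper */
    if (cpu < 0) {
            /* With %SCX_PICK_IDLE_CORE the caller asked strictly for a
             * fully idle SMT core, so don't fall back to @prev_cpu. */
            if (flags & SCX_PICK_IDLE_CORE)
                    return -EBUSY;
            cpu = prev_cpu;
    }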
Revision tags: v6.14, v6.14-rc7

e4855fc9 | 14-Mar-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Refactor scx_select_cpu_dfl()
Make scx_select_cpu_dfl() more consistent with the other idle-related APIs by returning a negative value when an idle CPU isn't found.
No functional changes, this is purely a refactoring.
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
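From a kernel-internal caller's perspective, the refactored convention reads like the other idle helpers (a sketch; the exact parameter list is an assumption):

    s32 cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags);
    if (cpu < 0)
            cpu = prev_cpu;     /* no idle CPU was found */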
c414c217 | 14-Mar-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Honor idle flags in the built-in idle selection policy
Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(), to enforce strict selection criteria, such as selecting an idle CPU strictly within @prev_cpu's node or choosing only a fully idle SMT core.
This functionality will be exposed through a dedicated kfunc in a separate patch.
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
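For example, a caller could now request strict criteria like this (a sketch; the argument order is an assumption, and the flags are the %SCX_PICK_IDLE_* values named above):

    cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags,
                             SCX_PICK_IDLE_IN_NODE | SCX_PICK_IDLE_CORE);
    if (cpu < 0)
            cpu = prev_cpu;     /* nothing matched the strict criteria */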
Revision tags: v6.14-rc6, v6.14-rc5

fde7d647 | 25-Feb-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behavior
When %SCX_PICK_IDLE_IN_NODE is specified, scx_bpf_pick_any_cpu_node() should always return a CPU from the specified node, regardless of its idle state.
Also clarify this logic in the function documentation.
Fixes: 01059219b0cfd ("sched_ext: idle: Introduce node-aware idle cpu kfunc helpers")
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
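Usage per the documented semantics (signature as introduced by commit 01059219 below; the surrounding variables p and node are assumptions):

    /* Always returns a CPU from @node when %SCX_PICK_IDLE_IN_NODE is
     * set, whether or not that CPU is idle. */
    s32 cpu = scx_bpf_pick_any_cpu_node(p->cpus_ptr, node,
                                        SCX_PICK_IDLE_IN_NODE);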
Revision tags: v6.14-rc4

01059219 | 18-Feb-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Introduce node-aware idle cpu kfunc helpers
Introduce a new kfunc to retrieve the node associated with a CPU:
int scx_bpf_cpu_node(s32 cpu)
Add the following kfuncs to provide BPF schedulers direct access to per-node idle cpumasks information:
  const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
  const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
  s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags)
  s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags)
Moreover, trigger an scx error when any of the non-node aware idle CPU kfuncs are used when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.
Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
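Putting the new kfuncs together, a BPF scheduler's CPU pick might look like this (a sketch of plausible usage, not code from the commit; the callback name is made up):

    s32 BPF_STRUCT_OPS(example_select_cpu, struct task_struct *p,
                       s32 prev_cpu, u64 wake_flags)
    {
            int node = scx_bpf_cpu_node(prev_cpu);
            s32 cpu;

            /* Prefer an idle CPU on @prev_cpu's node. */
            cpu = scx_bpf_pick_idle_cpu_node(p->cpus_ptr, node, 0);
            if (cpu >= 0)
                    return cpu;
            return prev_cpu;
    }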
Revision tags: v6.14-rc3

48849271 | 14-Feb-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Per-node idle cpumasks
Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all the CPUs can create a really intense read/write activity on the single global cpumask.
Therefore, split the global cpumask into multiple per-NUMA node cpumasks to improve scalability and performance on large systems.
The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way concurrent access to the idle cpumask will be restricted within each NUMA node.
The split of multiple per-node idle cpumasks can be controlled using the SCX_OPS_BUILTIN_IDLE_PER_NODE flag.
By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global host-wide idle cpumask is used, maintaining the previous behavior.
NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will trigger an scx error, since there are no system-wide cpumasks.
= Test =
Hardware:
 - System: DGX B200
 - CPUs: 224 SMT threads (112 physical cores)
 - Processor: INTEL(R) XEON(R) PLATINUM 8570
 - 2 NUMA nodes

Scheduler:
 - scx_simple [1] (so that we can focus on the built-in idle selection policy and not on the scheduling policy itself)

Test:
 - Run a parallel kernel build `make -j $(nproc)` and measure the average elapsed time over 10 runs:

             avg time | stdev
             ---------+------
    before:  52.431s  | 2.895
    after:   50.342s  | 2.895
= Conclusion =
Splitting the global cpumask into multiple per-NUMA cpumasks helped to achieve a speedup of approximately +4% with this particular architecture and test case.
The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.
On smaller systems, I haven't noticed any measurable regressions or improvements with the same test (parallel kernel build) and scheduler (scx_simple).
Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs I observed an additional +2-2.5% performance improvement with the same test.
[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c
Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
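Conceptually, the split replaces one global pair of masks with a per-node array, along the lines of this sketch (the actual struct and variable names in ext_idle.c may differ):

    struct idle_masks_sketch {
            cpumask_var_t cpu;  /* idle CPUs within this node */
            cpumask_var_t smt;  /* fully idle SMT cores within this node */
    };

    /* One entry per NUMA node when SCX_OPS_BUILTIN_IDLE_PER_NODE is set,
     * a single host-wide entry otherwise. */
    static struct idle_masks_sketch **scx_idle_masks;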
0aaaf89d | 14-Feb-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows BPF schedulers to select between using a global flat idle cpumask or multiple per-node cpumasks.
This only introduces the flag and the mechanism to enable/disable this feature without affecting any scheduling behavior.
Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
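A scheduler opts in via its ops flags, roughly like this (a sketch using the SCX_OPS_DEFINE() convenience macro from the scx userspace headers; the scheduler and callback names are made up):

    SCX_OPS_DEFINE(example_ops,
                   .select_cpu = (void *)example_select_cpu,
                   .flags      = SCX_OPS_BUILTIN_IDLE_PER_NODE,
                   .name       = "example");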
d73249f8 | 14-Feb-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Make idle static keys private
Make all the static keys used by the idle CPU selection policy private to ext_idle.c. This avoids unnecessary exposure in headers and improves code encapsulation.
Cc: Yury Norov <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
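The keys in question are ordinary static branches; after this change a definition like the following lives in ext_idle.c rather than being declared in a shared header (the key name is taken as representative; the full set may differ):

    /* Private to ext_idle.c: */
    static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);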
Revision tags: v6.14-rc2, v6.14-rc1

337d1b35 | 27-Jan-2025 | Andrea Righi <[email protected]>
sched_ext: Move built-in idle CPU selection policy to a separate file
As ext.c is becoming quite large, move the idle CPU selection policy to separate files (ext_idle.c / ext_idle.h) for better code readability.
Moreover, group all the idle CPU selection kfuncs together into the same btf_kfunc_id_set block.
No functional changes, this is purely code reorganization.
Suggested-by: Yury Norov <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
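Grouping kfuncs in one btf_kfunc_id_set follows the standard BPF registration pattern, roughly as below (a sketch with a representative subset of the idle kfuncs; the set name and exact flags are assumptions):

    BTF_KFUNCS_START(scx_kfunc_ids_idle_sketch)
    BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
    BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
    BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
    BTF_KFUNCS_END(scx_kfunc_ids_idle_sketch)

    static const struct btf_kfunc_id_set scx_kfunc_set_idle_sketch = {
            .owner = THIS_MODULE,
            .set   = &scx_kfunc_ids_idle_sketch,
    };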