| 038730dc | 04-Mar-2025 |
Changwoo Min <[email protected]> |
sched_ext: Change the event type from u64 to s64
The event count could be negative in the future, so change the event type from u64 to s64.
Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
| b214b04d | 27-Feb-2025 |
Andrea Righi <[email protected]> |
tools/sched_ext: Provide a compatible helper for scx_bpf_events()
Introduce __COMPAT_scx_bpf_events() to use scx_bpf_events() in a compatible way also with kernels that don't provide this kfunc.
This also fixes the following error with scx_qmap when running on a kernel that does not provide scx_bpf_events():
  ; scx_bpf_events(&events, sizeof(events)); @ scx_qmap.bpf.c:777
  318: (b7) r2 = 72                     ; R2_w=72 async_cb
  319: <invalid kfunc call>
  kfunc 'scx_bpf_events' is referenced but wasn't resolved
Fixes: 9865f31d852a4 ("sched_ext: Add scx_bpf_events() and scx_read_event() for BPF schedulers")
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
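The compatibility pattern described in this commit (call the function only if the running kernel provides it) can be modeled in plain user-space C. This is a minimal sketch, not the actual macro: a NULL function pointer stands in for an unresolved kfunc, and `events_model`, `events_fn`, `events_available()` and `compat_events()` are hypothetical names. Zeroing the output when the function is missing is a choice made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's event counter struct. */
struct events_model {
	int64_t enq_slice_dfl;
};

/* Hypothetical optional function; NULL models "kernel lacks the kfunc". */
typedef void (*events_fn)(struct events_model *ev, size_t sz);

/* Models a kernel that does provide the function. */
static inline void events_available(struct events_model *ev, size_t sz)
{
	memset(ev, 0, sz);
	ev->enq_slice_dfl = 42;	/* arbitrary value for the demo */
}

/*
 * Call the optional function if it is present. Otherwise report zeroed
 * counters (an assumption of this sketch) instead of failing.
 */
static inline void compat_events(events_fn fn, struct events_model *ev,
				 size_t sz)
{
	if (fn)
		fn(ev, sz);
	else
		memset(ev, 0, sz);
}
```

Calling `compat_events(NULL, &ev, sizeof(ev))` leaves `ev` zeroed, while passing `events_available` fills it in, so the caller never needs to know which kind of kernel it is running on.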
|
| 5e3b6424 | 24-Feb-2025 |
Andrea Righi <[email protected]> |
tools/sched_ext: Provide consistent access to scx flags
Make all the SCX_OPS_* and SCX_PICK_IDLE_* flags available to the user-space part of the schedulers via the compat interface.
This allows schedulers / selftests to set all the ops flags in user-space, rather than having them split between BPF and user-space.
Signed-off-by: Andrea Righi <[email protected]>
Acked-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
| 0e9b4c10 | 24-Feb-2025 |
Andrea Righi <[email protected]> |
sched_ext: idle: Introduce scx_bpf_nr_node_ids()
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the system.
BPF schedulers can use this information together with the new node-aware kfuncs, for example to create per-node DSQs, validate node IDs, etc.
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
| 01059219 | 18-Feb-2025 |
Andrea Righi <[email protected]> |
sched_ext: idle: Introduce node-aware idle cpu kfunc helpers
Introduce a new kfunc to retrieve the node associated with a CPU:
  int scx_bpf_cpu_node(s32 cpu)
Add the following kfuncs to give BPF schedulers direct access to per-node idle cpumask information:
  const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
  const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
  s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags)
  s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags)
Moreover, trigger an scx error when any of the non-node-aware idle CPU kfuncs are used while SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.
Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
| 48849271 | 14-Feb-2025 |
Andrea Righi <[email protected]> |
sched_ext: idle: Per-node idle cpumasks
Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all the CPUs can create a really intense read/write activity on the single global cpumask.
Therefore, split the global cpumask into multiple per-NUMA node cpumasks to improve scalability and performance on large systems.
The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way concurrent access to the idle cpumask will be restricted within each NUMA node.
The split of multiple per-node idle cpumasks can be controlled using the SCX_OPS_BUILTIN_IDLE_PER_NODE flag.
By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global host-wide idle cpumask is used, maintaining the previous behavior.
NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will trigger an scx error, since there are no system-wide cpumasks.
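The per-node split described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel implementation: the topology constants, `struct idle_masks`, `cpu_to_node()`, `set_idle()` and `pick_idle_cpu_node()` are hypothetical names, CPUs are assumed to be numbered contiguously per node, and the real code uses atomic cpumask operations that are omitted here.

```c
#include <stdint.h>

#define NR_NODES	2	/* hypothetical topology: 2 NUMA nodes */
#define CPUS_PER_NODE	4	/* hypothetical: 4 CPUs per node */

/* One idle bitmask per node; bit i is the i-th CPU of that node. */
struct idle_masks {
	uint64_t node[NR_NODES];
};

/* Assumes CPUs are numbered contiguously within each node. */
static inline int cpu_to_node(int cpu)
{
	return cpu / CPUS_PER_NODE;
}

/* Mark a CPU idle or busy; only its own node's mask is touched. */
static inline void set_idle(struct idle_masks *m, int cpu, int idle)
{
	uint64_t bit = 1ULL << (cpu % CPUS_PER_NODE);

	if (idle)
		m->node[cpu_to_node(cpu)] |= bit;
	else
		m->node[cpu_to_node(cpu)] &= ~bit;
}

/*
 * Claim and return an idle CPU from the given node, or -1 if the node ID
 * is invalid or the node has no idle CPU. CPUs of other nodes are never
 * inspected, so they look busy from this node's point of view.
 */
static inline int pick_idle_cpu_node(struct idle_masks *m, int node)
{
	if (node < 0 || node >= NR_NODES)
		return -1;

	for (int i = 0; i < CPUS_PER_NODE; i++) {
		uint64_t bit = 1ULL << i;

		if (m->node[node] & bit) {
			m->node[node] &= ~bit;	/* claim the CPU */
			return node * CPUS_PER_NODE + i;
		}
	}
	return -1;
}
```

In this model a lookup on node 0 reads and writes only `m->node[0]`, which is the property the commit relies on to reduce cross-node cache-line traffic.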
= Test =
Hardware:
 - System: DGX B200
 - CPUs: 224 SMT threads (112 physical cores)
 - Processor: INTEL(R) XEON(R) PLATINUM 8570
 - 2 NUMA nodes
Scheduler:
 - scx_simple [1] (so that we can focus on the built-in idle selection policy and not on the scheduling policy itself)
Test:
 - Run a parallel kernel build `make -j $(nproc)` and measure the average elapsed time over 10 runs:
          avg time | stdev
          ---------+------
  before:  52.431s | 2.895
  after:   50.342s | 2.895
= Conclusion =
Splitting the global cpumask into multiple per-NUMA cpumasks helped to achieve a speedup of approximately +4% with this particular architecture and test case.
The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.
On smaller systems, I haven't noticed any measurable regressions or improvements with the same test (parallel kernel build) and scheduler (scx_simple).
Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs I observed an additional +2-2.5% performance improvement with the same test.
[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c
Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
| 2e2006c9 | 12-Feb-2025 |
Chuyi Zhou <[email protected]> |
sched_ext: Fix the incorrect bpf_list kfunc API in common.bpf.h.
BPF now only supports the bpf_list_push_{front,back}_impl kfuncs, not bpf_list_push_{front,back}.
This patch fixes this issue. Without this patch, if we use the bpf_list kfuncs in scx, the BPF verifier would complain:
  libbpf: extern (func ksym) 'bpf_list_push_back': not found in kernel or module BTFs
  libbpf: failed to load object 'scx_foo'
  libbpf: failed to load BPF skeleton 'scx_foo': -EINVAL
With this patch, the bpf_list kfuncs work as expected.
Signed-off-by: Chuyi Zhou <[email protected]>
Fixes: 2a52ca7c98960 ("sched_ext: Add scx_simple and scx_example_qmap example schedulers")
Signed-off-by: Tejun Heo <[email protected]>
|
| 372033ad | 09-Feb-2025 |
Changwoo Min <[email protected]> |
tools/sched_ext: Compatible testing of SCX_ENQ_CPU_SELECTED
This provides compatible testing of SCX_ENQ_CPU_SELECTED. More specifically, it handles two cases:
1. a BPF scheduler is compiled against vmlinux.h where SCX_ENQ_CPU_SELECTED is defined, but it runs on a kernel that does not have SCX_ENQ_CPU_SELECTED. In this case, the test result of 'enq_flags & SCX_ENQ_CPU_SELECTED' will always be false. That test result is semantically incorrect because the kernel before SCX_ENQ_CPU_SELECTED has never skipped select_task_rq_scx(), so the result should be true.
2. a BPF scheduler is compiled against vmlinux.h where SCX_ENQ_CPU_SELECTED is not defined. In this case, directly using SCX_ENQ_CPU_SELECTED causes compilation errors.
To hide such complexity, introduce __COMPAT_is_enq_cpu_selected(), which checks at runtime whether SCX_ENQ_CPU_SELECTED exists, using BPF CO-RE. This consists of three parts:
1. Add enum_defs.autogen.h, which has macros (HAVE_{enum name}) denoting whether SCX enums are defined in the vmlinux.h or not.
2. Implement __COMPAT_is_enq_cpu_selected(), which provides the test of SCX_ENQ_CPU_SELECTED in a compatible way.
3. Use __COMPAT_is_enq_cpu_selected() in scx_qmap.
Note that this is a sync of the relevant PR [1] in the scx repo.
[1] https://github.com/sched-ext/scx/pull/1314
Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
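The semantics this commit describes for the compatible test can be sketched in user-space C. This is an illustration only: in a BPF scheduler the existence check is done with BPF CO-RE, whereas here it is modeled by a plain boolean argument. `ENQ_CPU_SELECTED_MODEL` and `compat_is_enq_cpu_selected()` are hypothetical names, and the bit value is arbitrary, not the kernel's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit value chosen for this sketch, not the kernel's. */
#define ENQ_CPU_SELECTED_MODEL	(1ULL << 10)

/*
 * kernel_has_flag models the runtime existence check. On a kernel that
 * predates the flag, CPU selection was never skipped, so the answer is
 * always "selected"; otherwise the flag bit decides.
 */
static inline bool compat_is_enq_cpu_selected(uint64_t enq_flags,
					      bool kernel_has_flag)
{
	if (!kernel_has_flag)
		return true;

	return (enq_flags & ENQ_CPU_SELECTED_MODEL) != 0;
}
```

The first branch is what addresses case 1 in the message: a naive `enq_flags & SCX_ENQ_CPU_SELECTED` test would report false on an older kernel, while the wrapper reports true.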
|
| 46a0e161 | 09-Feb-2025 |
Tejun Heo <[email protected]> |
tool/sched_ext: Event counter dumping updates
- There's no need to dump event counters from both scx_qmap and scx_central. Drop counter dumping from scx_central.
- bpf_printk() implies a trailing new line and the explicit new line leads to double new lines. Drop the explicit new lines.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Changwoo Min <[email protected]>
|
| 38d65cd6 | 07-Feb-2025 |
Changwoo Min <[email protected]> |
sched_ext: Print an event, SCX_EV_ENQ_SLICE_DFL, in scx_qmap/central
Modify the scx_qmap and scx_central schedulers to print the SCX_EV_ENQ_SLICE_DFL event every second.
Signed-off-by: Changwoo Min <[email protected]>
Acked-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
| 029b6ce7 | 02-Feb-2025 |
Changwoo Min <[email protected]> |
sched_ext: Fix incorrect time delta calculation in time_delta()
When (s64)(after - before) > 0, the code returns the result of (s64)(after - before) > 0 while the intended result should be (s64)(after - before). That happens because the middle operand of the ternary operator was omitted incorrectly, returning the result of (s64)(after - before) > 0. Thus, add the middle operand -- (s64)(after - before) -- to return the correct time calculation.
Fixes: d07be814fc71 ("sched_ext: Add time helpers for BPF schedulers")
Signed-off-by: Changwoo Min <[email protected]>
Acked-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
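The bug described in this commit can be reproduced in a small user-space sketch. With the GNU "omitted middle operand" extension, `cond ?: 0` yields the value of `cond` itself, so the buggy helper returned 1 instead of the delta. The function names `time_delta_buggy()` and `time_delta_fixed()` are hypothetical, and the buggy variant is written out portably rather than with the extension.

```c
#include <stdint.h>

/*
 * Models the buggy form "(s64)(after - before) > 0 ? : 0". The omitted
 * middle operand makes the expression evaluate to the comparison result
 * (0 or 1), not to the delta.
 */
static inline int64_t time_delta_buggy(uint64_t after, uint64_t before)
{
	int cond = (int64_t)(after - before) > 0;

	return cond ? cond : 0;
}

/* Fixed form: return the delta itself, clamped to 0 when negative. */
static inline int64_t time_delta_fixed(uint64_t after, uint64_t before)
{
	return (int64_t)(after - before) > 0 ?
	       (int64_t)(after - before) : 0;
}
```

For `after = 10` and `before = 3`, the buggy variant yields 1 while the fixed one yields 7. Both yield 0 when `after` is earlier than `before`.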
|
| 0f130bc3 | 09-Jan-2025 |
Changwoo Min <[email protected]> |
sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now()
In the BPF schedulers that use bpf_ktime_get_ns() (scx_central and scx_flatcg), replace the bpf_ktime_get_ns() calls with scx_bpf_now().
Signed-off-by: Changwoo Min <[email protected]>
Acked-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
| d07be814 | 09-Jan-2025 |
Changwoo Min <[email protected]> |
sched_ext: Add time helpers for BPF schedulers
The following functions are added for BPF schedulers:
 - time_delta(after, before)
 - time_after(a, b)
 - time_before(a, b)
 - time_after_eq(a, b)
 - time_before_eq(a, b)
 - time_in_range(a, b, c)
 - time_in_range_open(a, b, c)
Signed-off-by: Changwoo Min <[email protected]>
Acked-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
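The comparison helpers listed above can be sketched in user-space C as follows. This is an assumption about their shape rather than a copy of the header: it uses the usual wraparound-safe idiom of subtracting as unsigned and interpreting the result as signed, so comparisons stay correct when a 64-bit timestamp wraps. The signed reinterpretation relies on two's-complement conversion, which common compilers provide.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a is strictly later than b, even across a wraparound. */
static inline bool time_after(uint64_t a, uint64_t b)
{
	return (int64_t)(b - a) < 0;
}

static inline bool time_before(uint64_t a, uint64_t b)
{
	return time_after(b, a);
}

/* True if a is later than or equal to b. */
static inline bool time_after_eq(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) >= 0;
}

static inline bool time_before_eq(uint64_t a, uint64_t b)
{
	return time_after_eq(b, a);
}

/* True if a lies in the closed interval [b, c]. */
static inline bool time_in_range(uint64_t a, uint64_t b, uint64_t c)
{
	return time_after_eq(a, b) && time_before_eq(a, c);
}

/* True if a lies in the half-open interval [b, c). */
static inline bool time_in_range_open(uint64_t a, uint64_t b, uint64_t c)
{
	return time_after_eq(a, b) && time_before(a, c);
}
```

With this idiom a timestamp of 5 taken just after the counter wraps still compares as later than `UINT64_MAX`, which a plain `a > b` would get wrong.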
|