Revision tags: v6.15, v6.15-rc7, v6.15-rc6

428dc9fc | 05-May-2025 | Tejun Heo <[email protected]>
sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator
BPF programs may call next() and destroy() on BPF iterators even after new() returns an error value (e.g. the bpf_for_each() macro ignores error returns from new()). bpf_iter_scx_dsq_new() could leave the iterator in an uninitialized state after an error return, causing bpf_iter_scx_dsq_next() to dereference garbage data. Make bpf_iter_scx_dsq_new() always clear $kit->dsq so that next() and destroy() become no-ops.

Signed-off-by: Tejun Heo <[email protected]>
Fixes: 650ba21b131e ("sched_ext: Implement DSQ iterator")
Cc: [email protected] # v6.12+
Acked-by: Andrea Righi <[email protected]>
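A minimal sketch of the pattern this fix describes, assuming a kernel-internal iterator struct that carries a dsq pointer; the helper names and body below are illustrative, not the actual ext.c code:

  /* Illustrative sketch only: clear the iterator before any early return so
   * that next()/destroy() are harmless no-ops even when new() fails. */
  int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, u64 flags)
  {
      struct bpf_iter_scx_dsq_kern *kit = (void *)it;

      kit->dsq = NULL;                    /* always initialized first */

      if (flags & ~__SCX_DSQ_ITER_USER_FLAGS)
          return -EINVAL;

      kit->dsq = find_user_dsq(dsq_id);   /* hypothetical lookup helper */
      if (!kit->dsq)
          return -ENOENT;

      return 0;
  }

  void *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it)
  {
      struct bpf_iter_scx_dsq_kern *kit = (void *)it;

      if (!kit->dsq)                      /* new() failed: nothing to do */
          return NULL;

      return dsq_iter_walk(kit);          /* hypothetical walk helper */
  }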
Revision tags: v6.15-rc5

e38be1c7 | 28-Apr-2025 | Andrea Righi <[email protected]>
sched_ext: Fix rq lock state in hotplug ops
The ops.cpu_online() and ops.cpu_offline() callbacks incorrectly assume that the rq involved in the operation is locked, which is not the case during hotplug, triggering the following warning:
WARNING: CPU: 1 PID: 20 at kernel/sched/sched.h:1504 handle_hotplug+0x280/0x340
Fix by not tracking the target rq as locked in the context of ops.cpu_online() and ops.cpu_offline().
Fixes: 18853ba782bef ("sched_ext: Track currently locked rq")
Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Tested-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Revision tags: v6.15-rc4

e7dcd130 | 25-Apr-2025 | Andrea Righi <[email protected]>
sched_ext: Remove duplicate BTF_ID_FLAGS definitions
Some kfuncs specific to the idle CPU selection policy are registered in both the scx_kfunc_ids_any and scx_kfunc_ids_idle blocks, even though they should only be defined in the latter.
Remove the duplicates from scx_kfunc_ids_any.
Fixes: 337d1b354a297 ("sched_ext: Move built-in idle CPU selection policy to a separate file")
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
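For illustration, the kind of duplication being removed looks roughly like the block below; the kfunc name chosen and the absence of registration flags are assumptions, not the exact lists touched by the patch:

  /* Illustrative only: an idle-policy kfunc listed in both BTF ID sets. */
  BTF_KFUNCS_START(scx_kfunc_ids_idle)
  BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
  BTF_KFUNCS_END(scx_kfunc_ids_idle)

  BTF_KFUNCS_START(scx_kfunc_ids_any)
  BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)   /* duplicate, removed */
  BTF_KFUNCS_END(scx_kfunc_ids_any)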
a11d6784 | 22-Apr-2025 | Andrea Righi <[email protected]>
sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
scx_bpf_cpuperf_set() can be used to set a performance target level on any CPU. However, it doesn't correctly acquire the corresponding rq lock, which may lead to unsafe behavior and trigger the following warning, due to the lockdep_assert_rq_held() check:
[   51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0
...
[   51.713836] Call trace:
[   51.713837]  scx_bpf_cpuperf_set+0x1a0/0x1e0 (P)
[   51.713839]  bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440
[   51.713841]  bpf__sched_ext_ops_init+0x54/0x8c
[   51.713843]  scx_ops_enable.constprop.0+0x2c0/0x10f0
[   51.713845]  bpf_scx_reg+0x18/0x30
[   51.713847]  bpf_struct_ops_link_create+0x154/0x1b0
[   51.713849]  __sys_bpf+0x1934/0x22a0

Fix by properly acquiring the rq lock when possible or raising an error if we try to operate on a CPU that is not the one currently locked.

Fixes: d86adb4fc0655 ("sched_ext: Add cpuperf support")
Signed-off-by: Andrea Righi <[email protected]>
Acked-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
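The rule this fix describes could look roughly like the sketch below; scx_locked_rq() anticipates the "Track currently locked rq" change further down, and the helper and field names are assumptions rather than the actual kernel code:

  /* Hedged sketch; helper names and structure are assumptions. */
  static void cpuperf_set_sketch(s32 cpu, u32 perf)
  {
      struct rq *rq = cpu_rq(cpu);
      struct rq *locked_rq = scx_locked_rq();   /* rq held by the running callback, if any */
      unsigned long flags;

      if (!locked_rq) {
          /* nothing is held: it's safe to lock the target rq here */
          raw_spin_rq_lock_irqsave(rq, flags);
          rq->scx.cpuperf_target = perf;
          raw_spin_rq_unlock_irqrestore(rq, flags);
      } else if (rq == locked_rq) {
          /* the target rq is the one already locked by the callback */
          rq->scx.cpuperf_target = perf;
      } else {
          /* touching another CPU's rq while holding a different one is unsafe */
          scx_ops_error("called with a different rq locked");
      }
  }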
18853ba7 | 22-Apr-2025 | Andrea Righi <[email protected]>
sched_ext: Track currently locked rq
Some kfuncs provided by sched_ext may need to operate on a struct rq, but they can be invoked from various contexts, specifically, different scx callbacks.
While some of these callbacks are invoked with a particular rq already locked, others are not. This makes it impossible for a kfunc to reliably determine whether it's safe to access a given rq, triggering potential bugs or unsafe behaviors, see for example [1].
To address this, track the currently locked rq whenever a sched_ext callback is invoked via SCX_CALL_OP*().
This allows kfuncs that need to operate on an arbitrary rq to retrieve the currently locked one and apply the appropriate action as needed.
[1] https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Acked-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
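A minimal sketch of the tracking idea, assuming a per-CPU pointer updated around each callback invocation; the macro shape and names are illustrative, not the actual SCX_CALL_OP*() definitions:

  /* Illustrative only: remember which rq, if any, is locked while a
   * sched_ext callback runs so that kfuncs can check it. */
  static DEFINE_PER_CPU(struct rq *, scx_locked_rq_state);

  #define SCX_CALL_OP(rq, op, args...)                 \
  do {                                                 \
      __this_cpu_write(scx_locked_rq_state, (rq));     \
      scx_ops.op(args);                                \
      __this_cpu_write(scx_locked_rq_state, NULL);     \
  } while (0)

  /* kfuncs ask which rq the current callback holds (NULL if none) */
  static struct rq *scx_locked_rq(void)
  {
      return __this_cpu_read(scx_locked_rq_state);
  }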
Revision tags: v6.15-rc3, v6.15-rc2

bc08b15b | 08-Apr-2025 | Tejun Heo <[email protected]>
sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation
SCX_OPS_HAS_CGROUP_WEIGHT was only used to suppress the missing cgroup weight support warnings. Now that the warnings are removed, the flag doesn't do anything. Mark it for deprecation and remove its usage from scx_flatcg.
v2: Actually include the scx_flatcg update.
Signed-off-by: Tejun Heo <[email protected]>
Suggested-and-reviewed-by: Andrea Righi <[email protected]>
e776b26e | 07-Apr-2025 | Tejun Heo <[email protected]>
sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings
sched_ext generates warnings when cpu.weight / cpu.idle are set to non-default values if the BPF scheduler doesn't implement weight support. These warnings don't provide much value while adding constant annoyance. A BPF scheduler may not implement any particular behavior and there's nothing particularly special about missing cgroup weight support. Drop the warnings.
Signed-off-by: Tejun Heo <[email protected]>
47068309 | 08-Apr-2025 | Breno Leitao <[email protected]>
sched_ext: Use kvzalloc for large exit_dump allocation
Replace kzalloc with kvzalloc for the exit_dump buffer allocation, which can require large contiguous memory depending on the implementation. This change prevents allocation failures by allowing the system to fall back to vmalloc when contiguous memory allocation fails.
Since this buffer is only used for debugging purposes, physical memory contiguity is not required, making vmalloc a suitable alternative.
Cc: [email protected]
Fixes: 07814a9439a3b0 ("sched_ext: Print debug dump after an error exit")
Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Breno Leitao <[email protected]>
Acked-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
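As a rough illustration of the allocation change (variable names are placeholders):

  /* kvzalloc() falls back to vmalloc() when a large physically contiguous
   * allocation isn't available; such a buffer is freed with kvfree(). */
  char *dump_buf = kvzalloc(dump_len, GFP_KERNEL);   /* was: kzalloc() */

  if (!dump_buf)
      return -ENOMEM;
  /* ... fill and emit the debug dump ... */
  kvfree(dump_buf);                                  /* was: kfree() */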
Revision tags: v6.15-rc1

f0c6eab5 | 25-Mar-2025 | Andrea Righi <[email protected]>
sched_ext: initialize built-in idle state before ops.init()
A BPF scheduler may want to use the built-in idle cpumasks in ops.init() before the scheduler is fully initialized, either directly or through a BPF timer for example.
However, this would result in an error, since the idle state has not been properly initialized yet.
This can be easily verified by modifying scx_simple to call scx_bpf_get_idle_cpumask() in ops.init():
$ sudo scx_simple

DEBUG DUMP
===========================================================================

scx_simple[121] triggered exit kind 1024:
  runtime error (built-in idle tracking is disabled)
...

Fix this by properly initializing the idle state before ops.init() is called. With this change applied:

$ sudo scx_simple
local=2 global=0
local=19 global=11
local=23 global=11
...

Fixes: d73249f88743d ("sched_ext: idle: Make idle static keys private")
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Joel Fernandes <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
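A hypothetical BPF-side snippet, loosely modeled on scx_simple, showing the use case described above (querying the built-in idle cpumask from ops.init()); the callback name and the printout are illustrative only:

  s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
  {
      const struct cpumask *idle = scx_bpf_get_idle_cpumask();

      /* before the fix this triggered "built-in idle tracking is disabled" */
      bpf_printk("idle CPUs at init: %u", bpf_cpumask_weight(idle));
      scx_bpf_put_idle_cpumask(idle);

      return 0;
  }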
a8897ed8 | 25-Mar-2025 | Jake Hillion <[email protected]>
sched_ext: create_dsq: Return -EEXIST on duplicate request
create_dsq and therefore the scx_bpf_create_dsq kfunc currently silently ignore duplicate entries. As a sched_ext scheduler is creating each DSQ for a different purpose this is surprising behaviour.
Replace rhashtable_insert_fast(), which ignores duplicates, with rhashtable_lookup_insert_fast(), which reports duplicates (though it doesn't return their value). The rest of the code is structured correctly and this now returns -EEXIST.
Tested by adding an extra scx_bpf_create_dsq() to scx_simple. Previously this was ignored; now init fails with a -17 code. Also ran scx_lavd, which continued to work well.

Signed-off-by: Jake Hillion <[email protected]>
Acked-by: Andrea Righi <[email protected]>
Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
Cc: [email protected] # v6.12+
Signed-off-by: Tejun Heo <[email protected]>
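A sketch of the substitution described above; the hash table, field, and params names are assumptions, not the exact ext.c code:

  ret = rhashtable_lookup_insert_fast(&dsq_hash, &dsq->hash_node,
                                      dsq_hash_params);
  if (ret) {
      /* -EEXIST when a DSQ with this ID is already registered */
      kfree(dsq);    /* hypothetical cleanup of the just-allocated DSQ */
      return ret;
  }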
Revision tags: v6.14

f7d2728c | 17-Mar-2025 | Ingo Molnar <[email protected]>
sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
The scheduler has this special SCHED_WARN_ON() facility that depends on CONFIG_SCHED_DEBUG.
Since CONFIG_SCHED_DEBUG is getting removed, convert SCHED_WARN_ON() to WARN_ON_ONCE().
Note that the warning output isn't 100% equivalent:
  #define SCHED_WARN_ON(x)  WARN_ONCE(x, #x)
Because SCHED_WARN_ON() would output the 'x' condition as well, while WARN_ON_ONCE() will only show a backtrace.
Hopefully these are rare enough to not really matter.
If it does, we should probably introduce a new WARN_ON() variant that outputs the condition in stringified form, or improve WARN_ON() itself.
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Shrikanth Hegde <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
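For illustration, a "stringified condition" variant as alluded to in the last paragraph could look like the macro below; it is hypothetical and not something this patch adds:

  /* Hypothetical variant that keeps the stringified condition in the output. */
  #define WARN_ON_ONCE_STR(x)    WARN_ONCE((x), "%s\n", #x)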
Revision tags: v6.14-rc7

e4855fc9 | 14-Mar-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Refactor scx_select_cpu_dfl()
Make scx_select_cpu_dfl() more consistent with the other idle-related APIs by returning a negative value when an idle CPU isn't found.
No functional changes, this is purely a refactoring.
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
c414c217 | 14-Mar-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Honor idle flags in the built-in idle selection policy
Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(), to enforce strict selection criteria, such as selecting an idle CPU strictly within @prev_cpu's node or choosing only a fully idle SMT core.
This functionality will be exposed through a dedicated kfunc in a separate patch.
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
97e13ecb | 13-Mar-2025 | Andrea Righi <[email protected]>
sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()
scx_bpf_reenqueue_local() can be invoked from ops.cpu_release() to give tasks that are queued to the local DSQ a chance to migrate to other CPUs, when a CPU is taken by a higher scheduling class.
However, there is no point re-enqueuing tasks that can only run on that particular CPU, as they would simply be re-added to the same local DSQ without any benefit.
Therefore, skip per-CPU tasks in scx_bpf_reenqueue_local().
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
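A hedged sketch of the check this change describes; the exact condition used by the kernel and the surrounding loop over the local DSQ are assumptions:

  /* A task that can only run on this CPU would just be re-added to the same
   * local DSQ, so re-enqueueing it is pointless. */
  static bool reenqueue_worthwhile(struct task_struct *p)
  {
      return p->nr_cpus_allowed > 1 && !is_migration_disabled(p);
  }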
Revision tags: v6.14-rc6

71d07880 | 04-Mar-2025 | Changwoo Min <[email protected]>
sched_ext: Add trace point to track sched_ext core events
Add tracing support to track sched_ext core events (/sched_ext/sched_ext_event). This may be useful for debugging sched_ext schedulers that trigger a particular event.
The trace point can be used as other trace points, so it can be used in, for example, `perf trace` and BPF programs, as follows:
======
$> sudo perf trace -e sched_ext:sched_ext_event --filter 'name == "SCX_EV_ENQ_SLICE_DFL"'
======

======
struct tp_sched_ext_event {
    struct trace_entry ent;
    u32 __data_loc_name;
    s64 delta;
};

SEC("tracepoint/sched_ext/sched_ext_event")
int rtp_add_event(struct tp_sched_ext_event *ctx)
{
    char event_name[128];
    unsigned short offset = ctx->__data_loc_name & 0xFFFF;

    bpf_probe_read_str((void *)event_name, 128, (char *)ctx + offset);
    bpf_printk("name %s delta %lld", event_name, ctx->delta);
    return 0;
}
======
Signed-off-by: Changwoo Min <[email protected]>
Acked-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
038730dc | 04-Mar-2025 | Changwoo Min <[email protected]>
sched_ext: Change the event type from u64 to s64
The event count could be negative in the future, so change the event type from u64 to s64.
Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
9360dfe4 | 03-Mar-2025 | Andrea Righi <[email protected]>
sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel crash.
To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and trigger an scx error if an invalid CPU is specified.
Fixes: f0e1a0643a59b ("sched_ext: Implement BPF extensible scheduler class")
Cc: [email protected] # v6.12+
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
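The added validation could look roughly like the fragment below; scx_ops_error() is assumed from other sched_ext error paths rather than taken from the patch itself:

  /* reject CPUs outside the valid range before touching any per-CPU data */
  if (prev_cpu < 0 || prev_cpu >= nr_cpu_ids) {
      scx_ops_error("invalid prev_cpu %d", prev_cpu);
      *is_idle = false;
      return prev_cpu;
  }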
Revision tags: v6.14-rc5

8fef0a3b | 25-Feb-2025 | Tejun Heo <[email protected]>
sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()
a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()") added a workaround to handle the cases where pick_task_scx() is called without a preceding balance_scx(), which is due to a fair class bug where pick_task_fair() may return NULL after a true return from balance_fair().
The workaround detects when pick_task_scx() is called without preceding balance_scx() and emulates SCX_RQ_BAL_KEEP and triggers kicking to avoid stalling. Unfortunately, the workaround code was testing whether @prev was on SCX to decide whether to keep the task running. This is incorrect as the task may be on SCX but no longer runnable.
This could lead to a non-runnable task being returned from pick_task_scx(), which causes interesting confusions and failures. For example, a common failure mode is the task ending up in the (!on_rq && on_cpu) state, which can cause potential wakers to busy-loop and easily leads to deadlocks.
Fix it by testing whether @prev has SCX_TASK_QUEUED set. This makes @prev_on_scx only used in one place. Open code the usage and improve the comment while at it.
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Pat Cody <[email protected]>
Fixes: a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()")
Cc: [email protected] # v6.12+
Acked-by: Andrea Righi <[email protected]>
0e9b4c10 | 24-Feb-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Introduce scx_bpf_nr_node_ids()
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the system.
BPF schedulers can use this information together with the new node-aware kfuncs, for example to create per-node DSQs, validate node IDs, etc.
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
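A hypothetical BPF-side use of the new kfunc, creating one DSQ per NUMA node at init time (the callback name and DSQ numbering are illustrative):

  s32 BPF_STRUCT_OPS_SLEEPABLE(foo_init)
  {
      int node;

      bpf_for(node, 0, scx_bpf_nr_node_ids()) {
          s32 ret = scx_bpf_create_dsq(node, node);   /* DSQ ID == node ID */

          if (ret)
              return ret;
      }
      return 0;
  }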
Revision tags: v6.14-rc4, v6.14-rc3

48849271 | 14-Feb-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Per-node idle cpumasks
Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all the CPUs can create a really intense read/write activity on the single global cpumask.
Therefore, split the global cpumask into multiple per-NUMA node cpumasks to improve scalability and performance on large systems.
The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way concurrent access to the idle cpumask will be restricted within each NUMA node.
The split of multiple per-node idle cpumasks can be controlled using the SCX_OPS_BUILTIN_IDLE_PER_NODE flag.
By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global host-wide idle cpumask is used, maintaining the previous behavior.
NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will trigger an scx error, since there are no system-wide cpumasks.
= Test =
Hardware:
  - System: DGX B200
  - CPUs: 224 SMT threads (112 physical cores)
  - Processor: INTEL(R) XEON(R) PLATINUM 8570
  - 2 NUMA nodes

Scheduler:
  - scx_simple [1] (so that we can focus on the built-in idle selection policy and not on the scheduling policy itself)

Test:
  - Run a parallel kernel build `make -j $(nproc)` and measure the average elapsed time over 10 runs:

           avg time | stdev
           ---------+------
  before:   52.431s | 2.895
  after:    50.342s | 2.895
= Conclusion =
Splitting the global cpumask into multiple per-NUMA cpumasks helped to achieve a speedup of approximately +4% with this particular architecture and test case.
The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.
On smaller systems, I haven't noticed any measurable regressions or improvements with the same test (parallel kernel build) and scheduler (scx_simple).
Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs I observed an additional +2-2.5% performance improvement with the same test.
[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c
Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
0aaaf89d | 14-Feb-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows BPF schedulers to select between using a global flat idle cpumask or multiple per-node cpumasks.
This only introduces the flag and the mechanism to enable/disable this feature without affecting any scheduling behavior.
Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
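A hypothetical BPF-side struct_ops definition opting in to the new flag; the scheduler name and callbacks are placeholders:

  SCX_OPS_DEFINE(foo_ops,
                 .select_cpu = (void *)foo_select_cpu,
                 .enqueue    = (void *)foo_enqueue,
                 .init       = (void *)foo_init,
                 .exit       = (void *)foo_exit,
                 .flags      = SCX_OPS_BUILTIN_IDLE_PER_NODE,
                 .name       = "foo");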
d73249f8 | 14-Feb-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Make idle static keys private
Make all the static keys used by the idle CPU selection policy private to ext_idle.c. This avoids unnecessary exposure in headers and improves code encapsulation.
Cc: Yury Norov <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
ad3b301a | 14-Feb-2025 | Changwoo Min <[email protected]>
sched_ext: Provides a sysfs 'events' to expose core event counters
Add a sysfs entry at /sys/kernel/sched_ext/root/events to expose core event counters through the file system interface. Each line of the file shows the event name and its counter value.
In addition, the format of scx_dump_event() is adjusted as the event name gets longer.

Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
3539c641 | 12-Feb-2025 | Tejun Heo <[email protected]>
sched_ext: Implement SCX_OPS_ALLOW_QUEUED_WAKEUP
A task wakeup can be either processed on the waker's CPU or bounced to the wakee's previous CPU using an IPI (ttwu_queue). Bouncing to the wakee's CPU avoids the waker's CPU locking and accessing the wakee's rq which can be expensive across cache and node boundaries.
When the ttwu_queue path is taken, select_task_rq() and thus ops.select_cpu() may be skipped in some cases (racing against the wakee switching out). As this confused some BPF schedulers, there wasn't a good way for a BPF scheduler to tell whether idle CPU selection had been skipped, ops.enqueue() couldn't insert tasks into foreign local DSQs, and the performance difference on machines with simple topologies was minimal, sched_ext disabled ttwu_queue.
However, this optimization makes a noticeable difference on more complex topologies, and a BPF scheduler now has an easy way to tell whether ops.select_cpu() was skipped since 9b671793c7d9 ("sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED") and can insert tasks into foreign local DSQs since 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches").
Implement SCX_OPS_ALLOW_QUEUED_WAKEUP which allows BPF schedulers to choose to enable ttwu_queue optimization.
v2: Update the patch description and comment re. ops.select_cpu() being skipped in some cases as opposed to always as per Neel.
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Neel Natu <[email protected]>
Reported-by: Barret Rhoden <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Andrea Righi <[email protected]>
f5717c93 | 12-Feb-2025 | Chuyi Zhou <[email protected]>
sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx
Now when we use scx_bpf_task_cgroup() in ops.tick() to get the cgroup of the current task, the following error will occur:
scx_foo[3795244] triggered exit kind 1024: runtime error (called on a task not being operated on)
The reason is that we are using SCX_CALL_OP() instead of SCX_CALL_OP_TASK() when calling ops.tick(), which triggers the error during the subsequent scx_kf_allowed_on_arg_tasks() check.
SCX_CALL_OP_TASK() was first introduced in commit 36454023f50b ("sched_ext: Track tasks that are subjects of the in-flight SCX operation") to ensure the task's rq lock is held when accessing the task's sched_group. Since ops.tick() is marked as SCX_KF_TERMINAL and task_tick_scx() is protected by the rq lock, we can use SCX_CALL_OP_TASK() to avoid the above issue. Similarly, the same changes should be made for ops.disable() and ops.exit_task(), as they are also protected by task_rq_lock() and it's safe to access the task's task_group.
Fixes: 36454023f50b ("sched_ext: Track tasks that are subjects of the in-flight SCX operation")
Signed-off-by: Chuyi Zhou <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>