History log of /linux-6.15/kernel/sched/ext.c (Results 1 – 25 of 195)
Revision    Date         Author              Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6
# 428dc9fc 05-May-2025 Tejun Heo <[email protected]>

sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator

BPF programs may call next() and destroy() on BPF iterators even after new()
returns an error value (e.g. bpf_for_each() macro ignores error returns from
new()). bpf_iter_scx_dsq_new() could leave the iterator in an uninitialized
state after an error return causing bpf_iter_scx_dsq_next() to dereference
garbage data. Make bpf_iter_scx_dsq_new() always clear $kit->dsq so that
next() and destroy() become no-ops.
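
For illustration, an abridged sketch of the always-initialize pattern
(based on the fixed function in ext.c; lookup helper and flag names follow
the kernel source, but this is a simplified sketch, not the exact upstream
code):

        __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it,
                                             u64 dsq_id, u64 flags)
        {
                struct bpf_iter_scx_dsq_kern *kit = (void *)it;

                /*
                 * Clear kit->dsq before any error check so that next()
                 * and destroy() are no-ops even when new() fails.
                 */
                kit->dsq = NULL;

                if (flags & ~__SCX_DSQ_ITER_USER_FLAGS)
                        return -EINVAL;

                kit->dsq = find_non_local_dsq(dsq_id);  /* DSQ lookup as in ext.c */
                if (!kit->dsq)
                        return -ENOENT;

                INIT_LIST_HEAD(&kit->cursor.node);
                kit->cursor.flags = SCX_DSQ_LNODE_ITER_CURSOR | flags;
                kit->seq = 0;
                return 0;
        }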

Signed-off-by: Tejun Heo <[email protected]>
Fixes: 650ba21b131e ("sched_ext: Implement DSQ iterator")
Cc: [email protected] # v6.12+
Acked-by: Andrea Righi <[email protected]>

Revision tags: v6.15-rc5
# e38be1c7 28-Apr-2025 Andrea Righi <[email protected]>

sched_ext: Fix rq lock state in hotplug ops

The ops.cpu_online() and ops.cpu_offline() callbacks incorrectly assume
that the rq involved in the operation is locked, which is not the case
during hotplug, triggering the following warning:

WARNING: CPU: 1 PID: 20 at kernel/sched/sched.h:1504 handle_hotplug+0x280/0x340

Fix by not tracking the target rq as locked in the context of
ops.cpu_online() and ops.cpu_offline().
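
A minimal sketch of the idea (the SCX_CALL_OP() rq argument follows
18853ba782be; the invocation below is illustrative, not the exact diff):

        /* Hotplug runs without the target CPU's rq locked, so don't claim it. */
        if (online && SCX_HAS_OP(cpu_online))
                SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, NULL, cpu);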

Fixes: 18853ba782bef ("sched_ext: Track currently locked rq")
Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Tested-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.15-rc4
# e7dcd130 25-Apr-2025 Andrea Righi <[email protected]>

sched_ext: Remove duplicate BTF_ID_FLAGS definitions

Some kfuncs specific to the idle CPU selection policy are registered in
both the scx_kfunc_ids_any and scx_kfunc_ids_idle blocks, even though
they should only be defined in the latter.

Remove the duplicates from scx_kfunc_ids_any.
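
The registration blocks involved have roughly this shape (abbreviated to
one kfunc each), with every kfunc now listed in exactly one block:

        BTF_KFUNCS_START(scx_kfunc_ids_idle)
        BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
        BTF_KFUNCS_END(scx_kfunc_ids_idle)

        BTF_KFUNCS_START(scx_kfunc_ids_any)
        /* generic kfuncs only; the idle-policy entries were removed from here */
        BTF_ID_FLAGS(func, scx_bpf_kick_cpu)
        BTF_KFUNCS_END(scx_kfunc_ids_any)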

Fixes: 337d1b354a297 ("sched_ext: Move built-in idle CPU selection policy to a separate file")
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

# a11d6784 22-Apr-2025 Andrea Righi <[email protected]>

sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()

scx_bpf_cpuperf_set() can be used to set a performance target level on
any CPU. However, it doesn't correctly acquire the corresponding rq
lock, which may lead to unsafe behavior and trigger the following
warning, due to the lockdep_assert_rq_held() check:

[ 51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0
...
[ 51.713836] Call trace:
[ 51.713837] scx_bpf_cpuperf_set+0x1a0/0x1e0 (P)
[ 51.713839] bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440
[ 51.713841] bpf__sched_ext_ops_init+0x54/0x8c
[ 51.713843] scx_ops_enable.constprop.0+0x2c0/0x10f0
[ 51.713845] bpf_scx_reg+0x18/0x30
[ 51.713847] bpf_struct_ops_link_create+0x154/0x1b0
[ 51.713849] __sys_bpf+0x1934/0x22a0

Fix by properly acquiring the rq lock when possible or raising an error
if we try to operate on a CPU that is not the one currently locked.
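
A sketch of the resulting locking rule (scx_locked_rq() is the tracked-rq
accessor from the companion patch; this is not the exact upstream code):

        struct rq_flags rf;
        struct rq *rq = cpu_rq(cpu);
        struct rq *locked_rq = scx_locked_rq();

        if (locked_rq && locked_rq != rq) {
                /* some other rq is locked; touching @rq here would be unsafe */
                scx_ops_error("rq lock for CPU %d not held", cpu);
                return;
        }

        if (!locked_rq)
                rq_lock_irqsave(rq, &rf);       /* safe to acquire it ourselves */
        /* ... update the performance target ... */
        if (!locked_rq)
                rq_unlock_irqrestore(rq, &rf);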

Fixes: d86adb4fc0655 ("sched_ext: Add cpuperf support")
Signed-off-by: Andrea Righi <[email protected]>
Acked-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

# 18853ba7 22-Apr-2025 Andrea Righi <[email protected]>

sched_ext: Track currently locked rq

Some kfuncs provided by sched_ext may need to operate on a struct rq,
but they can be invoked from various contexts, specifically, different
scx callbacks.

While some of these callbacks are invoked with a particular rq already
locked, others are not. This makes it impossible for a kfunc to reliably
determine whether it's safe to access a given rq, triggering potential
bugs or unsafe behaviors, see for example [1].

To address this, track the currently locked rq whenever a sched_ext
callback is invoked via SCX_CALL_OP*().

This allows kfuncs that need to operate on an arbitrary rq to retrieve
the currently locked one and apply the appropriate action as needed.

[1] https://lore.kernel.org/lkml/[email protected]/
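
A minimal sketch of the mechanism (per-CPU variable and wrapper shape
assumed; abbreviated from the actual SCX_CALL_OP*() macros):

        static DEFINE_PER_CPU(struct rq *, locked_rq);

        static inline void update_locked_rq(struct rq *rq)
        {
                /* callbacks invoked with a locked rq must actually hold it */
                if (rq)
                        lockdep_assert_rq_held(rq);
                __this_cpu_write(locked_rq, rq);
        }

        #define SCX_CALL_OP(mask, op, rq, args...)              \
        do {                                                    \
                update_locked_rq(rq);                           \
                scx_ops.op(args);                               \
                update_locked_rq(NULL); /* callback is done */  \
        } while (0)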

Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Acked-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.15-rc3, v6.15-rc2
# bc08b15b 08-Apr-2025 Tejun Heo <[email protected]>

sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation

SCX_OPS_HAS_CGROUP_WEIGHT was only used to suppress the missing cgroup
weight support warnings. Now that the warnings are removed, the flag doesn't
do anything. Mark it for deprecation and remove its usage from scx_flatcg.

v2: Actually include the scx_flatcg update.

Signed-off-by: Tejun Heo <[email protected]>
Suggested-and-reviewed-by: Andrea Righi <[email protected]>

# e776b26e 07-Apr-2025 Tejun Heo <[email protected]>

sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings

sched_ext generates warnings when cpu.weight / cpu.idle are set to
non-default values if the BPF scheduler doesn't implement weight support.
These warnings don't provide much value while adding constant annoyance. A
BPF scheduler may not implement any particular behavior and there's nothing
particularly special about missing cgroup weight support. Drop the warnings.

Signed-off-by: Tejun Heo <[email protected]>

# 47068309 08-Apr-2025 Breno Leitao <[email protected]>

sched_ext: Use kvzalloc for large exit_dump allocation

Replace kzalloc with kvzalloc for the exit_dump buffer allocation, which
can require large contiguous memory depending on the implementation.
This change prevents allocation failures by allowing the system to fall
back to vmalloc when contiguous memory allocation fails.

Since this buffer is only used for debugging purposes, physical memory
contiguity is not required, making vmalloc a suitable alternative.
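
The pattern is simply (sketch; the size comes from ops.exit_dump_len):

        char *buf = kvzalloc(exit_dump_len, GFP_KERNEL); /* kmalloc, then vmalloc fallback */

        if (!buf)
                return -ENOMEM;
        /* ... fill and report the dump ... */
        kvfree(buf);    /* works for both kmalloc- and vmalloc-backed memory */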

Cc: [email protected]
Fixes: 07814a9439a3b0 ("sched_ext: Print debug dump after an error exit")
Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Breno Leitao <[email protected]>
Acked-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.15-rc1
# f0c6eab5 25-Mar-2025 Andrea Righi <[email protected]>

sched_ext: initialize built-in idle state before ops.init()

A BPF scheduler may want to use the built-in idle cpumasks in ops.init()
before the scheduler is fully initialized, either directly or through a
BPF timer for example.

However, this would result in an error, since the idle state has not
been properly initialized yet.

This can be easily verified by modifying scx_simple to call
scx_bpf_get_idle_cpumask() in ops.init():

$ sudo scx_simple

DEBUG DUMP
===========================================================================

scx_simple[121] triggered exit kind 1024:
runtime error (built-in idle tracking is disabled)
...

Fix this by properly initializing the idle state before ops.init() is
called. With this change applied:

$ sudo scx_simple
local=2 global=0
local=19 global=11
local=23 global=11
...
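
The reproducing modification might look like this on the BPF side (a
sketch based on scx_simple's init callback; names follow the scx example
schedulers):

        s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
        {
                const struct cpumask *idle;

                /*
                 * Before the fix this call triggered "built-in idle
                 * tracking is disabled" because the idle state wasn't
                 * initialized yet.
                 */
                idle = scx_bpf_get_idle_cpumask();
                /* ... use the cpumask, e.g. to pre-build per-CPU state ... */
                scx_bpf_put_idle_cpumask(idle);
                return 0;
        }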

Fixes: d73249f88743d ("sched_ext: idle: Make idle static keys private")
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Joel Fernandes <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

# a8897ed8 25-Mar-2025 Jake Hillion <[email protected]>

sched_ext: create_dsq: Return -EEXIST on duplicate request

create_dsq and therefore the scx_bpf_create_dsq kfunc currently silently
ignore duplicate entries. As a sched_ext scheduler is creating each DSQ
for a different purpose this is surprising behaviour.

Replace rhashtable_insert_fast(), which ignores duplicates, with
rhashtable_lookup_insert_fast(), which reports duplicates (though it
doesn't return their value). The rest of the code is structured correctly,
so create_dsq now returns -EEXIST.

Tested by adding an extra scx_bpf_create_dsq to scx_simple. Previously
this was silently ignored; now init fails with -17 (-EEXIST). Also ran
scx_lavd, which continued to work well.
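
The relevant hunk has roughly this shape (sketch; hash-table and param
names follow ext.c):

        ret = rhashtable_lookup_insert_fast(&dsq_hash, &dsq->hash_node,
                                            dsq_hash_params);
        if (ret) {
                /* -EEXIST: a DSQ with this ID has already been created */
                kfree(dsq);
                return ERR_PTR(ret);
        }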

Signed-off-by: Jake Hillion <[email protected]>
Acked-by: Andrea Righi <[email protected]>
Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
Cc: [email protected] # v6.12+
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.14
# f7d2728c 17-Mar-2025 Ingo Molnar <[email protected]>

sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()

The scheduler has this special SCHED_WARN() facility that
depends on CONFIG_SCHED_DEBUG.

Since CONFIG_SCHED_DEBUG is getting removed, convert
SCHED_WARN() to WARN_ON_ONCE().

Note that the warning output isn't 100% equivalent:

  #define SCHED_WARN_ON(x)  WARN_ONCE(x, #x)

SCHED_WARN_ON() would output the 'x' condition as well, while
WARN_ON_ONCE() will only show a backtrace.

Hopefully these are rare enough to not really matter.

If it does, we should probably introduce a new WARN_ON()
variant that outputs the condition in stringified form,
or improve WARN_ON() itself.
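
Such a variant could be as simple as (hypothetical, sketching the
suggestion above):

        /* warn once and include the stringified condition in the output */
        #define WARN_ON_ONCE_STR(x)     WARN_ONCE((x), "%s\n", #x)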

Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Shrikanth Hegde <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Revision tags: v6.14-rc7
# e4855fc9 14-Mar-2025 Andrea Righi <[email protected]>

sched_ext: idle: Refactor scx_select_cpu_dfl()

Make scx_select_cpu_dfl() more consistent with the other idle-related
APIs by returning a negative value when an idle CPU isn't found.

No functional changes, this is purely a refactoring.
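
Callers now follow the usual negative-errno convention (illustrative; the
exact signature is abbreviated):

        s32 cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, 0);

        if (cpu >= 0)
                return cpu;     /* found an idle CPU */
        return prev_cpu;        /* none found; keep the previous CPU */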

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

# c414c217 14-Mar-2025 Andrea Righi <[email protected]>

sched_ext: idle: Honor idle flags in the built-in idle selection policy

Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(),
to enforce strict selection criteria, such as selecting an idle CPU
strictly within @prev_cpu's node or choosing only a fully idle SMT core.

This functionality will be exposed through a dedicated kfunc in a
separate patch.
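
For example, a caller could request a fully idle SMT core strictly within
@prev_cpu's node (flag names from the per-node idle series; the call site
is illustrative):

        cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags,
                                 SCX_PICK_IDLE_IN_NODE | SCX_PICK_IDLE_CORE);
        if (cpu < 0)
                cpu = prev_cpu; /* strict criteria not met */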

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

# 97e13ecb 13-Mar-2025 Andrea Righi <[email protected]>

sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()

scx_bpf_reenqueue_local() can be invoked from ops.cpu_release() to give
tasks that are queued to the local DSQ a chance to migrate to other
CPUs, when a CPU is taken by a higher scheduling class.

However, there is no point re-enqueuing tasks that can only run on that
particular CPU, as they would simply be re-added to the same local DSQ
without any benefit.

Therefore, skip per-CPU tasks in scx_bpf_reenqueue_local().
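
The filter amounts to (sketch of the loop body in
scx_bpf_reenqueue_local()):

        /* re-enqueueing a task that cannot migrate is pointless */
        if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
                continue;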

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.14-rc6
# 71d07880 04-Mar-2025 Changwoo Min <[email protected]>

sched_ext: Add trace point to track sched_ext core events

Add tracing support to track sched_ext core events
(/sched_ext/sched_ext_event). This may be useful for debugging sched_ext
schedulers that trigger a particular event.

The trace point can be used like any other trace point, for example from
`perf trace` or from a BPF program, as follows:

======
$> sudo perf trace -e sched_ext:sched_ext_event --filter 'name == "SCX_EV_ENQ_SLICE_DFL"'
======

======
struct tp_sched_ext_event {
        struct trace_entry ent;
        u32 __data_loc_name;
        s64 delta;
};

SEC("tracepoint/sched_ext/sched_ext_event")
int rtp_add_event(struct tp_sched_ext_event *ctx)
{
        char event_name[128];
        unsigned short offset = ctx->__data_loc_name & 0xFFFF;

        bpf_probe_read_str((void *)event_name, 128, (char *)ctx + offset);

        bpf_printk("name %s delta %lld", event_name, ctx->delta);
        return 0;
}
======

Signed-off-by: Changwoo Min <[email protected]>
Acked-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

# 038730dc 04-Mar-2025 Changwoo Min <[email protected]>

sched_ext: Change the event type from u64 to s64

The event count could be negative in the future,
so change the event type from u64 to s64.

Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

# 9360dfe4 03-Mar-2025 Andrea Righi <[email protected]>

sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()

If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids
range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel
crash.

To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and
trigger an scx error if an invalid CPU is specified.
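
The check is straightforward (sketch; ext.c has an ops_cpu_valid() helper
for this kind of validation):

        if (unlikely(prev_cpu < 0 || prev_cpu >= nr_cpu_ids)) {
                scx_ops_error("invalid prev_cpu %d", prev_cpu);
                return prev_cpu;
        }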

Fixes: f0e1a0643a59b ("sched_ext: Implement BPF extensible scheduler class")
Cc: [email protected] # v6.12+
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.14-rc5
# 8fef0a3b 25-Feb-2025 Tejun Heo <[email protected]>

sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()

a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called
without preceding balance_scx()") added a workaround to handle the cases
where pick_task_scx() is called without a preceding balance_scx(), which is
due to a fair class bug where pick_task_fair() may return NULL after a true
return from balance_fair().

The workaround detects when pick_task_scx() is called without preceding
balance_scx() and emulates SCX_RQ_BAL_KEEP and triggers kicking to avoid
stalling. Unfortunately, the workaround code was testing whether @prev was
on SCX to decide whether to keep the task running. This is incorrect as the
task may be on SCX but no longer runnable.

This could lead to a non-runnable task being returned from pick_task_scx(),
which causes confusing failures. e.g. a common failure mode is the task
ending up in the (!on_rq && on_cpu) state, which can cause potential wakers
to busy loop, easily leading to deadlocks.

Fix it by testing whether @prev has SCX_TASK_QUEUED set. This makes
@prev_on_scx only used in one place. Open code the usage and improve the
comment while at it.
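
The corrected test looks roughly like (sketch):

        /*
         * Keep @prev only if it is still queued on SCX, not merely an
         * SCX task: a task can be on SCX yet no longer runnable.
         */
        if (prev->scx.flags & SCX_TASK_QUEUED) {
                rq->scx.flags |= SCX_RQ_BAL_KEEP;       /* emulate balance_scx() */
                return prev;
        }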

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Pat Cody <[email protected]>
Fixes: a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()")
Cc: [email protected] # v6.12+
Acked-by: Andrea Righi <[email protected]>

# 0e9b4c10 24-Feb-2025 Andrea Righi <[email protected]>

sched_ext: idle: Introduce scx_bpf_nr_node_ids()

Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc
scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the
system.

BPF schedulers can use this information together with the new node-aware
kfuncs, for example to create per-node DSQs, validate node IDs, etc.
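
For example, a BPF scheduler could create one DSQ per node at init time
(illustrative; the init callback name is hypothetical):

        s32 BPF_STRUCT_OPS_SLEEPABLE(mysched_init)
        {
                int node, ret;

                for (node = 0; node < scx_bpf_nr_node_ids(); node++) {
                        ret = scx_bpf_create_dsq(node, node);   /* dsq_id == node */
                        if (ret)
                                return ret;
                }
                return 0;
        }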

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.14-rc4, v6.14-rc3
# 48849271 14-Feb-2025 Andrea Righi <[email protected]>

sched_ext: idle: Per-node idle cpumasks

Using a single global idle mask can lead to inefficiencies and a lot of
stress on the cache coherency protocol on large systems with multiple
NUMA nodes, since all CPUs can generate really intense read/write
activity on the single global cpumask.

Therefore, split the global cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on large systems.

The concept is that each cpumask will track only the idle CPUs within
its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
In this way concurrent access to the idle cpumask will be restricted
within each NUMA node.

The split into multiple per-node idle cpumasks can be controlled using the
SCX_OPS_BUILTIN_IDLE_PER_NODE flag.

By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global
host-wide idle cpumask is used, maintaining the previous behavior.

NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via
SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will
trigger an scx error, since there are no system-wide cpumasks.
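
Opting in from a BPF scheduler looks like this (illustrative struct_ops
definition; scheduler and callback names hypothetical):

        SEC(".struct_ops.link")
        struct sched_ext_ops mysched_ops = {
                .select_cpu     = (void *)mysched_select_cpu,
                .enqueue        = (void *)mysched_enqueue,
                .flags          = SCX_OPS_BUILTIN_IDLE_PER_NODE,
                .name           = "mysched",
        };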

= Test =

Hardware:
- System: DGX B200
- CPUs: 224 SMT threads (112 physical cores)
- Processor: INTEL(R) XEON(R) PLATINUM 8570
- 2 NUMA nodes

Scheduler:
- scx_simple [1] (so that we can focus on the built-in idle selection
  policy rather than on the scheduling policy itself)

Test:
- Run a parallel kernel build `make -j $(nproc)` and measure the average
elapsed time over 10 runs:

         avg time | stdev
         ---------+------
before:   52.431s | 2.895
after:    50.342s | 2.895

= Conclusion =

Splitting the global cpumask into multiple per-NUMA cpumasks helped to
achieve a speedup of approximately +4% with this particular architecture
and test case.

The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @
2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.

On smaller systems, I haven't noticed any measurable regressions or
improvements with the same test (parallel kernel build) and scheduler
(scx_simple).

Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs
I observed an additional +2-2.5% performance improvement with the same
test.

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c

Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

# 0aaaf89d 14-Feb-2025 Andrea Righi <[email protected]>

sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE

Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows
BPF schedulers to select between using a global flat idle cpumask or
multiple per-node cpumasks.

This only introduces the flag and the mechanism to enable/disable this
feature without affecting any scheduling behavior.

Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

# d73249f8 14-Feb-2025 Andrea Righi <[email protected]>

sched_ext: idle: Make idle static keys private

Make all the static keys used by the idle CPU selection policy private
to ext_idle.c. This avoids unnecessary exposure in headers and improves
code encapsulation.

Cc: Yury Norov <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

# ad3b301a 14-Feb-2025 Changwoo Min <[email protected]>

sched_ext: Provides a sysfs 'events' to expose core event counters

Add a sysfs entry at /sys/kernel/sched_ext/root/events to expose core
event counters through the file system interface. Each line of the file
shows the event name and its counter value.

In addition, the format of scx_dump_event() is adjusted as the event name
gets longer.

Signed-off-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

# 3539c641 12-Feb-2025 Tejun Heo <[email protected]>

sched_ext: Implement SCX_OPS_ALLOW_QUEUED_WAKEUP

A task wakeup can be either processed on the waker's CPU or bounced to the
wakee's previous CPU using an IPI (ttwu_queue). Bouncing to the wakee's CPU
avoids the waker's CPU locking and accessing the wakee's rq which can be
expensive across cache and node boundaries.

When the ttwu_queue path is taken, select_task_rq() and thus
ops.select_cpu() may be skipped in some cases (racing against the wakee
switching out). Because this confused some BPF schedulers, there was no
good way for a BPF scheduler to tell whether idle CPU selection had been
skipped, ops.enqueue() couldn't insert tasks into foreign local DSQs, and
the performance difference on machines with simple topologies was minimal,
sched_ext disabled ttwu_queue.

However, this optimization makes a noticeable difference on more complex
topologies, and a BPF scheduler now has an easy way to tell whether
ops.select_cpu() was skipped since 9b671793c7d9 ("sched_ext, scx_qmap: Add
and use SCX_ENQ_CPU_SELECTED") and can insert tasks into foreign local DSQs
since 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct
dispatches").

Implement SCX_OPS_ALLOW_QUEUED_WAKEUP which allows BPF schedulers to choose
to enable ttwu_queue optimization.
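
Opting in and handling a skipped ops.select_cpu() might look like this
(sketch; names hypothetical except the flags and kfuncs):

        void BPF_STRUCT_OPS(mysched_enqueue, struct task_struct *p, u64 enq_flags)
        {
                /* with queued wakeups, ops.select_cpu() may have been skipped */
                if (!(enq_flags & SCX_ENQ_CPU_SELECTED)) {
                        /* ... pick a CPU here or dispatch to a shared DSQ ... */
                }
                scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
        }

        SEC(".struct_ops.link")
        struct sched_ext_ops mysched_ops = {
                .enqueue        = (void *)mysched_enqueue,
                .flags          = SCX_OPS_ALLOW_QUEUED_WAKEUP,
                .name           = "mysched",
        };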

v2: Update the patch description and comment re. ops.select_cpu() being
skipped in some cases as opposed to always as per Neel.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Neel Natu <[email protected]>
Reported-by: Barret Rhoden <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Andrea Righi <[email protected]>

# f5717c93 12-Feb-2025 Chuyi Zhou <[email protected]>

sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx

Currently, using scx_bpf_task_cgroup() in ops.tick() to get the cgroup of
the current task triggers the following error:

scx_foo[3795244] triggered exit kind 1024:
runtime error (called on a task not being operated on)

The reason is that we are using SCX_CALL_OP() instead of SCX_CALL_OP_TASK()
when calling ops.tick(), which triggers the error during the subsequent
scx_kf_allowed_on_arg_tasks() check.

SCX_CALL_OP_TASK() was first introduced in commit 36454023f50b ("sched_ext:
Track tasks that are subjects of the in-flight SCX operation") to ensure the
task's rq lock is held when accessing the task's sched_group. Since
ops.tick() is marked as SCX_KF_TERMINAL and task_tick_scx() is protected by
the rq lock, we can use SCX_CALL_OP_TASK() to avoid the above issue.
Similarly, the same change should be made for ops.disable() and
ops.exit_task(), as they are also protected by task_rq_lock() and it's safe
to access the task's task_group there.
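
The fix is a one-line substitution at each call site, e.g. for the tick
path (sketch; the macro gained an rq argument only in the later locked-rq
tracking patch):

        -       SCX_CALL_OP(SCX_KF_REST, tick, curr);
        +       SCX_CALL_OP_TASK(SCX_KF_REST, tick, curr);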

Fixes: 36454023f50b ("sched_ext: Track tasks that are subjects of the in-flight SCX operation")
Signed-off-by: Chuyi Zhou <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
