Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4

18853ba7 | 22-Apr-2025 | Andrea Righi <[email protected]>
sched_ext: Track currently locked rq
Some kfuncs provided by sched_ext may need to operate on a struct rq, but they can be invoked from various contexts, specifically, different scx callbacks.
While some of these callbacks are invoked with a particular rq already locked, others are not. This makes it impossible for a kfunc to reliably determine whether it's safe to access a given rq, triggering potential bugs or unsafe behaviors; see for example [1].
To address this, track the currently locked rq whenever a sched_ext callback is invoked via SCX_CALL_OP*().
This allows kfuncs that need to operate on an arbitrary rq to retrieve the currently locked one and apply the appropriate action as needed.
[1] https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Acked-by: Changwoo Min <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
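A minimal sketch of the mechanism described above, assuming hypothetical helper and variable names (the actual SCX_CALL_OP*() plumbing in ext.c differs in detail): a per-CPU pointer records the rq locked around each callback invocation, and kfuncs consult it before touching an rq.

    /*
     * Sketch only: track which rq, if any, is locked while a sched_ext
     * callback runs. Names here are illustrative, not the kernel's.
     */
    static DEFINE_PER_CPU(struct rq *, scx_locked_rq_sketch);

    #define SCX_CALL_OP_LOCKED(rq, op, args...) do {            \
            this_cpu_write(scx_locked_rq_sketch, (rq));         \
            (op)(args);                                         \
            this_cpu_write(scx_locked_rq_sketch, NULL);         \
    } while (0)

    /* A kfunc can then test whether a given rq is safe to touch: */
    static bool scx_rq_is_locked_here(struct rq *rq)
    {
            return this_cpu_read(scx_locked_rq_sketch) == rq;
    }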
Revision tags: v6.15-rc3, v6.15-rc2, v6.15-rc1

f0c6eab5 | 25-Mar-2025 | Andrea Righi <[email protected]>
sched_ext: initialize built-in idle state before ops.init()
A BPF scheduler may want to use the built-in idle cpumasks in ops.init() before the scheduler is fully initialized, either directly or through a BPF timer for example.
However, this would result in an error, since the idle state has not been properly initialized yet.
This can be easily verified by modifying scx_simple to call scx_bpf_get_idle_cpumask() in ops.init():
$ sudo scx_simple
DEBUG DUMP ===========================================================================
scx_simple[121] triggered exit kind 1024: runtime error (built-in idle tracking is disabled) ...
Fix this by properly initializing the idle state before ops.init() is called. With this change applied:
$ sudo scx_simple
local=2  global=0
local=19 global=11
local=23 global=11
...
Fixes: d73249f88743d ("sched_ext: idle: Make idle static keys private")
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Joel Fernandes <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
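For reference, a sketch of the verification described above, modeled on scx_simple's init callback (the exact diff is not part of the commit message; SHARED_DSQ is scx_simple's dispatch queue id):

    s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
    {
            const struct cpumask *idle;

            /* Before this fix, calling the kfunc here failed with
             * "built-in idle tracking is disabled" because the idle
             * state was initialized only after ops.init(). */
            idle = scx_bpf_get_idle_cpumask();
            scx_bpf_put_idle_cpumask(idle);

            return scx_bpf_create_dsq(SHARED_DSQ, -1);
    }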
883cc35e | 26-Mar-2025 | Tejun Heo <[email protected]>
sched_ext: Remove a meaningless conditional goto in scx_select_cpu_dfl()
scx_select_cpu_dfl() has a meaningless conditional goto at the end. Remove it. No functional changes.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Andrea Righi <[email protected]>
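The commit message doesn't quote the code, but the removed pattern is of this general shape (illustrative only): a conditional jump to a label that is reached either way, so the branch changes nothing.

    if (cpu < 0)
            goto out;   /* meaningless: execution falls through to 'out' anyway */
    out:
            return cpu;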
37477d9e | 26-Mar-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Fix return code of scx_select_cpu_dfl()
Return -EBUSY when using %SCX_PICK_IDLE_CORE with scx_select_cpu_dfl() if a fully idle SMT core cannot be found, instead of falling back to @prev_cpu, which is not a fully idle SMT core in this case.
Fixes: c414c2171cd9e ("sched_ext: idle: Honor idle flags in the built-in idle selection policy")
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
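In outline, the fixed logic looks something like this (a sketch, not the actual function body; pick_idle_cpu() stands in for the internal idle-CPU search):

    cpu = pick_idle_cpu(p->cpus_ptr, node, flags);  /* hypothetical helper */
    if (cpu < 0) {
            /* With %SCX_PICK_IDLE_CORE the caller asked strictly for a
             * fully idle SMT core, so don't fall back to @prev_cpu. */
            if (flags & SCX_PICK_IDLE_CORE)
                    return -EBUSY;
            cpu = prev_cpu;
    }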
Revision tags: v6.14, v6.14-rc7

e4855fc9 | 14-Mar-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Refactor scx_select_cpu_dfl()
Make scx_select_cpu_dfl() more consistent with the other idle-related APIs by returning a negative value when an idle CPU isn't found.
No functional changes, this is purely a refactoring.
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
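From a kernel-internal caller's perspective, the refactored convention reads like the other idle helpers (a sketch; the exact parameter list is an assumption):

    s32 cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags);
    if (cpu < 0)
            cpu = prev_cpu;     /* no idle CPU was found */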
c414c217 | 14-Mar-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Honor idle flags in the built-in idle selection policy
Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(), to enforce strict selection criteria, such as selecting an idle CPU strictly within @prev_cpu's node or choosing only a fully idle SMT core.
This functionality will be exposed through a dedicated kfunc in a separate patch.
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
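For example, a caller could now request strict criteria like this (a sketch; the argument order is an assumption, and the flags are the %SCX_PICK_IDLE_* values named above):

    cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags,
                             SCX_PICK_IDLE_IN_NODE | SCX_PICK_IDLE_CORE);
    if (cpu < 0)
            cpu = prev_cpu;     /* nothing matched the strict criteria */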
Revision tags: v6.14-rc6, v6.14-rc5

fde7d647 | 25-Feb-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behavior
When %SCX_PICK_IDLE_IN_NODE is specified, scx_bpf_pick_any_cpu_node() should always return a CPU from the specified node, regardless of its idle state.
Also clarify this logic in the function documentation.
Fixes: 01059219b0cfd ("sched_ext: idle: Introduce node-aware idle cpu kfunc helpers")
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
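Usage per the documented semantics (signature as introduced by commit 01059219 below; the surrounding variables p and node are assumptions):

    /* Always returns a CPU from @node when %SCX_PICK_IDLE_IN_NODE is
     * set, whether or not that CPU is idle. */
    s32 cpu = scx_bpf_pick_any_cpu_node(p->cpus_ptr, node,
                                        SCX_PICK_IDLE_IN_NODE);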
Revision tags: v6.14-rc4

01059219 | 18-Feb-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Introduce node-aware idle cpu kfunc helpers
Introduce a new kfunc to retrieve the node associated with a CPU:
int scx_bpf_cpu_node(s32 cpu)
Add the following kfuncs to provide BPF schedulers direct access to per-node idle cpumasks information:
  const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
  const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
  s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags)
  s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags)
Moreover, trigger an scx error when any of the non-node aware idle CPU kfuncs are used when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.
Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
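Putting the new kfuncs together, a BPF scheduler's CPU pick might look like this (a sketch of plausible usage, not code from the commit; the callback name is made up):

    s32 BPF_STRUCT_OPS(example_select_cpu, struct task_struct *p,
                       s32 prev_cpu, u64 wake_flags)
    {
            int node = scx_bpf_cpu_node(prev_cpu);
            s32 cpu;

            /* Prefer an idle CPU on @prev_cpu's node. */
            cpu = scx_bpf_pick_idle_cpu_node(p->cpus_ptr, node, 0);
            if (cpu >= 0)
                    return cpu;
            return prev_cpu;
    }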
Revision tags: v6.14-rc3

48849271 | 14-Feb-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Per-node idle cpumasks
Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all the CPUs can create a really intense read/write activity on the single global cpumask.
Therefore, split the global cpumask into multiple per-NUMA node cpumasks to improve scalability and performance on large systems.
The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way concurrent access to the idle cpumask will be restricted within each NUMA node.
The split of multiple per-node idle cpumasks can be controlled using the SCX_OPS_BUILTIN_IDLE_PER_NODE flag.
By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global host-wide idle cpumask is used, maintaining the previous behavior.
NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will trigger an scx error, since there are no system-wide cpumasks.
= Test =
Hardware:
 - System: DGX B200
 - CPUs: 224 SMT threads (112 physical cores)
 - Processor: INTEL(R) XEON(R) PLATINUM 8570
 - 2 NUMA nodes

Scheduler:
 - scx_simple [1] (so that we can focus on the built-in idle selection policy and not on the scheduling policy itself)

Test:
 - Run a parallel kernel build `make -j $(nproc)` and measure the average elapsed time over 10 runs:

             avg time | stdev
             ---------+------
    before:  52.431s  | 2.895
    after:   50.342s  | 2.895
= Conclusion =
Splitting the global cpumask into multiple per-NUMA cpumasks helped to achieve a speedup of approximately +4% with this particular architecture and test case.
The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.
On smaller systems, I haven't noticed any measurable regressions or improvements with the same test (parallel kernel build) and scheduler (scx_simple).
Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs I observed an additional +2-2.5% performance improvement with the same test.
[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c
Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
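Conceptually, the split replaces one global pair of masks with a per-node array, along the lines of this sketch (the actual struct and variable names in ext_idle.c may differ):

    struct idle_masks_sketch {
            cpumask_var_t cpu;  /* idle CPUs within this node */
            cpumask_var_t smt;  /* fully idle SMT cores within this node */
    };

    /* One entry per NUMA node when SCX_OPS_BUILTIN_IDLE_PER_NODE is set,
     * a single host-wide entry otherwise. */
    static struct idle_masks_sketch **scx_idle_masks;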
0aaaf89d | 14-Feb-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows BPF schedulers to select between using a global flat idle cpumask or multiple per-node cpumasks.
This only introduces the flag and the mechanism to enable/disable this feature without affecting any scheduling behavior.
Cc: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Yury Norov [NVIDIA] <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
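A scheduler opts in via its ops flags, roughly like this (a sketch using the SCX_OPS_DEFINE() convenience macro from the scx userspace headers; the scheduler and callback names are made up):

    SCX_OPS_DEFINE(example_ops,
                   .select_cpu = (void *)example_select_cpu,
                   .flags      = SCX_OPS_BUILTIN_IDLE_PER_NODE,
                   .name       = "example");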
d73249f8 | 14-Feb-2025 | Andrea Righi <[email protected]>
sched_ext: idle: Make idle static keys private
Make all the static keys used by the idle CPU selection policy private to ext_idle.c. This avoids unnecessary exposure in headers and improves code encapsulation.
Cc: Yury Norov <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
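The keys in question are ordinary static branches; after this change a definition like the following lives in ext_idle.c rather than being declared in a shared header (the key name is taken as representative; the full set may differ):

    /* Private to ext_idle.c: */
    static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);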
Revision tags: v6.14-rc2, v6.14-rc1

337d1b35 | 27-Jan-2025 | Andrea Righi <[email protected]>
sched_ext: Move built-in idle CPU selection policy to a separate file
As ext.c is becoming quite large, move the idle CPU selection policy to separate files (ext_idle.c / ext_idle.h) for better code readability.
Moreover, group all the idle CPU selection kfuncs together into the same btf_kfunc_id_set block.
No functional changes, this is purely code reorganization.
Suggested-by: Yury Norov <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
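Grouping kfuncs in one btf_kfunc_id_set follows the standard BPF registration pattern, roughly as below (a sketch with a representative subset of the idle kfuncs; the set name and exact flags are assumptions):

    BTF_KFUNCS_START(scx_kfunc_ids_idle_sketch)
    BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
    BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
    BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
    BTF_KFUNCS_END(scx_kfunc_ids_idle_sketch)

    static const struct btf_kfunc_id_set scx_kfunc_set_idle_sketch = {
            .owner = THIS_MODULE,
            .set   = &scx_kfunc_ids_idle_sketch,
    };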