|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1 |
|
| #
8fa7292f |
| 05-Apr-2025 |
Thomas Gleixner <[email protected]> |
treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines.
Conversion was done with c
treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
show more ...
|
|
Revision tags: v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6 |
|
| #
a6fd1614 |
| 03-Jan-2025 |
Yafang Shao <[email protected]> |
sched, psi: Don't account irq time if sched_clock_irqtime is disabled
sched_clock_irqtime may be disabled due to the clock source. When disabled, irq_time_read() won't change over time, so there is
sched, psi: Don't account irq time if sched_clock_irqtime is disabled
sched_clock_irqtime may be disabled due to the clock source. When disabled, irq_time_read() won't change over time, so there is nothing to account. We can save iterating the whole hierarchy on every tick and context switch.
Signed-off-by: Yafang Shao <[email protected]> Signed-off-by: Yafang Shao <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2 |
|
| #
3840cbe2 |
| 03-Oct-2024 |
Johannes Weiner <[email protected]> |
sched: psi: fix bogus pressure spikes from aggregation race
Brandon reports sporadic, non-sensical spikes in cumulative pressure time (total=) when reading cpu.pressure at a high rate. This is due t
sched: psi: fix bogus pressure spikes from aggregation race
Brandon reports sporadic, non-sensical spikes in cumulative pressure time (total=) when reading cpu.pressure at a high rate. This is due to a race condition between reader aggregation and tasks changing states.
While it affects all states and all resources captured by PSI, in practice it most likely triggers with CPU pressure, since scheduling events are so frequent compared to other resource events.
The race context is the live snooping of ongoing stalls during a pressure read. The read aggregates per-cpu records for stalls that have concluded, but will also incorporate ad-hoc the duration of any active state that hasn't been recorded yet. This is important to get timely measurements of ongoing stalls. Those ad-hoc samples are calculated on-the-fly up to the current time on that CPU; since the stall hasn't concluded, it's expected that this is the minimum amount of stall time that will enter the per-cpu records once it does.
The problem is that the path that concludes the state uses a CPU clock read that is not synchronized against aggregators; the clock is read outside of the seqlock protection. This allows aggregators to race and snoop a stall with a longer duration than will actually be recorded.
With the recorded stall time being less than the last snapshot remembered by the aggregator, a subsequent sample will underflow and observe a bogus delta value, resulting in an erratic jump in pressure.
Fix this by moving the clock read of the state change into the seqlock protection. This ensures no aggregation can snoop live stalls past the time that's recorded when the state concludes.
Reported-by: Brandon Duffany <[email protected]> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194 Link: https://lore.kernel.org/lkml/[email protected]/ Fixes: df77430639c9 ("psi: Reduce calls to sched_clock() in psi") Cc: [email protected] Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6 |
|
| #
0ec208ce |
| 25-Jun-2024 |
Tvrtko Ursulin <[email protected]> |
sched/psi: Optimise psi_group_change a bit
The current code loops over the psi_states only to call a helper which then resolves back to the action needed for each state using a switch statement. Tha
sched/psi: Optimise psi_group_change a bit
The current code loops over the psi_states only to call a helper which then resolves back to the action needed for each state using a switch statement. That is effectively creating a double indirection of a kind which, given how all the states need to be explicitly listed and handled anyway, we can simply remove. Both the for loop and the switch statement that is.
The benefit is both in the code size and CPU time spent in this function. YMMV but on my Steam Deck, while in a game, the patch makes the CPU usage go from ~2.4% down to ~1.2%. Text size at the same time went from 0x323 to 0x2c1.
Signed-off-by: Tvrtko Ursulin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v6.10-rc5 |
|
| #
ddae0ca2 |
| 18-Jun-2024 |
John Stultz <[email protected]> |
sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath
It was reported that in moving to 6.1, a larger then 10% regression was seen in the performance of clock_gettime(CLOCK_THREAD_
sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath
It was reported that in moving to 6.1, a larger then 10% regression was seen in the performance of clock_gettime(CLOCK_THREAD_CPUTIME_ID,...).
Using a simple reproducer, I found: 5.10: 100000000 calls in 24345994193 ns => 243.460 ns per call 100000000 calls in 24288172050 ns => 242.882 ns per call 100000000 calls in 24289135225 ns => 242.891 ns per call
6.1: 100000000 calls in 28248646742 ns => 282.486 ns per call 100000000 calls in 28227055067 ns => 282.271 ns per call 100000000 calls in 28177471287 ns => 281.775 ns per call
The cause of this was finally narrowed down to the addition of psi_account_irqtime() in update_rq_clock_task(), in commit 52b1364ba0b1 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure").
In my initial attempt to resolve this, I leaned towards moving all accounting work out of the clock_gettime() call path, but it wasn't very pretty, so it will have to wait for a later deeper rework. Instead, Peter shared this approach:
Rework psi_account_irqtime() to use its own psi_irq_time base for accounting, and move it out of the hotpath, calling it instead from sched_tick() and __schedule().
In testing this, we found the importance of ensuring psi_account_irqtime() is run under the rq_lock, which Johannes Weiner helpfully explained, so also add some lockdep annotations to make that requirement clear.
With this change the performance is back in-line with 5.10: 6.1+fix: 100000000 calls in 24297324597 ns => 242.973 ns per call 100000000 calls in 24318869234 ns => 243.189 ns per call 100000000 calls in 24291564588 ns => 242.916 ns per call
Reported-by: Jimmy Shiu <[email protected]> Originally-by: Peter Zijlstra <[email protected]> Signed-off-by: John Stultz <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Reviewed-by: Qais Yousef <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v6.10-rc4, v6.10-rc3, v6.10-rc2 |
|
| #
402de7fc |
| 27-May-2024 |
Ingo Molnar <[email protected]> |
sched: Fix spelling in comments
Do a spell-checking pass.
Signed-off-by: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Signed-off-by: Ing
sched: Fix spelling in comments
Do a spell-checking pass.
Signed-off-by: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7 |
|
| #
7b3d8df5 |
| 16-Oct-2023 |
Fan Yu <[email protected]> |
sched/psi: Update poll => rtpoll in relevant comments
The PSI trigger code is now making a distinction between privileged and unprivileged triggers, after the following commit:
65457b74aa94 ("sche
sched/psi: Update poll => rtpoll in relevant comments
The PSI trigger code is now making a distinction between privileged and unprivileged triggers, after the following commit:
65457b74aa94 ("sched/psi: Rename existing poll members in preparation")
But some comments have not been modified along with the code, so they need to be updated.
This will help readers better understand the code.
Signed-off-by: Fan Yu <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Peter Ziljstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v6.6-rc6, v6.6-rc5, v6.6-rc4 |
|
| #
0c292407 |
| 26-Sep-2023 |
Haifeng Xu <[email protected]> |
sched/psi: Bail out early from irq time accounting
We could bail out early when psi was disabled.
Signed-off-by: Haifeng Xu <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <peterz@inf
sched/psi: Bail out early from irq time accounting
We could bail out early when psi was disabled.
Signed-off-by: Haifeng Xu <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
3657680f |
| 10-Oct-2023 |
Yang Yang <[email protected]> |
sched/psi: Delete the 'update_total' function parameter from update_triggers()
The 'update_total' parameter of update_triggers() is always true after the previous commit:
80cc1d1d5ee3 ("sched/psi
sched/psi: Delete the 'update_total' function parameter from update_triggers()
The 'update_total' parameter of update_triggers() is always true after the previous commit:
80cc1d1d5ee3 ("sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes")
If the 'changed_states & group->rtpoll_states' condition is true, 'new_stall' in update_triggers() will be true, and then 'update_total' should also be true.
So update_total is redundant - remove it.
[ mingo: Changelog updates ]
Signed-off-by: Yang Yang <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Peter Ziljstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
80cc1d1d |
| 10-Oct-2023 |
Yang Yang <[email protected]> |
sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
When psimon wakes up and there are no state changes for ->rtpoll_states, it's unnecessary to update triggers
sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
When psimon wakes up and there are no state changes for ->rtpoll_states, it's unnecessary to update triggers and ->rtpoll_total because the pressures being monitored by the user have not changed.
This will help to slightly reduce unnecessary computations of PSI.
[ mingo: Changelog updates ]
Signed-off-by: Yang Yang <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Peter Ziljstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
e03dc9fa |
| 09-Oct-2023 |
Yang Yang <[email protected]> |
sched/psi: Change update_triggers() to a 'void' function
Update_triggers() always returns now + group->rtpoll_min_period, and the return value is only used by psi_rtpoll_work(), so change update_tri
sched/psi: Change update_triggers() to a 'void' function
Update_triggers() always returns now + group->rtpoll_min_period, and the return value is only used by psi_rtpoll_work(), so change update_triggers() to a void function, let group->rtpoll_next_update = now + group->rtpoll_min_period directly.
This will avoid unnecessary function return value passing & simplifies the function.
[ mingo: Updated changelog ]
Suggested-by: Suren Baghdasaryan <[email protected]> Signed-off-by: Yang Yang <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4 |
|
| #
35cd21f6 |
| 25-May-2023 |
Miaohe Lin <[email protected]> |
sched/psi: make psi_cgroups_enabled static
The static key psi_cgroups_enabled is only used inside file psi.c. Make it static.
Signed-off-by: Miaohe Lin <[email protected]> Signed-off-by: Peter Z
sched/psi: make psi_cgroups_enabled static
The static key psi_cgroups_enabled is only used inside file psi.c. Make it static.
Signed-off-by: Miaohe Lin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
aff03707 |
| 30-Jun-2023 |
Suren Baghdasaryan <[email protected]> |
sched/psi: use kernfs polling functions for PSI trigger polling
Destroying psi trigger in cgroup_file_release causes UAF issues when a cgroup is removed from under a polling process. This is happeni
sched/psi: use kernfs polling functions for PSI trigger polling
Destroying psi trigger in cgroup_file_release causes UAF issues when a cgroup is removed from under a polling process. This is happening because cgroup removal causes a call to cgroup_file_release while the actual file is still alive. Destroying the trigger at this point would also destroy its waitqueue head and if there is still a polling process on that file accessing the waitqueue, it will step on the freed pointer:
do_select vfs_poll do_rmdir cgroup_rmdir kernfs_drain_open_files cgroup_file_release cgroup_pressure_release psi_trigger_destroy wake_up_pollfree(&t->event_wait) // vfs_poll is unblocked synchronize_rcu kfree(t) poll_freewait -> UAF access to the trigger's waitqueue head
Patch [1] fixed this issue for epoll() case using wake_up_pollfree(), however the same issue exists for synchronous poll() case. The root cause of this issue is that the lifecycles of the psi trigger's waitqueue and of the file associated with the trigger are different. Fix this by using kernfs_generic_poll function when polling on cgroup-specific psi triggers. It internally uses kernfs_open_node->poll waitqueue head with its lifecycle tied to the file's lifecycle. This also renders the fix in [1] obsolete, so revert it.
[1] commit c2dbe32d5db5 ("sched/psi: Fix use-after-free in ep_remove_wait_queue()")
Fixes: 0e94682b73bf ("psi: introduce psi monitor") Closes: https://lore.kernel.org/all/[email protected]/ Reported-by: Lu Jialin <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v6.4-rc3, v6.4-rc2 |
|
| #
e2a1f85b |
| 14-May-2023 |
Yang Yang <[email protected]> |
sched/psi: Avoid resetting the min update period when it is unnecessary
Psi_group's poll_min_period is determined by the minimum window size of psi_trigger when creating new triggers. While destroyi
sched/psi: Avoid resetting the min update period when it is unnecessary
Psi_group's poll_min_period is determined by the minimum window size of psi_trigger when creating new triggers. While destroying a psi_trigger, there is no need to reset poll_min_period if the psi_trigger being destroyed did not have the minimum window size, since in this condition poll_min_period will remain the same as before.
Signed-off-by: Yang Yang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Suren Baghdasaryan <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1 |
|
| #
519fabc7 |
| 03-Mar-2023 |
Suren Baghdasaryan <[email protected]> |
psi: remove 500ms min window size limitation for triggers
Current 500ms min window size for psi triggers limits polling interval to 50ms to prevent polling threads from using too much cpu bandwidth
psi: remove 500ms min window size limitation for triggers
Current 500ms min window size for psi triggers limits polling interval to 50ms to prevent polling threads from using too much cpu bandwidth by polling too frequently. However the number of cgroups with triggers is unlimited, so this protection can be defeated by creating multiple cgroups with psi triggers (triggers in each cgroup are served by a single "psimon" kernel thread). Instead of limiting min polling period, which also limits the latency of psi events, it's better to limit psi trigger creation to authorized users only, like we do for system-wide psi triggers (/proc/pressure/* files can be written only by processes with CAP_SYS_RESOURCE capability). This also makes access rules for cgroup psi files consistent with system-wide ones. Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and remove the psi window min size limitation.
Suggested-by: Sudarshan Rajagopalan <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lore.kernel.org/all/[email protected]/
show more ...
|
| #
d82caa27 |
| 30-Mar-2023 |
Domenico Cerasuolo <[email protected]> |
sched/psi: Allow unprivileged polling of N*2s period
PSI offers 2 mechanisms to get information about a specific resource pressure. One is reading from /proc/pressure/<resource>, which gives average
sched/psi: Allow unprivileged polling of N*2s period
PSI offers 2 mechanisms to get information about a specific resource pressure. One is reading from /proc/pressure/<resource>, which gives average pressures aggregated every 2s. The other is creating a pollable fd for a specific resource and cgroup.
The trigger creation requires CAP_SYS_RESOURCE, and gives the possibility to pick specific time window and threshold, spawing an RT thread to aggregate the data.
Systemd would like to provide containers the option to monitor pressure on their own cgroup and sub-cgroups. For example, if systemd launches a container that itself then launches services, the container should have the ability to poll() for pressure in individual services. But neither the container nor the services are privileged.
This patch implements a mechanism to allow unprivileged users to create pressure triggers. The difference with privileged triggers creation is that unprivileged ones must have a time window that's a multiple of 2s. This is so that we can avoid unrestricted spawning of rt threads, and use instead the same aggregation mechanism done for the averages, which runs independently of any triggers.
Suggested-by: Johannes Weiner <[email protected]> Signed-off-by: Domenico Cerasuolo <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
4468fcae |
| 30-Mar-2023 |
Domenico Cerasuolo <[email protected]> |
sched/psi: Extract update_triggers side effect
This change moves update_total flag out of update_triggers function, currently called only in psi_poll_work. In the next patch, update_triggers will be
sched/psi: Extract update_triggers side effect
This change moves update_total flag out of update_triggers function, currently called only in psi_poll_work. In the next patch, update_triggers will be called also in psi_avgs_work, but the total update information is specific to psi_poll_work. Returning update_total value to the caller let us avoid differentiating the implementation of update_triggers for different aggregators.
Suggested-by: Johannes Weiner <[email protected]> Signed-off-by: Domenico Cerasuolo <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
65457b74 |
| 30-Mar-2023 |
Domenico Cerasuolo <[email protected]> |
sched/psi: Rename existing poll members in preparation
Renaming in PSI implementation to make a clear distinction between privileged and unprivileged triggers code to be implemented in the next patc
sched/psi: Rename existing poll members in preparation
Renaming in PSI implementation to make a clear distinction between privileged and unprivileged triggers code to be implemented in the next patch.
Suggested-by: Johannes Weiner <[email protected]> Signed-off-by: Domenico Cerasuolo <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
7fab21fa |
| 30-Mar-2023 |
Domenico Cerasuolo <[email protected]> |
sched/psi: Rearrange polling code in preparation
Move a few functions up in the file to avoid forward declaration needed in the patch implementing unprivileged PSI triggers.
Suggested-by: Johannes
sched/psi: Rearrange polling code in preparation
Move a few functions up in the file to avoid forward declaration needed in the patch implementing unprivileged PSI triggers.
Suggested-by: Johannes Weiner <[email protected]> Signed-off-by: Domenico Cerasuolo <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v6.2 |
|
| #
c2dbe32d |
| 14-Feb-2023 |
Munehisa Kamata <[email protected]> |
sched/psi: Fix use-after-free in ep_remove_wait_queue()
If a non-root cgroup gets removed when there is a thread that registered trigger and is polling on a pressure file within the cgroup, the poll
sched/psi: Fix use-after-free in ep_remove_wait_queue()
If a non-root cgroup gets removed when there is a thread that registered trigger and is polling on a pressure file within the cgroup, the polling waitqueue gets freed in the following path:
do_rmdir cgroup_rmdir kernfs_drain_open_files cgroup_file_release cgroup_pressure_release psi_trigger_destroy
However, the polling thread still has a reference to the pressure file and will access the freed waitqueue when the file is closed or upon exit:
fput ep_eventpoll_release ep_free ep_remove_wait_queue remove_wait_queue
This results in use-after-free as pasted below.
The fundamental problem here is that cgroup_file_release() (and consequently waitqueue's lifetime) is not tied to the file's real lifetime. Using wake_up_pollfree() here might be less than ideal, but it is in line with the comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's one and can be considered as another special case. While this would be fixable by somehow making cgroup_file_release() be tied to the fput(), it would require sizable refactoring at cgroups or higher layer which might be more justifiable if we identify more cases like this.
BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0 Write of size 4 at addr ffff88810e625328 by task a.out/4404
CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38 Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017 Call Trace: <TASK> dump_stack_lvl+0x73/0xa0 print_report+0x16c/0x4e0 kasan_report+0xc3/0xf0 kasan_check_range+0x2d2/0x310 _raw_spin_lock_irqsave+0x60/0xc0 remove_wait_queue+0x1a/0xa0 ep_free+0x12c/0x170 ep_eventpoll_release+0x26/0x30 __fput+0x202/0x400 task_work_run+0x11d/0x170 do_exit+0x495/0x1130 do_group_exit+0x100/0x100 get_signal+0xd67/0xde0 arch_do_signal_or_restart+0x2a/0x2b0 exit_to_user_mode_prepare+0x94/0x100 syscall_exit_to_user_mode+0x20/0x40 do_syscall_64+0x52/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd </TASK>
Allocated by task 4404:
kasan_set_track+0x3d/0x60 __kasan_kmalloc+0x85/0x90 psi_trigger_create+0x113/0x3e0 pressure_write+0x146/0x2e0 cgroup_file_write+0x11c/0x250 kernfs_fop_write_iter+0x186/0x220 vfs_write+0x3d8/0x5c0 ksys_write+0x90/0x110 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd
Freed by task 4407:
kasan_set_track+0x3d/0x60 kasan_save_free_info+0x27/0x40 ____kasan_slab_free+0x11d/0x170 slab_free_freelist_hook+0x87/0x150 __kmem_cache_free+0xcb/0x180 psi_trigger_destroy+0x2e8/0x310 cgroup_file_release+0x4f/0xb0 kernfs_drain_open_files+0x165/0x1f0 kernfs_drain+0x162/0x1a0 __kernfs_remove+0x1fb/0x310 kernfs_remove_by_name_ns+0x95/0xe0 cgroup_addrm_files+0x67f/0x700 cgroup_destroy_locked+0x283/0x3c0 cgroup_rmdir+0x29/0x100 kernfs_iop_rmdir+0xd1/0x140 vfs_rmdir+0xfe/0x240 do_rmdir+0x13d/0x280 __x64_sys_rmdir+0x2c/0x30 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: 0e94682b73bf ("psi: introduce psi monitor") Signed-off-by: Munehisa Kamata <[email protected]> Signed-off-by: Mengchi Cheng <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Suren Baghdasaryan <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3 |
|
| #
710ffe67 |
| 28-Oct-2022 |
Suren Baghdasaryan <[email protected]> |
sched/psi: Stop relying on timer_pending() for poll_work rescheduling
Psi polling mechanism is trying to minimize the number of wakeups to run psi_poll_work and is currently relying on timer_pending
sched/psi: Stop relying on timer_pending() for poll_work rescheduling
Psi polling mechanism is trying to minimize the number of wakeups to run psi_poll_work and is currently relying on timer_pending() to detect when this work is already scheduled. This provides a window of opportunity for psi_group_change to schedule an immediate psi_poll_work after poll_timer_fn got called but before psi_poll_work could reschedule itself. Below is the depiction of this entire window:
poll_timer_fn wake_up_interruptible(&group->poll_wait);
psi_poll_worker wait_event_interruptible(group->poll_wait, ...) psi_poll_work psi_schedule_poll_work if (timer_pending(&group->poll_timer)) return; ... mod_timer(&group->poll_timer, jiffies + delay);
Prior to 461daba06bdc we used to rely on poll_scheduled atomic which was reset and set back inside psi_poll_work and therefore this race window was much smaller. The larger window causes increased number of wakeups and our partners report visible power regression of ~10mA after applying 461daba06bdc. Bring back the poll_scheduled atomic and make this race window even narrower by resetting poll_scheduled only when we reach polling expiration time. This does not completely eliminate the possibility of extra wakeups caused by a race with psi_group_change however it will limit it to the worst case scenario of one extra wakeup per every tracking window (0.5s in the worst case). This patch also ensures correct ordering between clearing poll_scheduled flag and obtaining changed_states using memory barrier. Correct ordering between updating changed_states and setting poll_scheduled is ensured by atomic_xchg operation. By tracing the number of immediate rescheduling attempts performed by psi_group_change and the number of these attempts being blocked due to psi monitor being already active, we can assess the effects of this change:
Before the patch: Run#1 Run#2 Run#3 Immediate reschedules attempted: 684365 1385156 1261240 Immediate reschedules blocked: 682846 1381654 1258682 Immediate reschedules (delta): 1519 3502 2558 Immediate reschedules (% of attempted): 0.22% 0.25% 0.20%
After the patch: Run#1 Run#2 Run#3 Immediate reschedules attempted: 882244 770298 426218 Immediate reschedules blocked: 881996 769796 426074 Immediate reschedules (delta): 248 502 144 Immediate reschedules (% of attempted): 0.03% 0.07% 0.03%
The number of non-blocked immediate reschedules dropped from 0.22-0.25% to 0.03-0.07%. The drop is attributed to the decrease in the race window size and the fact that we allow this race only when psi monitors reach polling window expiration time.
Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism") Reported-by: Kathleen Chang <[email protected]> Reported-by: Wenju Xu <[email protected]> Reported-by: Jonathan Chen <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Acked-by: Johannes Weiner <[email protected]> Tested-by: SH Chen <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v6.1-rc2, v6.1-rc1 |
|
| #
2fcd7bba |
| 14-Oct-2022 |
Chengming Zhou <[email protected]> |
sched/psi: Fix avgs_work re-arm in psi_avgs_work()
Pavan reported a problem that PSI avgs_work idle shutoff is not working at all. Because PSI_NONIDLE condition would be observed in psi_avgs_work()-
sched/psi: Fix avgs_work re-arm in psi_avgs_work()
Pavan reported a problem that PSI avgs_work idle shutoff is not working at all. Because PSI_NONIDLE condition would be observed in psi_avgs_work()->collect_percpu_times()->get_recent_times() even if only the kworker running avgs_work on the CPU.
Although commit 1b69ac6b40eb ("psi: fix aggregation idle shut-off") avoided the ping-pong wake problem when the worker sleep, psi_avgs_work() still will always re-arm the avgs_work, so shutoff is not working.
This patch changes to use PSI_STATE_RESCHEDULE to flag whether to re-arm avgs_work in get_recent_times(). For the current CPU, we re-arm avgs_work only when (NR_RUNNING > 1 || NR_IOWAIT > 0 || NR_MEMSTALL > 0), for other CPUs we can just check PSI_NONIDLE delta. The new flag is only used in psi_avgs_work(), so we check in get_recent_times() that current_work() is avgs_work.
One potential problem is that the brief period of non-idle time incurred between the aggregation run and the kworker's dequeue will be stranded in the per-cpu buckets until avgs_work run next time. The buckets can hold 4s worth of time, and future activity will wake the avgs_work with a 2s delay, giving us 2s worth of data we can leave behind when shut off the avgs_work. If the kworker run other works after avgs_work shut off and doesn't have any scheduler activities for 2s, this maybe a problem.
Reported-by: Pavan Kondeti <[email protected]> Signed-off-by: Chengming Zhou <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Suren Baghdasaryan <[email protected]> Tested-by: Chengming Zhou <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v6.0, v6.0-rc7 |
|
| #
e38f89af |
| 19-Sep-2022 |
Hao Lee <[email protected]> |
sched/psi: Fix possible missing or delayed pending event
When a pending event exists and growth is less than the threshold, the current logic is to skip this trigger without generating event. Howeve
sched/psi: Fix possible missing or delayed pending event
When a pending event exists and growth is less than the threshold, the current logic is to skip this trigger without generating event. However, from e6df4ead85d9 ("psi: fix possible trigger missing in the window"), our purpose is to generate event as long as pending event exists and the rate meets the limit, no matter what growth is. This patch handles this case properly.
Fixes: e6df4ead85d9 ("psi: fix possible trigger missing in the window") Signed-off-by: Hao Lee <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Suren Baghdasaryan <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v6.0-rc6 |
|
| #
527eb453 |
| 15-Sep-2022 |
Christoph Hellwig <[email protected]> |
sched/psi: export psi_memstall_{enter,leave}
To properly account for all refaults from file system logic, file systems need to call psi_memstall_enter directly, so export it.
Signed-off-by: Christo
sched/psi: export psi_memstall_{enter,leave}
To properly account for all refaults from file system logic, file systems need to call psi_memstall_enter directly, so export it.
Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v6.0-rc5 |
|
| #
34f26a15 |
| 07-Sep-2022 |
Chengming Zhou <[email protected]> |
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it at each level of the hierarchy. This may cause non-negligible overhe
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it at each level of the hierarchy. This may cause non-negligible overhead for some workloads when under deep level of the hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") make PSI to skip per-cgroup stall accounting, only account system-wide to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable interface "cgroup.pressure", which is a read-write single value file that allowed values are "0" and "1", the defaults is "1" so per-cgroup PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable state aggregation, time tracking, averaging on a per-cgroup level, if we can live with losing history from while it was disabled. I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates, which is not implemented in this patch. So we always update groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <[email protected]> Suggested-by: Tejun Heo <[email protected]> Signed-off-by: Chengming Zhou <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
show more ...
|