Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1
7d6c63c3 | 01-Apr-2025 | Shakeel Butt <[email protected]>
cgroup: rstat: call cgroup_rstat_updated_list with cgroup_rstat_lock
The commit 093c8812de2d ("cgroup: rstat: Cleanup flushing functions and locking") accidentally changed the code during cleanup to call cgroup_rstat_updated_list() without holding cgroup_rstat_lock, which is required. Fix it.
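A minimal sketch of the corrected ordering (simplified, not the verbatim kernel code; helper names follow the lock helpers added in fc29e04ae1ad, with the flush work condensed):

    /* The per-CPU updated tree must be popped while cgroup_rstat_lock is
     * held, so take the lock before calling cgroup_rstat_updated_list().
     */
    for_each_possible_cpu(cpu) {
            struct cgroup *pos;

            __cgroup_rstat_lock(cgrp, cpu);
            pos = cgroup_rstat_updated_list(cgrp, cpu);
            for (; pos; pos = pos->rstat_flush_next)
                    cgroup_base_stat_flush(pos, cpu);
            __cgroup_rstat_unlock(cgrp, cpu);
    }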
Fixes: 093c8812de2d ("cgroup: rstat: Cleanup flushing functions and locking")
Reported-by: Jakub Kicinski <[email protected]>
Reported-by: Breno Leitao <[email protected]>
Reported-by: Venkat Rao Bagalkote <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Shakeel Butt <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.14

093c8812 | 19-Mar-2025 | Yosry Ahmed <[email protected]>
cgroup: rstat: Cleanup flushing functions and locking
Now that the rstat lock is re-acquired on every CPU iteration in cgroup_rstat_flush_locked(), acquiring the lock up front is unnecessary and unclear.
Inline cgroup_rstat_flush_locked() into cgroup_rstat_flush() and move the lock/unlock calls to the beginning and ending of the loop body to make the critical section obvious.
cgroup_rstat_flush_hold/release() do not make much sense with the lock being dropped and reacquired internally. Since they have no external callers, remove them and explicitly acquire the lock in cgroup_base_stat_cputime_show() instead.
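A rough sketch of the resulting show path (hedged; based on the lock helpers this file already uses, with the stat read condensed to a comment):

    /* Flush first, then take cgroup_rstat_lock explicitly around the
     * read, instead of the old cgroup_rstat_flush_hold/release() pair.
     */
    cgroup_rstat_flush(cgrp);
    __cgroup_rstat_lock(cgrp, -1);    /* -1: not inside the CPU loop */
    /* ... snapshot cgrp->bstat for seq_file output ... */
    __cgroup_rstat_unlock(cgrp, -1);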
This leaves the code with a single flushing function, cgroup_rstat_flush().
Signed-off-by: Yosry Ahmed <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

0efc297a | 19-Mar-2025 | Eric Dumazet <[email protected]>
cgroup/rstat: avoid disabling irqs for O(num_cpu)
cgroup_rstat_flush_locked() grabs the irq-safe cgroup_rstat_lock while iterating all possible cpus. It only drops the lock if there is scheduler or spin lock contention. If neither, then interrupts can be disabled for a long time. On large machines this can disable interrupts for long enough to drop network packets. On 400+ CPU machines I've seen interrupts disabled for over 40 msec.

Prevent rstat from disabling interrupts while processing all possible cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This approach was previously discussed in https://lore.kernel.org/lkml/ZBz%2FV5a7%[email protected]/, though this was in the context of a non-irq rstat spin lock.
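A hedged sketch of the reworked loop (simplified; the flush work itself is condensed to a comment):

    /* Bound the irqs-off window to a single CPU's worth of flushing
     * instead of holding the lock across every possible CPU.
     */
    for_each_possible_cpu(cpu) {
            spin_lock_irq(&cgroup_rstat_lock);     /* irqs off: one CPU only */
            /* ... pop and flush this CPU's updated tree ... */
            spin_unlock_irq(&cgroup_rstat_lock);   /* irqs on between CPUs */
    }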
Benchmark this change with:
1) a single stat_reader process with 400 threads, each reading a test memcg's memory.stat repeatedly for 10 seconds.
2) 400 memory hog processes running in the test memcg and repeatedly charging memory until oom killed. Then they repeat charging and oom killing.
v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs finds interrupts are disabled by rstat for 45341 usec:

    # => started at: _raw_spin_lock_irq
    # => ended at:   cgroup_rstat_flush
    #
    #                  _------=> CPU#
    #                 / _-----=> irqs-off/BH-disabled
    #                | / _----=> need-resched
    #                || / _---=> hardirq/softirq
    #                ||| / _--=> preempt-depth
    #                |||| / _-=> migrate-disable
    #                ||||| /     delay
    #  cmd     pid   |||||| time  |   caller
    #     \   /      ||||||  \    |    /
    stat_rea-96532   52d....    0us*: _raw_spin_lock_irq
    stat_rea-96532   52d.... 45342us : cgroup_rstat_flush
    stat_rea-96532   52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
    stat_rea-96532   52d.... 45343us : <stack trace>
     => memcg1_stat_format
     => memory_stat_format
     => memory_stat_show
     => seq_read_iter
     => vfs_read
     => ksys_read
     => do_syscall_64
     => entry_SYSCALL_64_after_hwframe
With this patch CONFIG_IRQSOFF_TRACER no longer finds rstat to be the longest irqs-off holder. The longest holder has irqs disabled for 4142 usec, a huge reduction from the previous 45341 usec rstat finding.
Running the stat_reader memory.stat reader for 10 seconds:
- without memory hogs: 9.84M accesses => 12.7M accesses
- with memory hogs:    9.46M accesses => 11.1M accesses
The throughput of memory.stat access improves.
The mode of memory.stat access latency after grouping into power-of-2 buckets:
- without memory hogs: 64 usec => 16 usec
- with memory hogs:    64 usec =>  8 usec
The memory.stat latency improves.
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Tested-by: Greg Thelen <[email protected]>
Acked-by: Michal Koutný <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2

c4af66a9 | 09-Feb-2025 | Abel Wu <[email protected]>
cgroup/rstat: Fix forceidle time in cpu.stat
The commit b824766504e4 ("cgroup/rstat: add force idle show helper") retrieves forceidle_time outside cgroup_rstat_lock for non-root cgroups, which can be potentially inconsistent with other stats.
Rather than reverting that commit, fix it in a way that retains the effort of cleaning up the ifdef-messes.
Fixes: b824766504e4 ("cgroup/rstat: add force idle show helper")
Signed-off-by: Abel Wu <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

db5fd3cf | 07-Feb-2025 | Muhammad Adeel <[email protected]>
cgroup: Remove steal time from usage_usec
The CPU usage time is the time when user, system, or both are using the CPU. Steal time is the time when the CPU is waiting to be run by the hypervisor. It should not be added to the CPU usage time, hence remove it from the usage_usec entry.
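A hedged sketch of the intended accumulation for the root cgroup (field names come from enum cpu_usage_stat; this is not the verbatim kernel code):

    u64 user = cpustat[CPUTIME_USER] + cpustat[CPUTIME_NICE];
    u64 sys  = cpustat[CPUTIME_SYSTEM] + cpustat[CPUTIME_IRQ] +
               cpustat[CPUTIME_SOFTIRQ];

    cputime->utime += user;
    cputime->stime += sys;
    cputime->sum_exec_runtime += user + sys;
    /* no cpustat[CPUTIME_STEAL] contribution to usage_usec */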
Fixes: 936f2a70f2077 ("cgroup: add cpu.stat file to root cgroup")
Acked-by: Axel Busch <[email protected]>
Acked-by: Michal Koutný <[email protected]>
Signed-off-by: Muhammad Adeel <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2

aefa398d | 02-Oct-2024 | Joshua Hahn <[email protected]>
cgroup/rstat: Tracking cgroup-level niced CPU time
Cgroup-level CPU statistics currently include time spent on user/system processes, but do not include niced CPU time (despite it already being tracked). This patch exposes niced CPU time to userspace, allowing users to better understand their hardware limits and facilitating more informed workload distribution.
A new field 'ntime' is added to struct cgroup_base_stat as opposed to struct task_cputime to minimize footprint.
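A hedged sketch of where the field lands (condensed from struct cgroup_base_stat):

    struct cgroup_base_stat {
            struct task_cputime cputime;
    #ifdef CONFIG_SCHED_CORE
            u64 forceidle_sum;
    #endif
            u64 ntime;      /* niced CPU time, surfaced as nice_usec */
    };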
Signed-off-by: Joshua Hahn <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7

b8247665 | 04-Jul-2024 | Chen Ridong <[email protected]>
cgroup/rstat: add force idle show helper
In the function cgroup_base_stat_cputime_show there are five instances of #ifdef, which make the code hard to follow. To address this, add the function cgroup_force_idle_show to make the code more succinct.
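A hedged sketch of the consolidated helper (modeled on the in-tree code; the #ifdef now lives in one place instead of at every print site):

    static void cgroup_force_idle_show(struct seq_file *seq,
                                       struct cgroup_base_stat *bstat)
    {
    #ifdef CONFIG_SCHED_CORE
            u64 forceidle_time = bstat->forceidle_sum;

            do_div(forceidle_time, NSEC_PER_USEC);
            seq_printf(seq, "core_sched.force_idle_usec %llu\n", forceidle_time);
    #endif
    }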
Signed-off-by: Chen Ridong <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7

21c38a3b | 01-May-2024 | Jesper Dangaard Brouer <[email protected]>
cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints
This closely resembles helpers added for the global cgroup_rstat_lock in commit fc29e04ae1ad ("cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints"). This is for the per CPU lock cgroup_rstat_cpu_lock.
Based on production workloads, we observe the fast-path "update" function cgroup_rstat_updated() is invoked around 3 million times per sec, while the "flush" function cgroup_rstat_flush_locked(), walking each possible CPU, can see periodic spikes of 700 invocations/sec.
For this reason, the tracepoints are split into normal and fastpath versions for this per-CPU lock, making it feasible for production systems to continuously monitor the non-fastpath tracepoint to detect lock contention issues. The reason for monitoring is that the lock disables IRQs, which can disturb e.g. softirq processing on the local CPUs involved. When the global cgroup_rstat_lock stops disabling IRQs (e.g. converted to a mutex), this per-CPU lock becomes the next bottleneck that can introduce latency variations.
A practical bpftrace script for monitoring contention latency:
    bpftrace -e '
      tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
        @start[tid] = nsecs;
        @cnt[probe] = count();
      }
      tracepoint:cgroup:cgroup_rstat_cpu_locked {
        if (args->contended) {
          @wait_ns = hist(nsecs - @start[tid]);
          delete(@start[tid]);
        }
        @cnt[probe] = count();
      }
      interval:s:1 {
        time("%H:%M:%S ");
        print(@wait_ns);
        print(@cnt);
        clear(@cnt);
      }'
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.9-rc6, v6.9-rc5

97a46a66 | 17-Apr-2024 | Jesper Dangaard Brouer <[email protected]>
cgroup/rstat: desc member cgrp in cgroup_rstat_flush_release
A recent change to cgroup_rstat_flush_release() added a parameter cgrp, which is used by the tracepoint to correlate with other tracepoints that also have this cgrp.

The kernel test robot detected that the kernel-doc was missing a description of this member.
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

fc29e04a | 16-Apr-2024 | Jesper Dangaard Brouer <[email protected]>
cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
This commit enhances the ability to troubleshoot the global cgroup_rstat_lock by introducing wrapper helper functions for the lock along with associated tracepoints.
Although global, the cgroup_rstat_lock helper APIs and tracepoints take arguments such as cgroup pointer and cpu_in_loop variable. This adjustment is made because flushing occurs per cgroup despite the lock being global. Hence, when troubleshooting, it's important to identify the relevant cgroup. The cpu_in_loop variable is necessary because the global lock may be released within the main flushing loop that traverses CPUs. In the tracepoints, the cpu_in_loop value is set to -1 when acquiring the main lock; otherwise, it denotes the CPU number processed last.
The new feature in this patchset is detecting when the lock is contended. The tracepoints are implemented with production in mind. For minimum overhead, attach to cgroup:cgroup_rstat_lock_contended, which only gets activated when trylock detects that the lock is contended. A quick production check for issues can be done via this perf command:
    perf record -g -e cgroup:cgroup_rstat_lock_contended
The next natural question is how long lock contenders wait to obtain the lock. This can be answered by measuring the time between cgroup:cgroup_rstat_lock_contended and cgroup:cgroup_rstat_locked when args->contended is set, like this bpftrace script:
    bpftrace -e '
      tracepoint:cgroup:cgroup_rstat_lock_contended {
        @start[tid] = nsecs;
      }
      tracepoint:cgroup:cgroup_rstat_locked {
        if (args->contended) {
          @wait_ns = hist(nsecs - @start[tid]);
          delete(@start[tid]);
        }
      }
      interval:s:1 {
        time("%H:%M:%S ");
        print(@wait_ns);
      }'
Extending this with the time spent holding the lock is more expensive, as it also covers all the non-contended cases, like this bpftrace script:
    bpftrace -e '
      tracepoint:cgroup:cgroup_rstat_lock_contended {
        @start[tid] = nsecs;
      }
      tracepoint:cgroup:cgroup_rstat_locked {
        @locked[tid] = nsecs;
        if (args->contended) {
          @wait_ns = hist(nsecs - @start[tid]);
          delete(@start[tid]);
        }
      }
      tracepoint:cgroup:cgroup_rstat_unlock {
        @locked_ns = hist(nsecs - @locked[tid]);
        delete(@locked[tid]);
      }
      interval:s:1 {
        time("%H:%M:%S ");
        print(@wait_ns);
        print(@locked_ns);
      }'
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3

6f3189f3 | 29-Jan-2024 | Daniel Xu <[email protected]>
bpf: treewide: Annotate BPF kfuncs in BTF
This commit marks kfuncs as such inside the .BTF_ids section. The upshot of these annotations is that we'll be able to automatically generate kfunc prototypes for downstream users. The process is as follows:
1. In source, use the BTF_KFUNCS_START/END macro pair to mark kfuncs (see the sketch after this list)
2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for each function inside BTF_KFUNCS sets
3. At runtime, vmlinux or module BTF is made available in sysfs
4. At runtime, bpftool (or similar) can look at the provided BTF and generate appropriate prototypes for functions with the "bpf_kfunc" tag
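A hedged sketch of step 1, modeled on this file's rstat kfunc set:

    BTF_KFUNCS_START(bpf_rstat_kfunc_ids)
    BTF_ID_FLAGS(func, cgroup_rstat_updated)
    BTF_ID_FLAGS(func, cgroup_rstat_flush, KF_SLEEPABLE)
    BTF_KFUNCS_END(bpf_rstat_kfunc_ids)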
To ensure future kfuncs are similarly tagged, we now also return an error from kfunc registration for untagged kfuncs. For vmlinux kfuncs, we also WARN(), as the initcall machinery does not handle errors.
Signed-off-by: Daniel Xu <[email protected]>
Acked-by: Benjamin Tissoires <[email protected]>
Link: https://lore.kernel.org/r/e55150ceecbf0a5d961e608941165c0bee7bc943.1706491398.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <[email protected]>

Revision tags: v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4

d499fd41 | 30-Nov-2023 | Waiman Long <[email protected]>
cgroup/rstat: Optimize cgroup_rstat_updated_list()
The current design of cgroup_rstat_cpu_pop_updated() is to traverse the updated tree in a way that pops out the leaf nodes before their parents. This can cause traversal of multiple nodes before a leaf node can be found and popped out. IOW, a given node in the tree can be visited multiple times before the whole operation is done. So it is not very efficient and the code can be hard to read.

With the introduction of cgroup_rstat_updated_list() to build a list of cgroups to be flushed before any flushing operation is done, we can optimize the way the updated tree nodes are popped by pushing the parents to the tail end of the list before their children. In this way, most updated tree nodes will be visited only once, with the exception of the subtree root, as we still need to go back to its parent and pop it out of its updated_children list. This also makes the code easier to read.
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.7-rc3, v6.7-rc2, v6.7-rc1

e76d28bd | 04-Nov-2023 | Waiman Long <[email protected]>
cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
When cgroup_rstat_updated() isn't being called concurrently with cgroup_rstat_flush_locked(), its run time is pretty short. When both are called concurrently, the cgroup_rstat_updated() run time can spike to a pretty high value due to high cpu_lock hold time in cgroup_rstat_flush_locked(). This can be problematic if the task calling cgroup_rstat_updated() is a realtime task running on an isolated CPU with a strict latency requirement. The cgroup_rstat_updated() call can happen when there is a page fault even though the task is running in user space most of the time.
The percpu cpu_lock is used to protect the update tree - updated_next and updated_children. This protection is only needed when cgroup_rstat_cpu_pop_updated() is being called. The subsequent flushing operation which can take a much longer time does not need that protection as it is already protected by cgroup_rstat_lock.
To reduce the cpu_lock hold time, we need to perform all the cgroup_rstat_cpu_pop_updated() calls up front with the lock released afterward before doing any flushing. This patch adds a new cgroup_rstat_updated_list() function to return a singly linked list of cgroups to be flushed.
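A hedged sketch of the new function (close to the shape described above, not the verbatim kernel code; the lock is spelled as a plain raw spinlock for brevity):

    static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
    {
            raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
            struct cgroup *head = NULL, *pos = NULL;
            unsigned long flags;

            /* pop everything for this CPU in one short critical section */
            raw_spin_lock_irqsave(cpu_lock, flags);
            while ((pos = cgroup_rstat_cpu_pop_updated(pos, root, cpu))) {
                    pos->rstat_flush_next = head;   /* push onto singly linked list */
                    head = pos;
            }
            raw_spin_unlock_irqrestore(cpu_lock, flags);

            return head;    /* flushed later without holding cpu_lock */
    }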
Some instrumentation code is added to measure the cpu_lock hold time from right after lock acquisition to right after releasing the lock. A parallel kernel build on a 2-socket x86-64 server is used as the benchmark for measuring the lock hold time.
The maximum cpu_lock hold time before and after the patch are 100us and 29us respectively. So the worst case time is reduced to about 30% of the original. However, there may be some OS or hardware noises like NMI or SMI in the test system that can worsen the worst case value. Those noises are usually tuned out in a real production environment to get a better result.
OTOH, the lock hold time frequency distribution should give a better idea of the performance benefit of the patch. Below were the frequency distribution before and after the patch:
    Hold time    Before patch    After patch
    ---------    ------------    -----------
     0-01 us          804,139     13,738,708
    01-05 us        9,772,767      1,177,194
    05-10 us        4,595,028          4,984
    10-15 us          303,481          3,562
    15-20 us           78,971          1,314
    20-25 us           24,583             18
    25-30 us            6,908             12
    30-40 us            8,015
    40-50 us            2,192
    50-60 us              316
    60-70 us               43
    70-80 us                7
    80-90 us                2
      >90 us                3
Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

15fb6f2b | 31-Oct-2023 | Dave Marchevsky <[email protected]>
bpf: Add __bpf_hook_{start,end} macros
Not all uses of __diag_ignore_all(...) in BPF-related code to suppress warnings are wrapping kfunc definitions. Some "hook point" definitions - small functions meant to be used as attach points for fentry and similar BPF progs - need to suppress -Wmissing-declarations.
We could use __bpf_kfunc_{start,end}_defs added in the previous patch in such cases, but this might be confusing to someone unfamiliar with BPF internals. Instead, this patch adds __bpf_hook_{start,end} macros, currently having the same effect as __bpf_kfunc_{start,end}_defs, then uses them to suppress warnings for two hook points in the kernel itself and some bpf_testmod hook points as well.
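A hedged sketch of one such hook point, using the rstat flush hook in this file as the model:

    __bpf_hook_start();

    /* Empty attach point: BPF progs attached here run as rstat flush
     * callbacks; the kernel itself does nothing in this function.
     */
    __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
                                         struct cgroup *parent, int cpu)
    {
    }

    __bpf_hook_end();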
Signed-off-by: Dave Marchevsky <[email protected]>
Cc: Yafang Shao <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Acked-by: Yafang Shao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

Revision tags: v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6

0437719c | 07-Aug-2023 | Hao Jia <[email protected]>
cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants
The member variable bstat of the structure cgroup_rstat_cpu records the per-cpu time of the cgroup itself, but does not include the per-cpu time of its descendants. The per-cpu time including descendants is very useful for calculating the per-cpu usage of cgroups.
Although we can indirectly obtain the total per-cpu time of the cgroup and its descendants by accumulating the per-cpu bstat of each descendant, after a child cgroup is removed we lose its bstat information. This causes the cumulative value to be non-monotonic, thus affecting the accuracy of cgroup per-cpu usage.
So we add the subtree_bstat variable to record the total per-cpu time of this cgroup and its descendants, which is similar to "cpuacct.usage*" in cgroup v1. And this is also helpful for the migration from cgroup v1 to cgroup v2. After adding this variable, we can obtain the per-cpu time of cgroup and its descendants in user mode through eBPF/drgn, etc. And we are still trying to determine how to expose it in the cgroupfs interface.
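A hedged sketch of the per-CPU bookkeeping after this change (condensed from struct cgroup_rstat_cpu):

    struct cgroup_rstat_cpu {
            struct u64_stats_sync bsync;
            struct cgroup_base_stat bstat;        /* this cgroup only */
            struct cgroup_base_stat last_bstat;   /* last propagated snapshot */

            /* cumulative time of this cgroup plus all descendants; stays
             * monotonic even after a child cgroup is removed
             */
            struct cgroup_base_stat subtree_bstat;
            struct cgroup_base_stat last_subtree_bstat;

            struct cgroup *updated_children;      /* updated-tree links */
            struct cgroup *updated_next;
    };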
Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Hao Jia <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3

0a2dc6ac | 21-Apr-2023 | Yosry Ahmed <[email protected]>
cgroup: remove cgroup_rstat_flush_atomic()
Previous patches removed the only caller of cgroup_rstat_flush_atomic(). Remove the function and simplify the code.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Revision tags: v6.3-rc7, v6.3-rc6, v6.3-rc5

8bff9a04 | 30-Mar-2023 | Yosry Ahmed <[email protected]>
cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"
Patch series "memcg: avoid flushing stats atomically where possible", v3.
rstat flushing is an expensive operation that scales with the number of cpus and the number of cgroups in the system. The purpose of this series is to minimize the contexts where we flush stats atomically.
Patches 1 and 2 are cleanups requested during reviews of prior versions of this series.
Patch 3 makes sure we never try to flush from within an irq context.
Patches 4 to 7 introduce separate variants of mem_cgroup_flush_stats() for atomic and non-atomic flushing, and make sure we only flush the stats atomically when necessary.
Patch 8 is a slightly tangential optimization that limits the work done by rstat flushing in some scenarios.
This patch (of 8):
cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as "irqs are disabled throughout", which is what the current implementation does (currently under discussion [1]), but is not the intention. The intention is that this function is safe to call from atomic contexts. Name it as such.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Revision tags: v6.3-rc4, v6.3-rc3

fcdb1eda | 15-Mar-2023 | Josh Don <[email protected]>
cgroup: fix display of forceidle time at root
We need to reset forceidle_sum to 0 when reading from root, since the bstat we accumulate into is stack allocated.
To make this more robust, just replace the existing cputime reset with a memset of the overall bstat.
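A hedged sketch of the idea:

    struct cgroup_base_stat bstat;

    /* zero every field up front (cputime, forceidle_sum, ...) rather
     * than resetting only the cputime members
     */
    memset(&bstat, 0, sizeof(bstat));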
Signed-off-by: Josh Don <[email protected]>
Fixes: 1fcf54deb767 ("sched/core: add forced idle accounting for cgroups")
Cc: [email protected] # v6.0+
Signed-off-by: Tejun Heo <[email protected]>

Revision tags: v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7

400031e0 | 01-Feb-2023 | David Vernet <[email protected]>
bpf: Add __bpf_kfunc tag to all kfuncs
Now that we have the __bpf_kfunc tag, we should add it to all existing kfuncs to ensure that they'll never be elided in LTO builds.
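A hedged sketch, modeled on this file's kfuncs:

    /* The __bpf_kfunc tag keeps the definition from being elided in
     * LTO builds; body condensed to a comment.
     */
    __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
    {
            /* ... mark @cgrp as updated on @cpu ... */
    }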
Signed-off-by: David Vernet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

Revision tags: v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8

c62256dd | 30-Nov-2022 | Jens Axboe <[email protected]>
Revert "blk-cgroup: Flush stats at blkgs destruction path"
This reverts commit dae590a6c96c799434e0ff8156ef29b88c257e60.
We've had a few reports of this causing a crash at boot time, because of a reference issue. While this problem seemingly did exist before the patch and needs solving separately, this patch makes it a lot easier to trigger.
Link: https://lore.kernel.org/linux-block/CA+QYu4oxiRKC6hJ7F27whXy-PRBx=Tvb+-7TQTONN8qTtV3aDA@mail.gmail.com/
Link: https://lore.kernel.org/linux-block/[email protected]/
Signed-off-by: Jens Axboe <[email protected]>

Revision tags: v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4

dae590a6 | 05-Nov-2022 | Waiman Long <[email protected]>
blk-cgroup: Flush stats at blkgs destruction path
As noted by Michal, the blkg_iostat_set's in the lockless list hold reference to blkg's to protect against their removal. Those blkg's hold reference to blkcg. When a cgroup is being destroyed, cgroup_rstat_flush() is only called at css_release_work_fn() which is called when the blkcg reference count reaches 0. This circular dependency will prevent blkcg from being freed until some other events cause cgroup_rstat_flush() to be called to flush out the pending blkcg stats.
To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush() function to flush stats for a given css and cpu and call it at the blkgs destruction path, blkcg_destroy_blkgs(), whenever there are still some pending stats to be flushed. This will ensure that blkcg reference count can reach 0 ASAP.
Signed-off-by: Waiman Long <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

Revision tags: v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3

a319185b | 24-Aug-2022 | Yosry Ahmed <[email protected]>
cgroup: bpf: enable bpf programs to integrate with rstat
Enable bpf programs to make use of rstat to collect cgroup hierarchical stats efficiently:
- Add cgroup_rstat_updated() kfunc, for bpf progs that collect stats.
- Add cgroup_rstat_flush() sleepable kfunc, for bpf progs that read stats.
- Add an empty bpf_rstat_flush() hook that is called during rstat flushing, for bpf progs that flush stats to attach to. Attaching a bpf prog to this hook effectively registers it as a flush callback (a sketch follows this list).
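A hedged sketch of such a flusher prog (the SEC string and BPF_PROG usage mirror the BPF selftests, not something this commit itself adds; the prog name and body are illustrative):

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    /* Runs as a flush callback each time rstat flushes a (cgroup, cpu). */
    SEC("fentry/bpf_rstat_flush")
    int BPF_PROG(rstat_flusher, struct cgroup *cgrp, struct cgroup *parent, int cpu)
    {
            /* fold per-CPU deltas for cgrp into a map; propagate to parent */
            return 0;
    }

    char _license[] SEC("license") = "GPL";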
Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Hao Luo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

Revision tags: v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5

1fcf54de | 29-Jun-2022 | Josh Don <[email protected]>
sched/core: add forced idle accounting for cgroups
4feee7d1260 previously added per-task forced idle accounting. This patch extends this to also include cgroups.
rstat is used for cgroup accounting, except for the root, which uses kcpustat in order to bypass the need for doing an rstat flush when reading root stats.
Only cgroup v2 is supported. Similar to the task accounting, the cgroup accounting requires that schedstats is enabled.
Signed-off-by: Josh Don <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

Revision tags: v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1

b1e2c8df | 23-Mar-2022 | Sebastian Andrzej Siewior <[email protected]>
cgroup: use irqsave in cgroup_rstat_flush_locked().
All callers of cgroup_rstat_flush_locked() acquire cgroup_rstat_lock either with spin_lock_irq() or spin_lock_irqsave(). cgroup_rstat_flush_locked() itself acquires cgroup_rstat_cpu_lock which is a raw_spin_lock. This lock is also acquired in cgroup_rstat_updated() in IRQ context and therefore requires _irqsave() locking suffix in cgroup_rstat_flush_locked().
Since there is no difference between spin_lock_t and raw_spin_lock_t on !RT lockdep does not complain here. On RT lockdep complains because the interrupts were not disabled here and a deadlock is possible.
Acquire the raw_spin_lock_t with disabled interrupts.
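A hedged sketch of the change inside cgroup_rstat_flush_locked():

    raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
    unsigned long flags;

    /* cpu_lock is also taken from IRQ context in cgroup_rstat_updated(),
     * so interrupts must be disabled here; was a plain raw_spin_lock().
     */
    raw_spin_lock_irqsave(cpu_lock, flags);
    /* ... pop this CPU's updated tree ... */
    raw_spin_unlock_irqrestore(cpu_lock, flags);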
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Zefan Li <[email protected]>

From: Sebastian Andrzej Siewior <[email protected]>
Subject: cgroup: add a comment to cgroup_rstat_flush_locked()
Add a comment why spin_lock_irq() -> raw_spin_lock_irqsave() is needed.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Revision tags: v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1, v5.16

95b99f35 | 08-Jan-2022 | Wei Yang <[email protected]>
cgroup: rstat: retrieve current bstat to delta directly
Instead of retrieving the current bstat into cur and copying it to delta, use delta directly.
This saves one copy operation and has the same code convention as propagating delta to parent.
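A hedged sketch of the simplification inside cgroup_base_stat_flush():

    struct cgroup_base_stat delta;

    delta = rstatc->bstat;    /* was: cur = rstatc->bstat; delta = cur; */
    cgroup_base_stat_sub(&delta, &rstatc->last_bstat);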
Signed-off-by: Wei Yang <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>