|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1 |
|
| #
a22b3d54 |
| 30-Mar-2025 |
Waiman Long <[email protected]> |
cgroup/cpuset: Fix race between newly created partition and dying one
There is a possible race between removing a cgroup diectory that is a partition root and the creation of a new partition. The p
cgroup/cpuset: Fix race between newly created partition and dying one
There is a possible race between removing a cgroup diectory that is a partition root and the creation of a new partition. The partition to be removed can be dying but still online, it doesn't not currently participate in checking for exclusive CPUs conflict, but the exclusive CPUs are still there in subpartitions_cpus and isolated_cpus. These two cpumasks are global states that affect the operation of cpuset partitions. The exclusive CPUs in dying cpusets will only be removed when cpuset_css_offline() function is called after an RCU delay.
As a result, it is possible that a new partition can be created with exclusive CPUs that overlap with those of a dying one. When that dying partition is finally offlined, it removes those overlapping exclusive CPUs from subpartitions_cpus and maybe isolated_cpus resulting in an incorrect CPU configuration.
This bug was found when a warning was triggered in remote_partition_disable() during testing because the subpartitions_cpus mask was empty.
One possible way to fix this is to iterate the dying cpusets as well and avoid using the exclusive CPUs in those dying cpusets. However, this can still cause random partition creation failures or other anomalies due to racing. A better way to fix this race is to reset the partition state at the moment when a cpuset is being killed.
Introduce a new css_killed() CSS function pointer and call it, if defined, before setting CSS_DYING flag in kill_css(). Also update the css_is_dying() helper to use the CSS_DYING flag introduced by commit 33c35aa48178 ("cgroup: Prevent kill_css() from being called more than once") for proper synchronization.
Add a new cpuset_css_killed() function to reset the partition state of a valid partition root if it is being killed.
Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag") Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v6.14 |
|
| #
093c8812 |
| 19-Mar-2025 |
Yosry Ahmed <[email protected]> |
cgroup: rstat: Cleanup flushing functions and locking
Now that the rstat lock is being re-acquired on every CPU iteration in cgroup_rstat_flush_locked(), having the initially acquire the lock is unn
cgroup: rstat: Cleanup flushing functions and locking
Now that the rstat lock is being re-acquired on every CPU iteration in cgroup_rstat_flush_locked(), having the initially acquire the lock is unnecessary and unclear.
Inline cgroup_rstat_flush_locked() into cgroup_rstat_flush() and move the lock/unlock calls to the beginning and ending of the loop body to make the critical section obvious.
cgroup_rstat_flush_hold/release() do not make much sense with the lock being dropped and reacquired internally. Since it has no external callers, remove it and explicitly acquire the lock in cgroup_base_stat_cputime_show() instead.
This leaves the code with a single flushing function, cgroup_rstat_flush().
Signed-off-by: Yosry Ahmed <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc7 |
|
| #
4a893bdc |
| 11-Mar-2025 |
Michal Koutný <[email protected]> |
blk-cgroup: Simplify policy files registration
Use one set of files when there is no difference between default and legacy files, similar to regular subsys files registration. No functional change.
blk-cgroup: Simplify policy files registration
Use one set of files when there is no difference between default and legacy files, similar to regular subsys files registration. No functional change.
Signed-off-by: Michal Koutný <[email protected]> Acked-by: Jens Axboe <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2 |
|
| #
c6f53ed8 |
| 29-Jul-2024 |
David Finkel <[email protected]> |
mm, memcg: cg2 memory{.swap,}.peak write handlers
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.
This patch (of 2):
Other mechanisms for querying the peak memory usage of e
mm, memcg: cg2 memory{.swap,}.peak write handlers
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.
This patch (of 2):
Other mechanisms for querying the peak memory usage of either a process or v1 memory cgroup allow for resetting the high watermark. Restore parity with those mechanisms, but with a less racy API.
For example: - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets the high watermark. - writing "5" to the clear_refs pseudo-file in a processes's proc directory resets the peak RSS.
This change is an evolution of a previous patch, which mostly copied the cgroup v1 behavior, however, there were concerns about races/ownership issues with a global reset, so instead this change makes the reset filedescriptor-local.
Writing any non-empty string to the memory.peak and memory.swap.peak pseudo-files reset the high watermark to the current usage for subsequent reads through that same FD.
Notably, following Johannes's suggestion, this implementation moves the O(FDs that have written) behavior onto the FD write(2) path. Instead, on the page-allocation path, we simply add one additional watermark to conditionally bump per-hierarchy level in the page-counter.
Additionally, this takes Longman's suggestion of nesting the page-charging-path checks for the two watermarks to reduce the number of common-case comparisons.
This behavior is particularly useful for work scheduling systems that need to track memory usage of worker processes/cgroups per-work-item. Since memory can't be squeezed like CPU can (the OOM-killer has opinions), these systems need to track the peak memory usage to compute system/container fullness when binpacking workitems.
Most notably, Vimeo's use-case involves a system that's doing global binpacking across many Kubernetes pods/containers, and while we can use PSI for some local decisions about overload, we strive to avoid packing workloads too tightly in the first place. To facilitate this, we track the peak memory usage. However, since we run with long-lived workers (to amortize startup costs) we need a way to track the high watermark while a work-item is executing. Polling runs the risk of missing short spikes that last for timescales below the polling interval, and peak memory tracking at the cgroup level is otherwise perfect for this use-case.
As this data is used to ensure that binpacked work ends up with sufficient headroom, this use-case mostly avoids the inaccuracies surrounding reclaimable memory.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Finkel <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Suggested-by: Waiman Long <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Acked-by: Tejun Heo <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2 |
|
| #
7f36688f |
| 28-May-2024 |
Yury Norov <[email protected]> |
cpumask: cleanup core headers inclusion
Many core headers include cpumask.h for nothing. Drop it.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yury No
cpumask: cleanup core headers inclusion
Many core headers include cpumask.h for nothing. Drop it.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yury Norov <[email protected]> Cc: Amit Daniel Kachhap <[email protected]> Cc: Anna-Maria Behnsen <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Kees Cook <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ulf Hansson <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Viresh Kumar <[email protected]> Cc: Yury Norov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
4f9c7ca8 |
| 18-Jun-2024 |
Tejun Heo <[email protected]> |
sched: Factor out cgroup weight conversion functions
Factor out sched_weight_from/to_cgroup() which convert between scheduler shares and cgroup weight. No functional change. The factored out functio
sched: Factor out cgroup weight conversion functions
Factor out sched_weight_from/to_cgroup() which convert between scheduler shares and cgroup weight. No functional change. The factored out functions will be used by a new BPF extensible sched_class so that the weights can be exposed to the BPF programs in a way which is consistent cgroup weights and easier to interpret.
The weight conversions will be used regardless of cgroup usage. It's just borrowing the cgroup weight range as it's more intuitive. CGROUP_WEIGHT_MIN/DFL/MAX constants are moved outside CONFIG_CGROUPS so that the conversion helpers can always be defined.
v2: The helpers are now defined regardless of COFNIG_CGROUPS.
Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: David Vernet <[email protected]> Acked-by: Josh Don <[email protected]> Acked-by: Hao Luo <[email protected]> Acked-by: Barret Rhoden <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5 |
|
| #
fc29e04a |
| 16-Apr-2024 |
Jesper Dangaard Brouer <[email protected]> |
cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
This commit enhances the ability to troubleshoot the global cgroup_rstat_lock by introducing wrapper helper functions for the lock along w
cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
This commit enhances the ability to troubleshoot the global cgroup_rstat_lock by introducing wrapper helper functions for the lock along with associated tracepoints.
Although global, the cgroup_rstat_lock helper APIs and tracepoints take arguments such as cgroup pointer and cpu_in_loop variable. This adjustment is made because flushing occurs per cgroup despite the lock being global. Hence, when troubleshooting, it's important to identify the relevant cgroup. The cpu_in_loop variable is necessary because the global lock may be released within the main flushing loop that traverses CPUs. In the tracepoints, the cpu_in_loop value is set to -1 when acquiring the main lock; otherwise, it denotes the CPU number processed last.
The new feature in this patchset is detecting when lock is contended. The tracepoints are implemented with production in mind. For minimum overhead attach to cgroup:cgroup_rstat_lock_contended, which only gets activated when trylock detects lock is contended. A quick production check for issues could be done via this perf commands:
perf record -g -e cgroup:cgroup_rstat_lock_contended
Next natural question would be asking how long time do lock contenders wait for obtaining the lock. This can be answered by measuring the time between cgroup:cgroup_rstat_lock_contended and cgroup:cgroup_rstat_locked when args->contended is set. Like this bpftrace script:
bpftrace -e ' tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs} tracepoint:cgroup:cgroup_rstat_locked { if (args->contended) { @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}} interval:s:1 {time("%H:%M:%S "); print(@wait_ns); }'
Extending with time spend holding the lock will be more expensive as this also looks at all the non-contended cases. Like this bpftrace script:
bpftrace -e ' tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs} tracepoint:cgroup:cgroup_rstat_locked { @locked[tid]=nsecs; if (args->contended) { @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}} tracepoint:cgroup:cgroup_rstat_unlock { @locked_ns=hist(nsecs-@locked[tid]); delete(@locked[tid]);} interval:s:1 {time("%H:%M:%S "); print(@wait_ns);print(@locked_ns); }'
Signed-off-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6 |
|
| #
aecd408b |
| 29-Oct-2023 |
Yafang Shao <[email protected]> |
cgroup: Add a new helper for cgroup1 hierarchy
A new helper is added for cgroup1 hierarchy:
- task_get_cgroup1 Acquires the associated cgroup of a task within a specific cgroup1 hierarchy. The
cgroup: Add a new helper for cgroup1 hierarchy
A new helper is added for cgroup1 hierarchy:
- task_get_cgroup1 Acquires the associated cgroup of a task within a specific cgroup1 hierarchy. The cgroup1 hierarchy is identified by its hierarchy ID.
This helper function is added to facilitate the tracing of tasks within a particular container or cgroup dir in BPF programs. It's important to note that this helper is designed specifically for cgroup1 only.
tj: Use irsqsave/restore as suggested by Hou Tao <[email protected]>.
Suggested-by: Tejun Heo <[email protected]> Signed-off-by: Yafang Shao <[email protected]> Cc: Hou Tao <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc7 |
|
| #
6da88306 |
| 18-Oct-2023 |
Chuyi Zhou <[email protected]> |
cgroup: Prepare for using css_task_iter_*() in BPF
This patch makes some preparations for using css_task_iter_*() in BPF Program.
1. Flags CSS_TASK_ITER_* are #define-s and it's not easy for bpf pr
cgroup: Prepare for using css_task_iter_*() in BPF
This patch makes some preparations for using css_task_iter_*() in BPF Program.
1. Flags CSS_TASK_ITER_* are #define-s and it's not easy for bpf prog to use them. Convert them to enum so bpf prog can take them from vmlinux.h.
2. In the next patch we will add css_task_iter_*() in common kfuncs which is not safe. Since css_task_iter_*() does spin_unlock_irq() which might screw up irq flags depending on the context where bpf prog is running. So we should use irqsave/irqrestore here and the switching is harmless.
Suggested-by: Alexei Starovoitov <[email protected]> Signed-off-by: Chuyi Zhou <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6 |
|
| #
d16b3af4 |
| 10-Jun-2023 |
Miaohe Lin <[email protected]> |
cgroup: remove unused task_cgroup_path()
task_cgroup_path() is not used anymore. So remove it.
Signed-off-by: Miaohe Lin <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
|
|
Revision tags: v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3 |
|
| #
0a2dc6ac |
| 21-Apr-2023 |
Yosry Ahmed <[email protected]> |
cgroup: remove cgroup_rstat_flush_atomic()
Previous patches removed the only caller of cgroup_rstat_flush_atomic(). Remove the function and simplify the code.
Link: https://lkml.kernel.org/r/20230
cgroup: remove cgroup_rstat_flush_atomic()
Previous patches removed the only caller of cgroup_rstat_flush_atomic(). Remove the function and simplify the code.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yosry Ahmed <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc7, v6.3-rc6, v6.3-rc5 |
|
| #
8bff9a04 |
| 30-Mar-2023 |
Yosry Ahmed <[email protected]> |
cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"
Patch series "memcg: avoid flushing stats atomically where possible", v3.
rstat flushing is an expensive operation that scales with the numbe
cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"
Patch series "memcg: avoid flushing stats atomically where possible", v3.
rstat flushing is an expensive operation that scales with the number of cpus and the number of cgroups in the system. The purpose of this series is to minimize the contexts where we flush stats atomically.
Patches 1 and 2 are cleanups requested during reviews of prior versions of this series.
Patch 3 makes sure we never try to flush from within an irq context.
Patches 4 to 7 introduce separate variants of mem_cgroup_flush_stats() for atomic and non-atomic flushing, and make sure we only flush the stats atomically when necessary.
Patch 8 is a slightly tangential optimization that limits the work done by rstat flushing in some scenarios.
This patch (of 8):
cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as "irqs are disabled throughout", which is what the current implementation does (currently under discussion [1]), but is not the intention. The intention is that this function is safe to call from atomic contexts. Name it as such.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yosry Ahmed <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vasily Averin <[email protected]> Cc: Zefan Li <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1 |
|
| #
4a7ba45b |
| 08-Dec-2022 |
Tejun Heo <[email protected]> |
memcg: fix possible use-after-free in memcg_write_event_control()
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface
memcg: fix possible use-after-free in memcg_write_event_control()
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too.
Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it allowing any file to slip through. With the invarients broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-free's.
Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type.
Link: https://lkml.kernel.org/r/Y5FRm/[email protected] Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft") Signed-off-by: Tejun Heo <[email protected]> Reported-by: Jann Horn <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: <[email protected]> [3.14+] Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
fbf83212 |
| 08-Dec-2022 |
Tejun Heo <[email protected]> |
memcg: Fix possible use-after-free in memcg_write_event_control()
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface
memcg: Fix possible use-after-free in memcg_write_event_control()
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too.
Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it allowing any file to slip through. With the invarients broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-free's.
Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type.
Signed-off-by: Tejun Heo <[email protected]> Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft") Cc: [email protected] # v3.14+ Reported-by: Jann Horn <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Roman Gushchin <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v6.1-rc8 |
|
| #
c62256dd |
| 30-Nov-2022 |
Jens Axboe <[email protected]> |
Revert "blk-cgroup: Flush stats at blkgs destruction path"
This reverts commit dae590a6c96c799434e0ff8156ef29b88c257e60.
We've had a few reports on this causing a crash at boot time, because of a r
Revert "blk-cgroup: Flush stats at blkgs destruction path"
This reverts commit dae590a6c96c799434e0ff8156ef29b88c257e60.
We've had a few reports on this causing a crash at boot time, because of a reference issue. While this problem seemginly did exist before the patch and needs solving separately, this patch makes it a lot easier to trigger.
Link: https://lore.kernel.org/linux-block/CA+QYu4oxiRKC6hJ7F27whXy-PRBx=Tvb+-7TQTONN8qTtV3aDA@mail.gmail.com/ Link: https://lore.kernel.org/linux-block/[email protected]/ Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4 |
|
| #
dae590a6 |
| 05-Nov-2022 |
Waiman Long <[email protected]> |
blk-cgroup: Flush stats at blkgs destruction path
As noted by Michal, the blkg_iostat_set's in the lockless list hold reference to blkg's to protect against their removal. Those blkg's hold referenc
blk-cgroup: Flush stats at blkgs destruction path
As noted by Michal, the blkg_iostat_set's in the lockless list hold reference to blkg's to protect against their removal. Those blkg's hold reference to blkcg. When a cgroup is being destroyed, cgroup_rstat_flush() is only called at css_release_work_fn() which is called when the blkcg reference count reaches 0. This circular dependency will prevent blkcg from being freed until some other events cause cgroup_rstat_flush() to be called to flush out the pending blkcg stats.
To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush() function to flush stats for a given css and cpu and call it at the blkgs destruction path, blkcg_destroy_blkgs(), whenever there are still some pending stats to be flushed. This will ensure that blkcg reference count can reach 0 ASAP.
Signed-off-by: Waiman Long <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
79a7f41f |
| 31-Oct-2022 |
Tejun Heo <[email protected]> |
cgroup: cgroup refcnt functions should be exported when CONFIG_DEBUG_CGROUP_REF
6ab428604f72 ("cgroup: Implement DEBUG_CGROUP_REF") added a config option which forces cgroup refcnt functions to be n
cgroup: cgroup refcnt functions should be exported when CONFIG_DEBUG_CGROUP_REF
6ab428604f72 ("cgroup: Implement DEBUG_CGROUP_REF") added a config option which forces cgroup refcnt functions to be not inlined so that they can be kprobed for debugging. However, it forgot export them when the config is enabled breaking modules which make use of css reference counting.
Fix it by adding CGROUP_REF_EXPORT() macro to cgroup_refcnt.h which is defined to EXPORT_SYMBOL_GPL when CONFIG_DEBUG_CGROUP_REF is set.
Signed-off-by: Tejun Heo <[email protected]> Fixes: 6ab428604f72 ("cgroup: Implement DEBUG_CGROUP_REF")
show more ...
|
|
Revision tags: v6.1-rc3 |
|
| #
6ab42860 |
| 28-Oct-2022 |
Tejun Heo <[email protected]> |
cgroup: Implement DEBUG_CGROUP_REF
It's really difficult to debug when cgroup or css refs leak. Let's add a debug option to force the refcnt function to not be inlined so that they can be kprobed fo
cgroup: Implement DEBUG_CGROUP_REF
It's really difficult to debug when cgroup or css refs leak. Let's add a debug option to force the refcnt function to not be inlined so that they can be kprobed for debugging.
Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v6.1-rc2, v6.1-rc1 |
|
| #
a6d1ce59 |
| 11-Oct-2022 |
Yosry Ahmed <[email protected]> |
cgroup: add cgroup_v1v2_get_from_[fd/file]()
Add cgroup_v1v2_get_from_fd() and cgroup_v1v2_get_from_file() that support both cgroup1 and cgroup2.
Signed-off-by: Yosry Ahmed <[email protected]>
cgroup: add cgroup_v1v2_get_from_[fd/file]()
Add cgroup_v1v2_get_from_fd() and cgroup_v1v2_get_from_file() that support both cgroup1 and cgroup2.
Signed-off-by: Yosry Ahmed <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v6.0, v6.0-rc7, v6.0-rc6 |
|
| #
354ed597 |
| 18-Sep-2022 |
Yu Zhao <[email protected]> |
mm: multi-gen LRU: kill switch
Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that can be disabled include: 0x0001: the multi-gen LRU core 0x0002: walking page table, when arch_
mm: multi-gen LRU: kill switch
Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that can be disabled include: 0x0001: the multi-gen LRU core 0x0002: walking page table, when arch_has_hw_pte_young() returns true 0x0004: clearing the accessed bit in non-leaf PMD entries, when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y [yYnN]: apply to all the components above E.g., echo y >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0007 echo 5 >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0005
NB: the page table walks happen on the scale of seconds under heavy memory pressure, in which case the mmap_lock contention is a lesser concern, compared with the LRU lock contention and the I/O congestion. So far the only well-known case of the mmap_lock contention happens on Android, due to Scudo [1] which allocates several thousand VMAs for merely a few hundred MBs. The SPF and the Maple Tree also have provided their own assessments [2][3]. However, if walking page tables does worsen the mmap_lock contention, the kill switch can be used to disable it. In this case the multi-gen LRU will suffer a minor performance degradation, as shown previously.
Clearing the accessed bit in non-leaf PMD entries can also be disabled, since this behavior was not tested on x86 varieties other than Intel and AMD.
[1] https://source.android.com/devices/tech/debug/scudo [2] https://lore.kernel.org/r/[email protected]/ [3] https://lore.kernel.org/r/[email protected]/
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yu Zhao <[email protected]> Acked-by: Brian Geffon <[email protected]> Acked-by: Jan Alexander Steffens (heftig) <[email protected]> Acked-by: Oleksandr Natalenko <[email protected]> Acked-by: Steven Barrett <[email protected]> Acked-by: Suleiman Souhlal <[email protected]> Tested-by: Daniel Byrne <[email protected]> Tested-by: Donald Carr <[email protected]> Tested-by: Holger Hoffstätte <[email protected]> Tested-by: Konstantin Kharlamov <[email protected]> Tested-by: Shuang Zhai <[email protected]> Tested-by: Sofia Trinh <[email protected]> Tested-by: Vaibhav Jain <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Barry Song <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Michael Larabel <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.0-rc5, v6.0-rc4, v6.0-rc3 |
|
| #
57899a66 |
| 25-Aug-2022 |
Chengming Zhou <[email protected]> |
sched/psi: Consolidate cgroup_psi()
cgroup_psi() can't return psi_group for root cgroup, so we have many open code "psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi".
This patch move cgroup_ps
sched/psi: Consolidate cgroup_psi()
cgroup_psi() can't return psi_group for root cgroup, so we have many open code "psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi".
This patch move cgroup_psi() definition to <linux/psi.h>, in which we can return psi_system for root cgroup, so can handle all cgroups.
Signed-off-by: Chengming Zhou <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
e2691f6b |
| 28-Aug-2022 |
Tejun Heo <[email protected]> |
cgroup: Implement cgroup_file_show()
Add cgroup_file_show() which allows toggling visibility of a cgroup file using the new kernfs_show(). This will be used to hide psi interface files on cgroups wh
cgroup: Implement cgroup_file_show()
Add cgroup_file_show() which allows toggling visibility of a cgroup file using the new kernfs_show(). This will be used to hide psi interface files on cgroups where it's disabled.
Cc: Chengming Zhou <[email protected]> Cc: Johannes Weiner <[email protected]> Tested-by: Chengming Zhou <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
show more ...
|
| #
fa7e439c |
| 26-Aug-2022 |
Michal Koutný <[email protected]> |
cgroup: Homogenize cgroup_get_from_id() return value
Cgroup id is user provided datum hence extend its return domain to include possible error reason (similar to cgroup_get_from_fd()).
This change
cgroup: Homogenize cgroup_get_from_id() return value
Cgroup id is user provided datum hence extend its return domain to include possible error reason (similar to cgroup_get_from_fd()).
This change also fixes commit d4ccaf58a847 ("bpf: Introduce cgroup iter") that would use NULL instead of proper error handling in d4ccaf58a847 ("bpf: Introduce cgroup iter").
Additionally, neither of: fc_appid_store, bpf_iter_attach_cgroup, mem_cgroup_get_from_ino (callers of cgroup_get_from_fd) is built without CONFIG_CGROUPS (depends via CONFIG_BLK_CGROUP, direct, transitive CONFIG_MEMCG respectively) transitive, so drop the singular definition not needed with !CONFIG_CGROUPS.
Fixes: d4ccaf58a847 ("bpf: Introduce cgroup iter") Signed-off-by: Michal Koutný <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v6.0-rc2, v6.0-rc1 |
|
| #
d7ae5818 |
| 06-Aug-2022 |
Hao Jia <[email protected]> |
sched/psi: Remove redundant cgroup_psi() when !CONFIG_CGROUPS
cgroup_psi() is only called under CONFIG_CGROUPS. We don't need cgroup_psi() when !CONFIG_CGROUPS, so we can remove it in this case.
Si
sched/psi: Remove redundant cgroup_psi() when !CONFIG_CGROUPS
cgroup_psi() is only called under CONFIG_CGROUPS. We don't need cgroup_psi() when !CONFIG_CGROUPS, so we can remove it in this case.
Signed-off-by: Hao Jia <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v5.19 |
|
| #
7f203bc8 |
| 29-Jul-2022 |
Tejun Heo <[email protected]> |
cgroup: Replace cgroup->ancestor_ids[] with ->ancestors[]
Every cgroup knows all its ancestors through its ->ancestor_ids[]. There's no advantage to remembering the IDs instead of the pointers direc
cgroup: Replace cgroup->ancestor_ids[] with ->ancestors[]
Every cgroup knows all its ancestors through its ->ancestor_ids[]. There's no advantage to remembering the IDs instead of the pointers directly and this makes the array useless for finding an actual ancestor cgroup forcing cgroup_ancestor() to iteratively walk up the hierarchy instead. Let's replace cgroup->ancestor_ids[] with ->ancestors[] and remove the walking-up from cgroup_ancestor().
While at it, improve comments around cgroup_root->cgrp_ancestor_storage.
This patch shouldn't cause user-visible behavior differences.
v2: Update cgroup_ancestor() to use ->ancestors[].
v3: cgroup_root->cgrp_ancestor_storage's type is updated to match cgroup->ancestors[]. Better comments.
Signed-off-by: Tejun Heo <[email protected]> Acked-by: Namhyung Kim <[email protected]>
show more ...
|