|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3 |
|
| #
1bf67c8f |
| 16-Apr-2025 |
T.J. Mercier <[email protected]> |
cgroup/cpuset-v1: Add missing support for cpuset_v2_mode
Android has mounted the v1 cpuset controller using filesystem type "cpuset" (not "cgroup") since 2015 [1], and depends on the resulting behav
cgroup/cpuset-v1: Add missing support for cpuset_v2_mode
Android has mounted the v1 cpuset controller using filesystem type "cpuset" (not "cgroup") since 2015 [1], and depends on the resulting behavior where the controller name is not added as a prefix for cgroupfs files. [2]
Later, a problem was discovered where cpu hotplug onlining did not affect the cpuset/cpus files, which Android carried an out-of-tree patch to address for a while. An attempt was made to upstream this patch, but the recommendation was to use the "cpuset_v2_mode" mount option instead. [3]
An effort was made to do so, but this fails with "cgroup: Unknown parameter 'cpuset_v2_mode'" because commit e1cba4b85daa ("cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup") did not update the special cased cpuset_mount(), and only the cgroup (v1) filesystem type was updated.
Add parameter parsing to the cpuset filesystem type so that cpuset_v2_mode works like the cgroup filesystem type:
$ mkdir /dev/cpuset $ mount -t cpuset -ocpuset_v2_mode none /dev/cpuset $ mount|grep cpuset none on /dev/cpuset type cgroup (rw,relatime,cpuset,noprefix,cpuset_v2_mode,release_agent=/sbin/cpuset_release_agent)
[1] https://cs.android.com/android/_/android/platform/system/core/+/b769c8d24fd7be96f8968aa4c80b669525b930d3 [2] https://cs.android.com/android/platform/superproject/main/+/main:system/core/libprocessgroup/setup/cgroup_map_write.cpp;drc=2dac5d89a0f024a2d0cc46a80ba4ee13472f1681;l=192 [3] https://lore.kernel.org/lkml/[email protected]/T/
Fixes: e1cba4b85daa ("cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup") Signed-off-by: T.J. Mercier <[email protected]> Acked-by: Waiman Long <[email protected]> Reviewed-by: Kamalesh Babulal <[email protected]> Acked-by: Michal Koutný <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
| #
87c259a7 |
| 17-Apr-2025 |
gaoxu <[email protected]> |
cgroup: Fix compilation issue due to cgroup_mutex not being exported
When adding folio_memcg function call in the zram module for Android16-6.12, the following error occurs during compilation: ERROR
cgroup: Fix compilation issue due to cgroup_mutex not being exported
When adding folio_memcg function call in the zram module for Android16-6.12, the following error occurs during compilation: ERROR: modpost: "cgroup_mutex" [../soc-repo/zram.ko] undefined!
This error is caused by the indirect call to lockdep_is_held(&cgroup_mutex) within folio_memcg. The export setting for cgroup_mutex is controlled by the CONFIG_PROVE_RCU macro. If CONFIG_LOCKDEP is enabled while CONFIG_PROVE_RCU is not, this compilation error will occur.
To resolve this issue, add a parallel macro CONFIG_LOCKDEP control to ensure cgroup_mutex is properly exported when needed.
Signed-off-by: gao xu <[email protected]> Acked-by: Michal Koutný <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v6.15-rc2, v6.15-rc1 |
|
| #
8fa7292f |
| 05-Apr-2025 |
Thomas Gleixner <[email protected]> |
treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines.
Conversion was done with c
treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
show more ...
|
| #
a22b3d54 |
| 30-Mar-2025 |
Waiman Long <[email protected]> |
cgroup/cpuset: Fix race between newly created partition and dying one
There is a possible race between removing a cgroup diectory that is a partition root and the creation of a new partition. The p
cgroup/cpuset: Fix race between newly created partition and dying one
There is a possible race between removing a cgroup diectory that is a partition root and the creation of a new partition. The partition to be removed can be dying but still online, it doesn't not currently participate in checking for exclusive CPUs conflict, but the exclusive CPUs are still there in subpartitions_cpus and isolated_cpus. These two cpumasks are global states that affect the operation of cpuset partitions. The exclusive CPUs in dying cpusets will only be removed when cpuset_css_offline() function is called after an RCU delay.
As a result, it is possible that a new partition can be created with exclusive CPUs that overlap with those of a dying one. When that dying partition is finally offlined, it removes those overlapping exclusive CPUs from subpartitions_cpus and maybe isolated_cpus resulting in an incorrect CPU configuration.
This bug was found when a warning was triggered in remote_partition_disable() during testing because the subpartitions_cpus mask was empty.
One possible way to fix this is to iterate the dying cpusets as well and avoid using the exclusive CPUs in those dying cpusets. However, this can still cause random partition creation failures or other anomalies due to racing. A better way to fix this race is to reset the partition state at the moment when a cpuset is being killed.
Introduce a new css_killed() CSS function pointer and call it, if defined, before setting CSS_DYING flag in kill_css(). Also update the css_is_dying() helper to use the CSS_DYING flag introduced by commit 33c35aa48178 ("cgroup: Prevent kill_css() from being called more than once") for proper synchronization.
Add a new cpuset_css_killed() function to reset the partition state of a valid partition root if it is being killed.
Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag") Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v6.14, v6.14-rc7 |
|
| #
4a893bdc |
| 11-Mar-2025 |
Michal Koutný <[email protected]> |
blk-cgroup: Simplify policy files registration
Use one set of files when there is no difference between default and legacy files, similar to regular subsys files registration. No functional change.
blk-cgroup: Simplify policy files registration
Use one set of files when there is no difference between default and legacy files, similar to regular subsys files registration. No functional change.
Signed-off-by: Michal Koutný <[email protected]> Acked-by: Jens Axboe <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
| #
a0ab1453 |
| 11-Mar-2025 |
Michal Koutný <[email protected]> |
cgroup: Print message when /proc/cgroups is read on v2-only system
As a followup to commits 6c2920926b10e ("cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation") and ab031252
cgroup: Print message when /proc/cgroups is read on v2-only system
As a followup to commits 6c2920926b10e ("cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation") and ab03125268679 ("cgroup: Show # of subsystem CSSes in cgroup.stat"), add a runtime message to users who read status of controllers in /proc/cgroups on v2-only system. The detection is based on a) no controllers are attached to v1, b) default hierarchy is mounted (the latter is for setups that never mount v2 but read /proc/cgroups upon boot when controllers default to v2, so that this code may be backported to older kernels).
Signed-off-by: Michal Koutný <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3 |
|
| #
63348894 |
| 13-Feb-2025 |
Sebastian Andrzej Siewior <[email protected]> |
kernfs: Use RCU to access kernfs_node::parent.
kernfs_rename_lock is used to obtain stable kernfs_node::{name|parent} pointer. This is a preparation to access kernfs_node::parent under RCU and ensur
kernfs: Use RCU to access kernfs_node::parent.
kernfs_rename_lock is used to obtain stable kernfs_node::{name|parent} pointer. This is a preparation to access kernfs_node::parent under RCU and ensure that the pointer remains stable under the RCU lifetime guarantees.
For a complete path, as it is done in kernfs_path_from_node(), the kernfs_rename_lock is still required in order to obtain a stable parent relationship while computing the relevant node depth. This must not change while the nodes are inspected in order to build the path. If the kernfs user never moves the nodes (changes the parent) then the kernfs_rename_lock is not required and the RCU guarantees are sufficient. This "restriction" can be set with KERNFS_ROOT_INVARIANT_PARENT. Otherwise the lock is required.
Rename kernfs_node::parent to kernfs_node::__parent to denote the RCU access and use RCU accessor while accessing the node. Make cgroup use KERNFS_ROOT_INVARIANT_PARENT since the parent here can not change.
Acked-by: Tejun Heo <[email protected]> Cc: Yonghong Song <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc2, v6.14-rc1 |
|
| #
b69bb476 |
| 31-Jan-2025 |
Shakeel Butt <[email protected]> |
cgroup: fix race between fork and cgroup.kill
Tejun reported the following race between fork() and cgroup.kill at [1].
Tejun: I was looking at cgroup.kill implementation and wondering whether the
cgroup: fix race between fork and cgroup.kill
Tejun reported the following race between fork() and cgroup.kill at [1].
Tejun: I was looking at cgroup.kill implementation and wondering whether there could be a race window. So, __cgroup_kill() does the following:
k1. Set CGRP_KILL. k2. Iterate tasks and deliver SIGKILL. k3. Clear CGRP_KILL.
The copy_process() does the following:
c1. Copy a bunch of stuff. c2. Grab siglock. c3. Check fatal_signal_pending(). c4. Commit to forking. c5. Release siglock. c6. Call cgroup_post_fork() which puts the task on the css_set and tests CGRP_KILL.
The intention seems to be that either a forking task gets SIGKILL and terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I don't see what guarantees that k3 can't happen before c6. ie. After a forking task passes c5, k2 can take place and then before the forking task reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child. What am I missing?
This is indeed a race. One way to fix this race is by taking cgroup_threadgroup_rwsem in write mode in __cgroup_kill() as the fork() side takes cgroup_threadgroup_rwsem in read mode from cgroup_can_fork() to cgroup_post_fork(). However that would be heavy handed as this adds one more potential stall scenario for cgroup.kill which is usually called under extreme situation like memory pressure.
To fix this race, let's maintain a sequence number per cgroup which gets incremented on __cgroup_kill() call. On the fork() side, the cgroup_can_fork() will cache the sequence number locally and recheck it against the cgroup's sequence number at cgroup_post_fork() site. If the sequence numbers mismatch, it means __cgroup_kill() can been called and we should send SIGKILL to the newly created task.
Reported-by: Tejun Heo <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ [1] Fixes: 661ee6280931 ("cgroup: introduce cgroup.kill") Cc: [email protected] # v5.14+ Signed-off-by: Shakeel Butt <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v6.13, v6.13-rc7 |
|
| #
4a6780a3 |
| 12-Jan-2025 |
Haorui He <[email protected]> |
cgroup: update comment about dropping cgroup kn refs
the cgroup is actually freed in css_free_rwork_fn() now the ref count of the cgroup's kernfs_node is also dropped there so we need to update the
cgroup: update comment about dropping cgroup kn refs
the cgroup is actually freed in css_free_rwork_fn() now the ref count of the cgroup's kernfs_node is also dropped there so we need to update the corresponding comment in cgroup_mkdir()
Signed-off-by: Haorui He <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1 |
|
| #
34ab26fb |
| 25-Nov-2024 |
Christian Brauner <[email protected]> |
cgroup: avoid pointless cred reference count bump
of->file->f_cred already holds a reference count that is stable during the operation.
Link: https://lore.kernel.org/r/20241125-work-cred-v2-24-68b9
cgroup: avoid pointless cred reference count bump
of->file->f_cred already holds a reference count that is stable during the operation.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
51c0bcf0 |
| 25-Nov-2024 |
Christian Brauner <[email protected]> |
tree-wide: s/revert_creds_light()/revert_creds()/g
Rename all calls to revert_creds_light() back to revert_creds().
Link: https://lore.kernel.org/r/[email protected] R
tree-wide: s/revert_creds_light()/revert_creds()/g
Rename all calls to revert_creds_light() back to revert_creds().
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
6771e004 |
| 25-Nov-2024 |
Christian Brauner <[email protected]> |
tree-wide: s/override_creds_light()/override_creds()/g
Rename all calls to override_creds_light() back to overrid_creds().
Link: https://lore.kernel.org/r/20241125-work-cred-v2-5-68b9d38bb5b2@kerne
tree-wide: s/override_creds_light()/override_creds()/g
Rename all calls to override_creds_light() back to overrid_creds().
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
f905e009 |
| 25-Nov-2024 |
Christian Brauner <[email protected]> |
tree-wide: s/revert_creds()/put_cred(revert_creds_light())/g
Convert all calls to revert_creds() over to explicitly dropping reference counts in preparation for converting revert_creds() to revert_c
tree-wide: s/revert_creds()/put_cred(revert_creds_light())/g
Convert all calls to revert_creds() over to explicitly dropping reference counts in preparation for converting revert_creds() to revert_creds_light() semantics.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
0a670e15 |
| 25-Nov-2024 |
Christian Brauner <[email protected]> |
tree-wide: s/override_creds()/override_creds_light(get_new_cred())/g
Convert all callers from override_creds() to override_creds_light(get_new_cred()) in preparation of making override_creds() not t
tree-wide: s/override_creds()/override_creds_light(get_new_cred())/g
Convert all callers from override_creds() to override_creds_light(get_new_cred()) in preparation of making override_creds() not take a separate reference at all.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2 |
|
| #
457a6549 |
| 02-Jun-2024 |
Al Viro <[email protected]> |
css_set_fork(): switch to CLASS(fd_raw, ...)
reference acquired there by fget_raw() is not stashed anywhere - we could as well borrow instead.
Reviewed-by: Christian Brauner <[email protected]> Si
css_set_fork(): switch to CLASS(fd_raw, ...)
reference acquired there by fget_raw() is not stashed anywhere - we could as well borrow instead.
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
| #
04818199 |
| 01-Jun-2024 |
Al Viro <[email protected]> |
fdget_raw() users: switch to CLASS(fd_raw)
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
|
| #
2190df6c |
| 18-Oct-2024 |
Chen Ridong <[email protected]> |
cgroup/bpf: only cgroup v2 can be attached by bpf programs
Only cgroup v2 can be attached by bpf programs, so this patch introduces that cgroup_bpf_inherit and cgroup_bpf_offline can only be called
cgroup/bpf: only cgroup v2 can be attached by bpf programs
Only cgroup v2 can be attached by bpf programs, so this patch introduces that cgroup_bpf_inherit and cgroup_bpf_offline can only be called in cgroup v2, and this can fix the memleak mentioned by commit 04f8ef5643bc ("cgroup: Fix memory leak caused by missing cgroup_bpf_offline"), which has been reverted.
Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path") Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") Link: https://lore.kernel.org/cgroups/aka2hk5jsel5zomucpwlxsej6iwnfw4qu5jkrmjhyfhesjlfdw@46zxhg5bdnr7/ Signed-off-by: Chen Ridong <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
| #
feb301c6 |
| 18-Oct-2024 |
Chen Ridong <[email protected]> |
Revert "cgroup: Fix memory leak caused by missing cgroup_bpf_offline"
This reverts commit 04f8ef5643bcd8bcde25dfdebef998aea480b2ba.
Only cgroup v2 can be attached by cgroup by BPF programs. Revert
Revert "cgroup: Fix memory leak caused by missing cgroup_bpf_offline"
This reverts commit 04f8ef5643bcd8bcde25dfdebef998aea480b2ba.
Only cgroup v2 can be attached by cgroup by BPF programs. Revert this commit and cgroup_bpf_inherit and cgroup_bpf_offline won't be called in cgroup v1. The memory leak issue will be fixed with next patch.
Fixes: 04f8ef5643bc ("cgroup: Fix memory leak caused by missing cgroup_bpf_offline") Link: https://lore.kernel.org/cgroups/aka2hk5jsel5zomucpwlxsej6iwnfw4qu5jkrmjhyfhesjlfdw@46zxhg5bdnr7/ Signed-off-by: Chen Ridong <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
| #
3cc4e13b |
| 12-Oct-2024 |
Xiu Jianfeng <[email protected]> |
cgroup: Fix potential overflow issue when checking max_depth
cgroup.max.depth is the maximum allowed descent depth below the current cgroup. If the actual descent depth is equal or larger, an attemp
cgroup: Fix potential overflow issue when checking max_depth
cgroup.max.depth is the maximum allowed descent depth below the current cgroup. If the actual descent depth is equal or larger, an attempt to create a new child cgroup will fail. However due to the cgroup->max_depth is of int type and having the default value INT_MAX, the condition 'level > cgroup->max_depth' will never be satisfied, and it will cause an overflow of the level after it reaches to INT_MAX.
Fix it by starting the level from 0 and using '>=' instead.
It's worth mentioning that this issue is unlikely to occur in reality, as it's impossible to have a depth of INT_MAX hierarchy, but should be be avoided logically.
Fixes: 1a926e0bbab8 ("cgroup: implement hierarchy limits") Signed-off-by: Xiu Jianfeng <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
| #
659f90f8 |
| 09-Sep-2024 |
Michal Koutný <[email protected]> |
cgroup/cpuset: Expose cpuset filesystem with cpuset v1 only
The cpuset filesystem is a legacy interface to cpuset controller with (pre-)v1 features. It makes little sense to co-mount it on systems w
cgroup/cpuset: Expose cpuset filesystem with cpuset v1 only
The cpuset filesystem is a legacy interface to cpuset controller with (pre-)v1 features. It makes little sense to co-mount it on systems without cpuset v1, so do not build it when cpuset v1 is not built neither.
Signed-off-by: Michal Koutný <[email protected]> Reviewed-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
| #
0e40cf2a |
| 05-Sep-2024 |
Kinsey Ho <[email protected]> |
cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU
Patch series "Improve mem_cgroup_iter()", v4.
Incremental cgroup iteration is being used again [1]. This patchset improves th
cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU
Patch series "Improve mem_cgroup_iter()", v4.
Incremental cgroup iteration is being used again [1]. This patchset improves the reliability of mem_cgroup_iter(). It also improves simplicity and code readability.
[1] https://lore.kernel.org/[email protected]/
This patch (of 5):
Explicitly document that css sibling/descendant linkage is protected by cgroup_mutex or RCU. Also, document in css_next_descendant_pre() and similar functions that it isn't necessary to hold a ref on @pos.
The following changes in this patchset rely on this clarification for simplification in memcg iteration code.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Yosry Ahmed <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Signed-off-by: Kinsey Ho <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Zefan Li <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: T.J. Mercier <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
c6f53ed8 |
| 29-Jul-2024 |
David Finkel <[email protected]> |
mm, memcg: cg2 memory{.swap,}.peak write handlers
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.
This patch (of 2):
Other mechanisms for querying the peak memory usage of e
mm, memcg: cg2 memory{.swap,}.peak write handlers
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.
This patch (of 2):
Other mechanisms for querying the peak memory usage of either a process or v1 memory cgroup allow for resetting the high watermark. Restore parity with those mechanisms, but with a less racy API.
For example: - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets the high watermark. - writing "5" to the clear_refs pseudo-file in a processes's proc directory resets the peak RSS.
This change is an evolution of a previous patch, which mostly copied the cgroup v1 behavior, however, there were concerns about races/ownership issues with a global reset, so instead this change makes the reset filedescriptor-local.
Writing any non-empty string to the memory.peak and memory.swap.peak pseudo-files reset the high watermark to the current usage for subsequent reads through that same FD.
Notably, following Johannes's suggestion, this implementation moves the O(FDs that have written) behavior onto the FD write(2) path. Instead, on the page-allocation path, we simply add one additional watermark to conditionally bump per-hierarchy level in the page-counter.
Additionally, this takes Longman's suggestion of nesting the page-charging-path checks for the two watermarks to reduce the number of common-case comparisons.
This behavior is particularly useful for work scheduling systems that need to track memory usage of worker processes/cgroups per-work-item. Since memory can't be squeezed like CPU can (the OOM-killer has opinions), these systems need to track the peak memory usage to compute system/container fullness when binpacking workitems.
Most notably, Vimeo's use-case involves a system that's doing global binpacking across many Kubernetes pods/containers, and while we can use PSI for some local decisions about overload, we strive to avoid packing workloads too tightly in the first place. To facilitate this, we track the peak memory usage. However, since we run with long-lived workers (to amortize startup costs) we need a way to track the high watermark while a work-item is executing. Polling runs the risk of missing short spikes that last for timescales below the polling interval, and peak memory tracking at the cgroup level is otherwise perfect for this use-case.
As this data is used to ensure that binpacked work ends up with sufficient headroom, this use-case mostly avoids the inaccuracies surrounding reclaimable memory.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Finkel <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Suggested-by: Waiman Long <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Acked-by: Tejun Heo <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
d1a92d2d |
| 15-Aug-2024 |
Chen Ridong <[email protected]> |
cgroup: update some statememt about delegation
The comment in cgroup_file_write is missing some interfaces, such as 'cgroup.threads'. All delegatable files are listed in '/sys/kernel/cgroup/delegate
cgroup: update some statememt about delegation
The comment in cgroup_file_write is missing some interfaces, such as 'cgroup.threads'. All delegatable files are listed in '/sys/kernel/cgroup/delegate', so update the comment in cgroup_file_write. Besides, add a statement that files outside the namespace shouldn't be visible from inside the delegated namespace.
tj: Reflowed text for consistency.
Signed-off-by: Chen Ridong <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
| #
1da91ea8 |
| 31-May-2024 |
Al Viro <[email protected]> |
introduce fd_file(), convert all accessors to it.
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are ve
introduce fd_file(), convert all accessors to it.
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict]
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
| #
9b103943 |
| 09-Aug-2024 |
Waiman Long <[email protected]> |
cgroup: Fix incorrect WARN_ON_ONCE() in css_release_work_fn()
It turns out that the WARN_ON_ONCE() call in css_release_work_fn introduced by commit ab0312526867 ("cgroup: Show # of subsystem CSSes i
cgroup: Fix incorrect WARN_ON_ONCE() in css_release_work_fn()
It turns out that the WARN_ON_ONCE() call in css_release_work_fn introduced by commit ab0312526867 ("cgroup: Show # of subsystem CSSes in cgroup.stat") is incorrect. Although css->nr_descendants must be 0 when a css is released and ready to be freed, the corresponding cgrp->nr_dying_subsys[ss->id] may not be 0 if a subsystem is activated and deactivated multiple times with one or more of its previous activation leaving behind dying csses.
Fix the incorrect warning by removing the cgrp->nr_dying_subsys check.
Fixes: ab0312526867 ("cgroup: Show # of subsystem CSSes in cgroup.stat") Closes: https://lore.kernel.org/cgroups/[email protected]/T/#t Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|