|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1 |
|
| #
a22b3d54 |
| 30-Mar-2025 |
Waiman Long <[email protected]> |
cgroup/cpuset: Fix race between newly created partition and dying one
There is a possible race between removing a cgroup directory that is a partition root and the creation of a new partition. The partition to be removed can be dying but still online; it does not currently participate in checking for exclusive CPU conflicts, but its exclusive CPUs are still present in subpartitions_cpus and isolated_cpus. These two cpumasks are global states that affect the operation of cpuset partitions. The exclusive CPUs in dying cpusets will only be removed when the cpuset_css_offline() function is called after an RCU delay.
As a result, it is possible that a new partition can be created with exclusive CPUs that overlap with those of a dying one. When that dying partition is finally offlined, it removes those overlapping exclusive CPUs from subpartitions_cpus and maybe isolated_cpus resulting in an incorrect CPU configuration.
This bug was found when a warning was triggered in remote_partition_disable() during testing because the subpartitions_cpus mask was empty.
One possible way to fix this is to iterate the dying cpusets as well and avoid using the exclusive CPUs in those dying cpusets. However, this can still cause random partition creation failures or other anomalies due to racing. A better way to fix this race is to reset the partition state at the moment when a cpuset is being killed.
Introduce a new css_killed() CSS function pointer and call it, if defined, before setting the CSS_DYING flag in kill_css(). Also update the css_is_dying() helper to use the CSS_DYING flag introduced by commit 33c35aa48178 ("cgroup: Prevent kill_css() from being called more than once") for proper synchronization.
Add a new cpuset_css_killed() function to reset the partition state of a valid partition root if it is being killed.
Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag") Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
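A rough sketch of the shape of this hook, assuming the kernel's struct cgroup_subsys and kill_css() as described above (the cpuset callback comment is illustrative, not the exact upstream code):

    /* An optional per-subsystem callback, invoked when a css is killed. */
    struct cgroup_subsys {
            /* ... existing callbacks such as css_offline, css_free ... */
            void (*css_killed)(struct cgroup_subsys_state *css);
    };

    static void kill_css(struct cgroup_subsys_state *css)
    {
            /* Let the subsystem react before CSS_DYING becomes visible;
             * cpuset uses this to reset the partition state right away. */
            if (css->ss->css_killed)
                    css->ss->css_killed(css);

            set_bit(CSS_DYING, &css->flags); /* css_is_dying() now tests this bit */
            /* ... kill the percpu ref and schedule offlining as before ... */
    }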
|
|
Revision tags: v6.14, v6.14-rc7 |
|
| #
76f9409f |
| 11-Mar-2025 |
Michal Koutný <[email protected]> |
cgroup: Update file naming comment
This changed a long time ago in commit 8d7e6fb0a1db9 ("cgroup: update cgroup name handling").
Signed-off-by: Michal Koutný <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
|
|
Revision tags: v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1 |
|
| #
b69bb476 |
| 31-Jan-2025 |
Shakeel Butt <[email protected]> |
cgroup: fix race between fork and cgroup.kill
Tejun reported the following race between fork() and cgroup.kill at [1].
Tejun: I was looking at cgroup.kill implementation and wondering whether there could be a race window. So, __cgroup_kill() does the following:
k1. Set CGRP_KILL.
k2. Iterate tasks and deliver SIGKILL.
k3. Clear CGRP_KILL.
The copy_process() does the following:
c1. Copy a bunch of stuff.
c2. Grab siglock.
c3. Check fatal_signal_pending().
c4. Commit to forking.
c5. Release siglock.
c6. Call cgroup_post_fork() which puts the task on the css_set and tests CGRP_KILL.
The intention seems to be that either a forking task gets SIGKILL and terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I don't see what guarantees that k3 can't happen before c6. ie. After a forking task passes c5, k2 can take place and then before the forking task reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child. What am I missing?
This is indeed a race. One way to fix this race is by taking cgroup_threadgroup_rwsem in write mode in __cgroup_kill() as the fork() side takes cgroup_threadgroup_rwsem in read mode from cgroup_can_fork() to cgroup_post_fork(). However, that would be heavy-handed as it adds one more potential stall scenario for cgroup.kill, which is usually invoked under extreme situations like memory pressure.
To fix this race, let's maintain a sequence number per cgroup which gets incremented on each __cgroup_kill() call. On the fork() side, cgroup_can_fork() will cache the sequence number locally and recheck it against the cgroup's sequence number at the cgroup_post_fork() site. If the sequence numbers mismatch, it means __cgroup_kill() has been called and we should send SIGKILL to the newly created task.
Reported-by: Tejun Heo <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ [1] Fixes: 661ee6280931 ("cgroup: introduce cgroup.kill") Cc: [email protected] # v5.14+ Signed-off-by: Shakeel Butt <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
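A simplified sketch of the sequence-number scheme; field names and plumbing are assumptions for illustration (the upstream patch threads the value through the clone arguments and css_set), not the verbatim code:

    /* Per-cgroup kill generation, bumped by every __cgroup_kill(). */
    struct cgroup {
            /* ... */
            unsigned int kill_seq;
    };

    static int cgroup_can_fork(struct task_struct *child,
                               struct kernel_clone_args *kargs)
    {
            /* Cache the destination cgroup's kill generation; the kill
             * steps k1..k3 may run concurrently with fork steps c1..c6. */
            kargs->kill_seq = READ_ONCE(kargs->cgrp->kill_seq);
            return 0;
    }

    static void cgroup_post_fork(struct task_struct *child,
                                 struct kernel_clone_args *kargs)
    {
            struct cgroup *cgrp = kargs->cgrp;

            /* A mismatch means __cgroup_kill() ran in the window between
             * c5 and c6, so the child must be killed here. */
            if (test_bit(CGRP_KILL, &cgrp->flags) ||
                READ_ONCE(cgrp->kill_seq) != kargs->kill_seq)
                    do_send_sig_info(SIGKILL, SEND_SIG_NOINFO, child, PIDTYPE_TGID);
    }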
|
|
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5 |
|
| #
ee1251fc |
| 22-Oct-2024 |
Chen Ridong <[email protected]> |
cgroup/freezer: Reduce redundant traversal for cgroup_freeze
Whether a cgroup is frozen is determined solely by whether it is set to be frozen and whether its parent is frozen. Currently, when a cgroup is frozen or unfrozen, it iterates through the entire subtree to freeze or unfreeze its descendants. However, this is unnecessary for a cgroup that does not change its effective frozen status. This patch aims to skip the subtree if its parent does not have a change in effective frozen state.
For example, given a subtree a-b-c-d-e-f-g, when a is frozen, the entire tree is frozen. If we then freeze b and c again, it is unnecessary to iterate over d, e, f and g. The same holds if we unfreeze b or c.
Reviewed-by: Michal Koutný <[email protected]> Signed-off-by: Chen Ridong <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
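A sketch of the pruning idea using the kernel's pre-order descendant iterator; the effective_frozen_unchanged() helper is hypothetical, standing in for the actual state check:

    rcu_read_lock();
    css_for_each_descendant_pre(css, &cgrp->self) {
            struct cgroup *dsct = css->cgroup;

            /* If this cgroup's effective frozen state does not change,
             * none of its descendants can change either: prune the walk. */
            if (effective_frozen_unchanged(dsct, freeze)) { /* hypothetical */
                    css = css_rightmost_descendant(css);
                    continue;
            }
            cgroup_do_freeze(dsct, freeze);
    }
    rcu_read_unlock();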
|
|
Revision tags: v6.12-rc4, v6.12-rc3, v6.12-rc2 |
|
| #
aefa398d |
| 02-Oct-2024 |
Joshua Hahn <[email protected]> |
cgroup/rstat: Tracking cgroup-level niced CPU time
Cgroup-level CPU statistics currently include time spent on user/system processes, but do not include niced CPU time (despite it already being tracked). This patch exposes niced CPU time to userspace, allowing users to get a better understanding of their hardware limits and facilitating more informed workload distribution.
A new field 'ntime' is added to struct cgroup_base_stat as opposed to struct task_cputime to minimize footprint.
Signed-off-by: Joshua Hahn <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
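A sketch of where the new field sits, per the description above (surrounding fields abbreviated):

    struct cgroup_base_stat {
            struct task_cputime cputime;    /* user/system/sum exec runtime */
    #ifdef CONFIG_SCHED_CORE
            u64 forceidle_sum;
    #endif
            u64 ntime;  /* niced CPU time; kept here rather than in
                         * task_cputime to minimize the footprint */
    };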
|
|
Revision tags: v6.12-rc1, v6.11, v6.11-rc7 |
|
| #
0e40cf2a |
| 05-Sep-2024 |
Kinsey Ho <[email protected]> |
cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU
Patch series "Improve mem_cgroup_iter()", v4.
Incremental cgroup iteration is being used again [1]. This patchset improves the reliability of mem_cgroup_iter(). It also improves simplicity and code readability.
[1] https://lore.kernel.org/[email protected]/
This patch (of 5):
Explicitly document that css sibling/descendant linkage is protected by cgroup_mutex or RCU. Also, document in css_next_descendant_pre() and similar functions that it isn't necessary to hold a ref on @pos.
The following changes in this patchset rely on this clarification for simplification in memcg iteration code.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Yosry Ahmed <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Signed-off-by: Kinsey Ho <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Zefan Li <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: T.J. Mercier <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
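A sketch of what this protection rule means for callers; either locking context below suffices for the traversal, and no ref on each position is needed:

    /* Under RCU: sibling linkage stays stable for the read side. */
    rcu_read_lock();
    css_for_each_child(child, parent_css) {
            /* child may be mid-destruction; check css_is_dying() if needed */
    }
    rcu_read_unlock();

    /* Alternatively, holding cgroup_mutex pins the sibling lists outright. */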
|
|
Revision tags: v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2 |
|
| #
c6f53ed8 |
| 29-Jul-2024 |
David Finkel <[email protected]> |
mm, memcg: cg2 memory{.swap,}.peak write handlers
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.
This patch (of 2):
Other mechanisms for querying the peak memory usage of either a process or v1 memory cgroup allow for resetting the high watermark. Restore parity with those mechanisms, but with a less racy API.
For example:
- Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets the high watermark.
- Writing "5" to the clear_refs pseudo-file in a process's proc directory resets the peak RSS.
This change is an evolution of a previous patch, which mostly copied the cgroup v1 behavior; however, there were concerns about races/ownership issues with a global reset, so instead this change makes the reset file-descriptor-local.
Writing any non-empty string to the memory.peak and memory.swap.peak pseudo-files resets the high watermark to the current usage for subsequent reads through that same FD.
Notably, following Johannes's suggestion, this implementation moves the O(FDs that have written) behavior onto the FD write(2) path. Instead, on the page-allocation path, we simply add one additional watermark to conditionally bump per-hierarchy level in the page-counter.
Additionally, this takes Longman's suggestion of nesting the page-charging-path checks for the two watermarks to reduce the number of common-case comparisons.
This behavior is particularly useful for work scheduling systems that need to track memory usage of worker processes/cgroups per-work-item. Since memory can't be squeezed like CPU can (the OOM-killer has opinions), these systems need to track the peak memory usage to compute system/container fullness when binpacking workitems.
Most notably, Vimeo's use-case involves a system that's doing global binpacking across many Kubernetes pods/containers, and while we can use PSI for some local decisions about overload, we strive to avoid packing workloads too tightly in the first place. To facilitate this, we track the peak memory usage. However, since we run with long-lived workers (to amortize startup costs) we need a way to track the high watermark while a work-item is executing. Polling runs the risk of missing short spikes that last for timescales below the polling interval, and peak memory tracking at the cgroup level is otherwise perfect for this use-case.
As this data is used to ensure that binpacked work ends up with sufficient headroom, this use-case mostly avoids the inaccuracies surrounding reclaimable memory.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Finkel <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Suggested-by: Waiman Long <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Acked-by: Tejun Heo <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
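A user-space sketch of the FD-local reset semantics described above; the cgroup path is illustrative:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[64] = "";
            int fd = open("/sys/fs/cgroup/workers/memory.peak", O_RDWR);

            if (fd < 0)
                    return 1;

            /* Any non-empty write resets the watermark for this fd only. */
            if (write(fd, "reset\n", 6) < 0)
                    return 1;

            /* ... run one work item in the cgroup ... */

            /* Reads through the same fd report the peak since our reset. */
            if (pread(fd, buf, sizeof(buf) - 1, 0) > 0)
                    printf("peak since reset: %s", buf);
            close(fd);
            return 0;
    }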
|
|
Revision tags: v6.11-rc1 |
|
| #
ab031252 |
| 15-Jul-2024 |
Waiman Long <[email protected]> |
cgroup: Show # of subsystem CSSes in cgroup.stat
Cgroup subsystem state (CSS) is an abstraction in the cgroup layer to help manage different structures in various cgroup subsystems by being an embedded element inside a larger structure like cpuset or mem_cgroup.
The /proc/cgroups file shows the number of cgroups for each of the subsystems. With cgroup v1, the number of CSSes is the same as the number of cgroups. That is not the case anymore with cgroup v2. The /proc/cgroups file cannot show the actual number of CSSes for the subsystems that are bound to cgroup v2.
So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup), we can't tell by looking at /proc/cgroups which cgroup subsystems may be responsible.
As cgroup v2 had deprecated the use of /proc/cgroups, the hierarchical cgroup.stat file is now being extended to show the number of live and dying CSSes associated with all the non-inhibited cgroup subsystems that have been bound to cgroup v2. The number includes CSSes in the current cgroup as well as in all the descendants underneath it. This will help us pinpoint which subsystems are responsible for the increasing number of dying (nr_dying_descendants) cgroups.
The CSSes dying counts are stored in the cgroup structure itself instead of inside the CSS as suggested by Johannes. This will allow us to accurately track dying counts of cgroup subsystems that have recently been disabled in a cgroup. It is now possible that a zero subsystem number is coupled with a non-zero dying subsystem number.
The cgroup-v2.rst file is updated to discuss this new behavior.
With this patch applied, a sample output from the root cgroup.stat file is shown below.
nr_descendants 56
nr_subsys_cpuset 1
nr_subsys_cpu 43
nr_subsys_io 43
nr_subsys_memory 56
nr_subsys_perf_event 57
nr_subsys_hugetlb 1
nr_subsys_pids 56
nr_subsys_rdma 1
nr_subsys_misc 1
nr_dying_descendants 30
nr_dying_subsys_cpuset 0
nr_dying_subsys_cpu 0
nr_dying_subsys_io 0
nr_dying_subsys_memory 30
nr_dying_subsys_perf_event 0
nr_dying_subsys_hugetlb 0
nr_dying_subsys_pids 0
nr_dying_subsys_rdma 0
nr_dying_subsys_misc 0
Another sample output from system.slice/cgroup.stat was:
nr_descendants 34
nr_subsys_cpuset 0
nr_subsys_cpu 32
nr_subsys_io 32
nr_subsys_memory 34
nr_subsys_perf_event 35
nr_subsys_hugetlb 0
nr_subsys_pids 34
nr_subsys_rdma 0
nr_subsys_misc 0
nr_dying_descendants 30
nr_dying_subsys_cpuset 0
nr_dying_subsys_cpu 0
nr_dying_subsys_io 0
nr_dying_subsys_memory 30
nr_dying_subsys_perf_event 0
nr_dying_subsys_hugetlb 0
nr_dying_subsys_pids 0
nr_dying_subsys_rdma 0
nr_dying_subsys_misc 0
Note that the 'debug' controller wasn't used to provide this information because that controller is not recommended in production kernels, and many of them won't enable CONFIG_CGROUP_DEBUG by default.
Similar information could be retrieved with debuggers like drgn but that's also not always available (e.g. lockdown) and the additional cost of runtime tracking here is deemed marginal.
tj: Added Michal's paragraphs on why this is not added to the debug controller to the commit message.
Signed-off-by: Waiman Long <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Roman Gushchin <[email protected]> Reviewed-by: Kamalesh Babulal <[email protected]> Cc: Michal Koutný <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tejun Heo <[email protected]>
|
| #
89ed6c9a |
| 16-Jul-2024 |
Xiu Jianfeng <[email protected]> |
blk-cgroup: move congestion_count to struct blkcg
The congestion_count was introduced into the struct cgroup by commit d09d8df3a294 ("blkcg: add generic throttling mechanism"), but since it is closely related to the blkio subsys, it is not appropriate to put it in the struct cgroup, so let's move it to struct blkcg. There should be no functional changes because blkcg is per cgroup.
Signed-off-by: Xiu Jianfeng <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4 |
|
| #
47179fe0 |
| 12-Jun-2024 |
Xiu Jianfeng <[email protected]> |
mm/hugetlb_cgroup: prepare cftypes based on template
Unlike other cgroup subsystems, the hugetlb cgroup does not provide a static array of cftype that explicitly displays the properties, handling functions, etc., of each file. Instead, it dynamically creates the cftype attributes based on the hstate during the startup procedure. This reduces the readability of the code.
To fix this issue, introduce two templates of cftypes, and rebuild the attributes according to the hstate to make it ready to be added to cgroup framework.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Xiu Jianfeng <[email protected]> Cc: Muchun Song <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: kernel test robot <[email protected]>
From: Xiu Jianfeng <[email protected]>
Subject: mm/hugetlb_cgroup: register lockdep key for cftype
Date: Tue, 18 Jun 2024 07:19:22 +0000
When CONFIG_DEBUG_LOCK_ALLOC is enabled, the following commands can trigger a bug,
mount -t cgroup2 none /sys/fs/cgroup
cd /sys/fs/cgroup
echo "+hugetlb" > cgroup.subtree_control
The log is as below:
BUG: key ffff8880046d88d8 has not been registered!
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 3 PID: 226 at kernel/locking/lockdep.c:4945 lockdep_init_map_type+0x185/0x220
Modules linked in:
CPU: 3 PID: 226 Comm: bash Not tainted 6.10.0-rc4-next-20240617-g76db4c64526c #544
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:lockdep_init_map_type+0x185/0x220
Code: 00 85 c0 0f 84 6c ff ff ff 8b 3d 6a d1 85 01 85 ff 0f 85 5e ff ff ff 48 c7 c6 21 99 4a 82 48 c7 c7 60 29 49 82 e8 3b 2e f5
RSP: 0018:ffffc9000083fc30 EFLAGS: 00000282
RAX: 0000000000000000 RBX: ffffffff828dd820 RCX: 0000000000000027
RDX: ffff88803cd9cac8 RSI: 0000000000000001 RDI: ffff88803cd9cac0
RBP: ffff88800674fbb0 R08: ffffffff828ce248 R09: 00000000ffffefff
R10: ffffffff8285e260 R11: ffffffff828b8eb8 R12: ffff8880046d88d8
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880067281c0
FS:  00007f68601ea740(0000) GS:ffff88803cd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005614f3ebc740 CR3: 000000000773a000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 ? __warn+0x77/0xd0
 ? lockdep_init_map_type+0x185/0x220
 ? report_bug+0x189/0x1a0
 ? handle_bug+0x3c/0x70
 ? exc_invalid_op+0x18/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? lockdep_init_map_type+0x185/0x220
 __kernfs_create_file+0x79/0x100
 cgroup_addrm_files+0x163/0x380
 ? find_held_lock+0x2b/0x80
 ? find_held_lock+0x2b/0x80
 ? find_held_lock+0x2b/0x80
 css_populate_dir+0x73/0x180
 cgroup_apply_control_enable+0x12f/0x3a0
 cgroup_subtree_control_write+0x30b/0x440
 kernfs_fop_write_iter+0x13a/0x1f0
 vfs_write+0x341/0x450
 ksys_write+0x64/0xe0
 do_syscall_64+0x4b/0x110
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f68602d9833
Code: 8b 15 61 26 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 08
RSP: 002b:00007fff9bbdf8e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f68602d9833
RDX: 0000000000000009 RSI: 00005614f3ebc740 RDI: 0000000000000001
RBP: 00005614f3ebc740 R08: 000000000000000a R09: 0000000000000008
R10: 00005614f3db6ba0 R11: 0000000000000246 R12: 0000000000000009
R13: 00007f68603bd6a0 R14: 0000000000000009 R15: 00007f68603b8880
For lockdep, there is a sanity check in lockdep_init_map_type(): the lock-class key must either have been allocated statically or registered as a dynamic key. However, commit e18df2889ff9 ("mm/hugetlb_cgroup: prepare cftypes based on template") changed the cftypes from statically allocated objects to dynamically allocated objects, so cft->lockdep_key must be registered proactively.
[[email protected]: fix BUG()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/[email protected]/ Fixes: e18df2889ff9 ("mm/hugetlb_cgroup: prepare cftypes based on template") Signed-off-by: Xiu Jianfeng <[email protected]> Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-lkp/[email protected] Tested-by: Marek Szyprowski <[email protected]> Tested-by: SeongJae Park <[email protected]> Closes: https://lore.kernel.org/[email protected] Cc: Muchun Song <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
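The fix amounts to registering those dynamic keys before the cftypes are handed to the cgroup core; a sketch under the assumption that the template-built array is called hugetlb_files:

    #include <linux/lockdep.h>

    static void __init hugetlb_cgroup_cftypes_init(void)
    {
            struct cftype *cft;

            /* The cftypes are built at runtime, so each lockdep_key is a
             * dynamic object and must be registered before kernfs uses it. */
            for (cft = hugetlb_files; cft->name[0] != '\0'; cft++)
                    lockdep_register_key(&cft->lockdep_key);
    }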
|
|
Revision tags: v6.10-rc3, v6.10-rc2, v6.10-rc1 |
|
| #
73e75e6f |
| 21-May-2024 |
Michal Koutný <[email protected]> |
cgroup/pids: Separate semantics of pids.events related to pids.max
Currently, when pids.max limit is breached in the hierarchy, the event is counted and reported in the cgroup where the forking task resides.
This decouples the limit and the notification caused by the limit, making it hard to detect when the actual limit was hit.
Redefine the pids.events:max as: the number of times the limit of the cgroup was hit.
(The implementation also differentiates a "forkfail" event, but this is currently not exposed as it would better fit into pids.stat. It also differs from pids.events:max only when pids.max is configured on non-leaf cgroups.)
Since it changes semantics of the original "max" event, introduce this change only in the v2 API of the controller and add a cgroup2 mount option to revert to the legacy behavior.
Signed-off-by: Michal Koutný <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
|
|
Revision tags: v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5 |
|
| #
a7fb0423 |
| 07-Dec-2023 |
Waiman Long <[email protected]> |
cgroup: Move rcu_head up near the top of cgroup_root
Commit d23b5c577715 ("cgroup: Make operations on the cgroup root_list RCU safe") adds a new rcu_head to the cgroup_root structure and kvfree_rcu() for freeing the cgroup_root.
The current implementation of kvfree_rcu(), however, has the limitation that the offset of the rcu_head structure within the larger data structure must be less than 4096 or the compilation will fail. See the macro definition of __is_kvfree_rcu_offset() in include/linux/rcupdate.h for more information.
By putting the rcu_head below the large cgroup structure, any change to the cgroup structure that makes it larger runs the risk of causing a build failure under certain configurations. Commit 77070eeb8821 ("cgroup: Avoid false cacheline sharing of read mostly rstat_cpu") happens to be the last straw that breaks it. Fix this problem by moving the rcu_head structure up before the cgroup structure.
Fixes: d23b5c577715 ("cgroup: Make operations on the cgroup root_list RCU safe") Reported-by: Stephen Rothwell <[email protected]> Closes: https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Waiman Long <[email protected]> Acked-by: Yafang Shao <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
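A sketch of the reordering (fields abbreviated): the rcu_head now precedes the large embedded cgroup, so its offset stays below the kvfree_rcu() limit no matter how much struct cgroup grows:

    struct cgroup_root {
            struct kernfs_root *kf_root;
            unsigned int subsys_mask;
            int hierarchy_id;
            struct list_head root_list;
            struct rcu_head rcu;    /* moved up: offset must stay < 4096 */

            struct cgroup cgrp;     /* large embedded structure, grows over time */
            /* ... */
    };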
|
|
Revision tags: v6.7-rc4 |
|
| #
77070eeb |
| 30-Nov-2023 |
Waiman Long <[email protected]> |
cgroup: Avoid false cacheline sharing of read mostly rstat_cpu
The rstat_cpu and rstat_css_list members of the cgroup structure are read-mostly variables. However, they may share the same cacheline as the subsequent rstat_flush_next and *bstat variables, which can be updated frequently. That will slow down the cgroup_rstat_cpu() call, which is invoked pretty frequently in the rstat code. Add a CACHELINE_PADDING() line in between them to avoid false cacheline sharing.
A parallel kernel build on a 2-socket x86-64 server is used as the benchmarking tool for measuring the lock hold time. Below are the lock hold time frequency distributions before and after the patch:
Run time    Before patch    After patch
--------    ------------    -----------
 0-01 us       9,928,562      9,820,428
01-05 us         110,151         50,935
05-10 us             270             93
10-15 us             273            146
15-20 us             135             76
20-25 us               0              2
25-30 us               1              0
It can be seen that the patch further pushes the lock hold time towards the lower end.
Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
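A sketch of the layout change, with field names taken from the description above and surrounding members elided:

    struct cgroup {
            /* ... */
            /* read-mostly on the rstat hot path */
            struct cgroup_rstat_cpu __percpu *rstat_cpu;
            struct list_head rstat_css_list;

            /* keep the frequently written fields on a different cacheline */
            CACHELINE_PADDING(_pad_);

            struct cgroup *rstat_flush_next;
            struct cgroup_base_stat last_bstat;
            /* ... */
    };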
|
|
Revision tags: v6.7-rc3, v6.7-rc2, v6.7-rc1 |
|
| #
e76d28bd |
| 04-Nov-2023 |
Waiman Long <[email protected]> |
cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
When cgroup_rstat_updated() isn't being called concurrently with cgroup_rstat_flush_locked(), its run time is pretty short. When both are called concurrently, the cgroup_rstat_updated() run time can spike to a pretty high value due to high cpu_lock hold time in cgroup_rstat_flush_locked(). This can be problematic if the task calling cgroup_rstat_updated() is a realtime task running on an isolated CPU with a strict latency requirement. The cgroup_rstat_updated() call can happen when there is a page fault even though the task is running in user space most of the time.
The percpu cpu_lock is used to protect the update tree - updated_next and updated_children. This protection is only needed when cgroup_rstat_cpu_pop_updated() is being called. The subsequent flushing operation which can take a much longer time does not need that protection as it is already protected by cgroup_rstat_lock.
To reduce the cpu_lock hold time, we need to perform all the cgroup_rstat_cpu_pop_updated() calls up front with the lock released afterward before doing any flushing. This patch adds a new cgroup_rstat_updated_list() function to return a singly linked list of cgroups to be flushed.
Some instrumentation code is added to measure the cpu_lock hold time from right after lock acquisition to after releasing the lock. A parallel kernel build on a 2-socket x86-64 server is used as the benchmarking tool for measuring the lock hold time.
The maximum cpu_lock hold time before and after the patch are 100us and 29us respectively. So the worst case time is reduced to about 30% of the original. However, there may be some OS or hardware noises like NMI or SMI in the test system that can worsen the worst case value. Those noises are usually tuned out in a real production environment to get a better result.
OTOH, the lock hold time frequency distribution should give a better idea of the performance benefit of the patch. Below are the frequency distributions before and after the patch:
Hold time    Before patch    After patch
---------    ------------    -----------
 0-01 us         804,139     13,738,708
01-05 us       9,772,767      1,177,194
05-10 us       4,595,028          4,984
10-15 us         303,481          3,562
15-20 us          78,971          1,314
20-25 us          24,583             18
25-30 us           6,908             12
30-40 us           8,015
40-50 us           2,192
50-60 us             316
60-70 us              43
70-80 us               7
80-90 us               2
  >90 us               3
Signed-off-by: Waiman Long <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
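A simplified sketch of the resulting flush loop: the percpu cpu_lock is held only inside cgroup_rstat_updated_list() while the updated tree is detached, and the actual flushing walks the detached list afterwards (the real code also handles bpf flushing and lock yielding, omitted here):

    static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
    {
            int cpu;

            lockdep_assert_held(&cgroup_rstat_lock);

            for_each_possible_cpu(cpu) {
                    /* pops all updated cgroups under cpu_lock, links them
                     * via rstat_flush_next, then drops cpu_lock */
                    struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);

                    /* flush with only cgroup_rstat_lock held */
                    for (; pos; pos = pos->rstat_flush_next)
                            cgroup_base_stat_flush(pos, cpu);
            }
    }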
|
|
Revision tags: v6.6 |
|
| #
d23b5c57 |
| 29-Oct-2023 |
Yafang Shao <[email protected]> |
cgroup: Make operations on the cgroup root_list RCU safe
At present, when we perform operations on the cgroup root_list, we must hold the cgroup_mutex, which is a relatively heavyweight lock. In reality, we can make operations on this list RCU-safe, eliminating the need to hold the cgroup_mutex during traversal. Modifications to the list only occur in the cgroup root setup and destroy paths, which should be infrequent in a production environment. In contrast, traversal may occur frequently. Therefore, making it RCU-safe would be beneficial.
Signed-off-by: Yafang Shao <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
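A simplified sketch of the reader side after this change, versus the writer side:

    struct cgroup_root *root;

    /* Readers: no cgroup_mutex needed any more. */
    rcu_read_lock();
    list_for_each_entry_rcu(root, &cgroup_roots, root_list) {
            /* inspect the root; skip ones being torn down */
    }
    rcu_read_unlock();

    /* Writers (setup/destroy paths, still under cgroup_mutex):
     *   list_add_rcu(&root->root_list, &cgroup_roots);
     *   list_del_rcu(&root->root_list);
     *   ...and free the root only after a grace period, e.g. kvfree_rcu(). */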
|
|
Revision tags: v6.6-rc7, v6.6-rc6, v6.6-rc5 |
|
| #
8cba9576 |
| 06-Oct-2023 |
Nhat Pham <[email protected]> |
hugetlb: memcg: account hugetlb-backed memory in memory controller
Currently, hugetlb memory usage is not accounted for in the memory controller, which could lead to memory overprotection for cgroups with hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G containers. The machine is booted with hugetlb_cma=6G, and each container may or may not use up to 3 gigantic pages, depending on the workload within it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. But it is very difficult to configure memory.max to keep overall consumption, including anon, cache, slab, etc., fair.
What we have had to resort to is constantly polling hugetlb usage and readjusting memory.max. A similar procedure is done for other memory limits (memory.low, for example). However, this is rather cumbersome and buggy. Furthermore, when there is a delay in memory limit correction (for example, when hugetlb usage changes within consecutive runs of the userspace agent), the system could be in an over/underprotected state.
This patch rectifies this issue by charging the memcg when the hugetlb folio is utilized, and uncharging when the folio is freed (analogous to the hugetlb controller). Note that we do not charge when the folio is allocated to the hugetlb pool, because at this point it is not owned by any memcg.
Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory controller. As stated above, hugetlb folios are only charged towards the memory controller when it is used. Host overcommit management has to consider it when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could happen even if the hugetlb pool still has pages (but the cgroup limit is hit and reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory reclaim protection. low, min limits tuning must take into account hugetlb memory.
* Hugetlb pages utilized while this option is not selected will not be tracked by the memory controller (even if cgroup v2 is remounted later on).
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nhat Pham <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Frank van der Linden <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Tejun heo <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
13cc9ee8 |
| 12-Oct-2023 |
Waiman Long <[email protected]> |
cgroup: Fix incorrect css_set_rwsem reference in comment
Since commit f0d9a5f17575 ("cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock"), css_set_rwsem has been replaced by css_set_lock. That commit, however, missed the css_set_rwsem reference in include/linux/cgroup-defs.h. Fix that by changing it to css_set_lock as well.
Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
|
|
Revision tags: v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6 |
|
| #
0437719c |
| 07-Aug-2023 |
Hao Jia <[email protected]> |
cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants
The member variable bstat of the structure cgroup_rstat_cpu records the per-cpu time of the cgroup itself, but does not include the per-cpu time of its descendants. The per-cpu time including descendants is very useful for calculating the per-cpu usage of cgroups.
Although we can indirectly obtain the total per-cpu time of the cgroup and its descendants by accumulating the per-cpu bstat of each descendant of the cgroup, after a child cgroup is removed we will lose its bstat information. This will cause the cumulative value to be non-monotonic, thus affecting the accuracy of cgroup per-cpu usage.
So we add the subtree_bstat variable to record the total per-cpu time of this cgroup and its descendants, which is similar to "cpuacct.usage*" in cgroup v1. And this is also helpful for the migration from cgroup v1 to cgroup v2. After adding this variable, we can obtain the per-cpu time of cgroup and its descendants in user mode through eBPF/drgn, etc. And we are still trying to determine how to expose it in the cgroupfs interface.
Suggested-by: Tejun Heo <[email protected]> Signed-off-by: Hao Jia <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
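A sketch of where the new per-cpu counters live (fields abbreviated; the snapshot field follows the same delta-propagation pattern the existing bstat uses):

    struct cgroup_rstat_cpu {
            /* ... */
            struct cgroup_base_stat bstat;          /* this cgroup alone */
            struct cgroup_base_stat subtree_bstat;  /* cgroup + descendants,
                                                     * cumulative; survives
                                                     * child removal */
            struct cgroup_base_stat last_subtree_bstat; /* snapshot for deltas */
            /* ... */
    };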
|
|
Revision tags: v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4 |
|
| #
677ea015 |
| 20-Jun-2023 |
Josh Don <[email protected]> |
sched: add throttled time stat for throttled children
We currently export the total throttled time for cgroups that are given a bandwidth limit. This patch extends this accounting to also account the total time that each child cgroup has been throttled.
This is useful to understand the degree to which children have been affected by the throttling control. Children which are not runnable during the entire throttled period, for example, will not show any self-throttling time during this period.
Expose this in a new interface, 'cpu.stat.local', which is similar to how non-hierarchical events are accounted in 'memory.events.local'.
Signed-off-by: Josh Don <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
|
Revision tags: v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3 |
|
| #
c4bcfb38 |
| 26-Oct-2022 |
Yonghong Song <[email protected]> |
bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
Similar to sk/inode/task storage, implement similar cgroup local storage.
There already exists a local storage implementation for cgroup-attached bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper bpf_get_local_storage(). But there are use cases where non-cgroup-attached bpf progs want to access cgroup local storage data. For example, a tc egress prog has access to sk and cgroup. It is possible to use sk local storage to emulate cgroup local storage by storing data in the socket, but this is wasteful as there could be lots of sockets belonging to a particular cgroup. Alternatively, a separate map can be created with the cgroup id as the key, but this will introduce additional overhead to manipulate the new map. A cgroup local storage, similar to the existing sk/inode/task storage, should help for this use case.
The life-cycle of storage is managed with the life-cycle of the cgroup struct. i.e. the storage is destroyed along with the owning cgroup with a call to bpf_cgrp_storage_free() when cgroup itself is deleted.
The userspace map operations can be done by using a cgroup fd as a key passed to the lookup, update and delete operations.
Typically, the following code is used to get the current cgroup:

    struct task_struct *task = bpf_get_current_task_btf();
    ... task->cgroups->dfl_cgrp ...

and in the task_struct definition:

    struct task_struct {
        ...
        struct css_set __rcu *cgroups;
        ...
    }

With a sleepable program, accessing task->cgroups is not protected by rcu_read_lock. So the current implementation only supports non-sleepable programs; supporting sleepable programs will be the next step, together with adding rcu_read_lock protection for RCU-tagged structures.
Since the map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for the old cgroup local storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used for cgroup storage available to non-cgroup-attached bpf programs. The old cgroup storage supports the bpf_get_local_storage() helper to get the cgroup data; the new cgroup storage helper bpf_cgrp_storage_get() can provide similar functionality. While the old cgroup storage pre-allocates storage memory, the new mechanism can also pre-allocate with a user-space bpf_map_update_elem() call to avoid potential run-time memory allocation failure. Therefore, the new cgroup storage can provide all functionality w.r.t. the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is aliased to BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate that the old cgroup storage can be deprecated, since the new one can provide the same functionality.
Acked-by: David Vernet <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
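A sketch of BPF-side usage of the new map type and helper; the attach point and map layout are illustrative, not prescribed by the patch:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    struct {
            __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
            __uint(map_flags, BPF_F_NO_PREALLOC);
            __type(key, int);
            __type(value, long);
    } cgrp_events SEC(".maps");

    SEC("tp_btf/sched_switch")
    int BPF_PROG(count_per_cgroup)
    {
            struct task_struct *task = bpf_get_current_task_btf();
            long *val;

            /* Storage is keyed by the task's default (v2) cgroup. */
            val = bpf_cgrp_storage_get(&cgrp_events, task->cgroups->dfl_cgrp, 0,
                                       BPF_LOCAL_STORAGE_GET_F_CREATE);
            if (val)
                    __sync_fetch_and_add(val, 1);
            return 0;
    }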
|
|
Revision tags: v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5 |
|
| #
34f26a15 |
| 07-Sep-2022 |
Chengming Zhou <[email protected]> |
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates them at each level of the hierarchy. This may cause non-negligible overhead for some workloads under a deep hierarchy.
Commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") made PSI skip per-cgroup stall accounting and only account system-wide, to avoid this per-level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for userspace adjustment on that cgroup, apart from the system-wide adjustment only.
So this patch introduces a per-cgroup PSI accounting disable/re-enable interface, "cgroup.pressure", which is a read-write single-value file whose allowed values are "0" and "1"; the default is "1", so per-cgroup PSI stats are enabled by default.
Implementation details:
It should be relatively straightforward to disable and re-enable state aggregation, time tracking, and averaging on a per-cgroup level, if we can live with losing history from while it was disabled. I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates, which is not implemented in this patch. So we always update groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <[email protected]> Suggested-by: Tejun Heo <[email protected]> Signed-off-by: Chengming Zhou <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
|
| #
8a693f77 |
| 06-Sep-2022 |
Tejun Heo <[email protected]> |
cgroup: Remove CFTYPE_PRESSURE
CFTYPE_PRESSURE is used to flag PSI related files so that they are not created if PSI is disabled during boot. It's a bit weird to use a generic flag to mark a specific file type. Let's instead move the PSI files into its own cftypes array and add/rm them conditionally. This is a bit more code but cleaner.
No userland visible changes.
Signed-off-by: Tejun Heo <[email protected]> Cc: Johannes Weiner <[email protected]>
|
| #
0083d27b |
| 06-Sep-2022 |
Tejun Heo <[email protected]> |
cgroup: Improve cftype add/rm error handling
Let's track whether a cftype is currently added or not using a new flag, __CFTYPE_ADDED, so that duplicate operations fail safely and empty cftype arrays are consistently allowed.
Signed-off-by: Tejun Heo <[email protected]>
|
|
Revision tags: v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19 |
|
| #
7f203bc8 |
| 29-Jul-2022 |
Tejun Heo <[email protected]> |
cgroup: Replace cgroup->ancestor_ids[] with ->ancestors[]
Every cgroup knows all its ancestors through its ->ancestor_ids[]. There's no advantage to remembering the IDs instead of the pointers directly, and the IDs make the array useless for finding an actual ancestor cgroup, forcing cgroup_ancestor() to iteratively walk up the hierarchy instead. Let's replace cgroup->ancestor_ids[] with ->ancestors[] and remove the walking-up from cgroup_ancestor().
While at it, improve comments around cgroup_root->cgrp_ancestor_storage.
This patch shouldn't cause user-visible behavior differences.
v2: Update cgroup_ancestor() to use ->ancestors[].
v3: cgroup_root->cgrp_ancestor_storage's type is updated to match cgroup->ancestors[]. Better comments.
Signed-off-by: Tejun Heo <[email protected]> Acked-by: Namhyung Kim <[email protected]>
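With the pointer array, the lookup reduces to a bounds check plus a direct index; a sketch consistent with the description above:

    static inline struct cgroup *cgroup_ancestor(struct cgroup *cgrp,
                                                 int ancestor_level)
    {
            /* A cgroup is its own ancestor at its own level; levels above
             * cgrp->level do not exist on this path. */
            if (ancestor_level < 0 || ancestor_level > cgrp->level)
                    return NULL;
            return cgrp->ancestors[ancestor_level];
    }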
|
|
Revision tags: v5.19-rc8 |
|
| #
6a010a49 |
| 23-Jul-2022 |
Tejun Heo <[email protected]> |
cgroup: Make !percpu threadgroup_rwsem operations optional
3942a9bd7b58 ("locking, rcu, cgroup: Avoid synchronize_sched() in __cgroup_procs_write()") disabled percpu operations on threadgroup_rwsem because the implied synchronize_rcu() on write locking was pushing up the latencies too much for Android, which constantly moves processes between cgroups.
This makes the hotter paths - fork and exit - slower as they're always forced into the slow path. There is no reason to force this on everyone, especially given that the more common static usage pattern can now completely avoid write-locking the rwsem. Write-locking is elided when turning controllers on and off on empty sub-trees, and CLONE_INTO_CGROUP enables seeding a cgroup without grabbing the rwsem.
Restore the default percpu operations and introduce the mount option "favordynmods" and config option CGROUP_FAVOR_DYNMODS for users who need lower latencies for the dynamic operations.
Signed-off-by: Tejun Heo <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: John Stultz <[email protected]> Cc: Dmitry Shmidt <[email protected]> Cc: Oleg Nesterov <[email protected]>
|