History log of /linux-6.15/include/linux/cgroup.h (Results 1 – 25 of 403)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1
# a22b3d54 30-Mar-2025 Waiman Long <[email protected]>

cgroup/cpuset: Fix race between newly created partition and dying one

There is a possible race between removing a cgroup diectory that is
a partition root and the creation of a new partition. The p

cgroup/cpuset: Fix race between newly created partition and dying one

There is a possible race between removing a cgroup diectory that is
a partition root and the creation of a new partition. The partition
to be removed can be dying but still online, it doesn't not currently
participate in checking for exclusive CPUs conflict, but the exclusive
CPUs are still there in subpartitions_cpus and isolated_cpus. These
two cpumasks are global states that affect the operation of cpuset
partitions. The exclusive CPUs in dying cpusets will only be removed
when cpuset_css_offline() function is called after an RCU delay.

As a result, it is possible that a new partition can be created with
exclusive CPUs that overlap with those of a dying one. When that dying
partition is finally offlined, it removes those overlapping exclusive
CPUs from subpartitions_cpus and maybe isolated_cpus resulting in an
incorrect CPU configuration.

This bug was found when a warning was triggered in
remote_partition_disable() during testing because the subpartitions_cpus
mask was empty.

One possible way to fix this is to iterate the dying cpusets as well and
avoid using the exclusive CPUs in those dying cpusets. However, this
can still cause random partition creation failures or other anomalies
due to racing. A better way to fix this race is to reset the partition
state at the moment when a cpuset is being killed.

Introduce a new css_killed() CSS function pointer and call it, if
defined, before setting CSS_DYING flag in kill_css(). Also update the
css_is_dying() helper to use the CSS_DYING flag introduced by commit
33c35aa48178 ("cgroup: Prevent kill_css() from being called more than
once") for proper synchronization.

Add a new cpuset_css_killed() function to reset the partition state of
a valid partition root if it is being killed.

Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag")
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

show more ...


Revision tags: v6.14
# 093c8812 19-Mar-2025 Yosry Ahmed <[email protected]>

cgroup: rstat: Cleanup flushing functions and locking

Now that the rstat lock is being re-acquired on every CPU iteration in
cgroup_rstat_flush_locked(), having the initially acquire the lock is
unn

cgroup: rstat: Cleanup flushing functions and locking

Now that the rstat lock is being re-acquired on every CPU iteration in
cgroup_rstat_flush_locked(), having the initially acquire the lock is
unnecessary and unclear.

Inline cgroup_rstat_flush_locked() into cgroup_rstat_flush() and move
the lock/unlock calls to the beginning and ending of the loop body to
make the critical section obvious.

cgroup_rstat_flush_hold/release() do not make much sense with the lock
being dropped and reacquired internally. Since it has no external
callers, remove it and explicitly acquire the lock in
cgroup_base_stat_cputime_show() instead.

This leaves the code with a single flushing function,
cgroup_rstat_flush().

Signed-off-by: Yosry Ahmed <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

show more ...


Revision tags: v6.14-rc7
# 4a893bdc 11-Mar-2025 Michal Koutný <[email protected]>

blk-cgroup: Simplify policy files registration

Use one set of files when there is no difference between default and
legacy files, similar to regular subsys files registration. No
functional change.

blk-cgroup: Simplify policy files registration

Use one set of files when there is no difference between default and
legacy files, similar to regular subsys files registration. No
functional change.

Signed-off-by: Michal Koutný <[email protected]>
Acked-by: Jens Axboe <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

show more ...


Revision tags: v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2
# c6f53ed8 29-Jul-2024 David Finkel <[email protected]>

mm, memcg: cg2 memory{.swap,}.peak write handlers

Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.


This patch (of 2):

Other mechanisms for querying the peak memory usage of e

mm, memcg: cg2 memory{.swap,}.peak write handlers

Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.


This patch (of 2):

Other mechanisms for querying the peak memory usage of either a process or
v1 memory cgroup allow for resetting the high watermark. Restore parity
with those mechanisms, but with a less racy API.

For example:
- Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
the high watermark.
- writing "5" to the clear_refs pseudo-file in a processes's proc
directory resets the peak RSS.

This change is an evolution of a previous patch, which mostly copied the
cgroup v1 behavior, however, there were concerns about races/ownership
issues with a global reset, so instead this change makes the reset
filedescriptor-local.

Writing any non-empty string to the memory.peak and memory.swap.peak
pseudo-files reset the high watermark to the current usage for subsequent
reads through that same FD.

Notably, following Johannes's suggestion, this implementation moves the
O(FDs that have written) behavior onto the FD write(2) path. Instead, on
the page-allocation path, we simply add one additional watermark to
conditionally bump per-hierarchy level in the page-counter.

Additionally, this takes Longman's suggestion of nesting the
page-charging-path checks for the two watermarks to reduce the number of
common-case comparisons.

This behavior is particularly useful for work scheduling systems that need
to track memory usage of worker processes/cgroups per-work-item. Since
memory can't be squeezed like CPU can (the OOM-killer has opinions), these
systems need to track the peak memory usage to compute system/container
fullness when binpacking workitems.

Most notably, Vimeo's use-case involves a system that's doing global
binpacking across many Kubernetes pods/containers, and while we can use
PSI for some local decisions about overload, we strive to avoid packing
workloads too tightly in the first place. To facilitate this, we track
the peak memory usage. However, since we run with long-lived workers (to
amortize startup costs) we need a way to track the high watermark while a
work-item is executing. Polling runs the risk of missing short spikes
that last for timescales below the polling interval, and peak memory
tracking at the cgroup level is otherwise perfect for this use-case.

As this data is used to ensure that binpacked work ends up with sufficient
headroom, this use-case mostly avoids the inaccuracies surrounding
reclaimable memory.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Finkel <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Suggested-by: Waiman Long <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Michal Koutný <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2
# 7f36688f 28-May-2024 Yury Norov <[email protected]>

cpumask: cleanup core headers inclusion

Many core headers include cpumask.h for nothing. Drop it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yury No

cpumask: cleanup core headers inclusion

Many core headers include cpumask.h for nothing. Drop it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yury Norov <[email protected]>
Cc: Amit Daniel Kachhap <[email protected]>
Cc: Anna-Maria Behnsen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Yury Norov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 4f9c7ca8 18-Jun-2024 Tejun Heo <[email protected]>

sched: Factor out cgroup weight conversion functions

Factor out sched_weight_from/to_cgroup() which convert between scheduler
shares and cgroup weight. No functional change. The factored out functio

sched: Factor out cgroup weight conversion functions

Factor out sched_weight_from/to_cgroup() which convert between scheduler
shares and cgroup weight. No functional change. The factored out functions
will be used by a new BPF extensible sched_class so that the weights can be
exposed to the BPF programs in a way which is consistent cgroup weights and
easier to interpret.

The weight conversions will be used regardless of cgroup usage. It's just
borrowing the cgroup weight range as it's more intuitive.
CGROUP_WEIGHT_MIN/DFL/MAX constants are moved outside CONFIG_CGROUPS so that
the conversion helpers can always be defined.

v2: The helpers are now defined regardless of COFNIG_CGROUPS.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>

show more ...


Revision tags: v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5
# fc29e04a 16-Apr-2024 Jesper Dangaard Brouer <[email protected]>

cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints

This commit enhances the ability to troubleshoot the global
cgroup_rstat_lock by introducing wrapper helper functions for the lock
along w

cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints

This commit enhances the ability to troubleshoot the global
cgroup_rstat_lock by introducing wrapper helper functions for the lock
along with associated tracepoints.

Although global, the cgroup_rstat_lock helper APIs and tracepoints take
arguments such as cgroup pointer and cpu_in_loop variable. This
adjustment is made because flushing occurs per cgroup despite the lock
being global. Hence, when troubleshooting, it's important to identify the
relevant cgroup. The cpu_in_loop variable is necessary because the global
lock may be released within the main flushing loop that traverses CPUs.
In the tracepoints, the cpu_in_loop value is set to -1 when acquiring the
main lock; otherwise, it denotes the CPU number processed last.

The new feature in this patchset is detecting when lock is contended. The
tracepoints are implemented with production in mind. For minimum overhead
attach to cgroup:cgroup_rstat_lock_contended, which only gets activated
when trylock detects lock is contended. A quick production check for
issues could be done via this perf commands:

perf record -g -e cgroup:cgroup_rstat_lock_contended

Next natural question would be asking how long time do lock contenders
wait for obtaining the lock. This can be answered by measuring the time
between cgroup:cgroup_rstat_lock_contended and cgroup:cgroup_rstat_locked
when args->contended is set. Like this bpftrace script:

bpftrace -e '
tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs}
tracepoint:cgroup:cgroup_rstat_locked {
if (args->contended) {
@wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}}
interval:s:1 {time("%H:%M:%S "); print(@wait_ns); }'

Extending with time spend holding the lock will be more expensive as this
also looks at all the non-contended cases.
Like this bpftrace script:

bpftrace -e '
tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs}
tracepoint:cgroup:cgroup_rstat_locked { @locked[tid]=nsecs;
if (args->contended) {
@wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}}
tracepoint:cgroup:cgroup_rstat_unlock {
@locked_ns=hist(nsecs-@locked[tid]); delete(@locked[tid]);}
interval:s:1 {time("%H:%M:%S "); print(@wait_ns);print(@locked_ns); }'

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

show more ...


Revision tags: v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6
# aecd408b 29-Oct-2023 Yafang Shao <[email protected]>

cgroup: Add a new helper for cgroup1 hierarchy

A new helper is added for cgroup1 hierarchy:

- task_get_cgroup1
Acquires the associated cgroup of a task within a specific cgroup1
hierarchy. The

cgroup: Add a new helper for cgroup1 hierarchy

A new helper is added for cgroup1 hierarchy:

- task_get_cgroup1
Acquires the associated cgroup of a task within a specific cgroup1
hierarchy. The cgroup1 hierarchy is identified by its hierarchy ID.

This helper function is added to facilitate the tracing of tasks within
a particular container or cgroup dir in BPF programs. It's important to
note that this helper is designed specifically for cgroup1 only.

tj: Use irsqsave/restore as suggested by Hou Tao <[email protected]>.

Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Yafang Shao <[email protected]>
Cc: Hou Tao <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

show more ...


Revision tags: v6.6-rc7
# 6da88306 18-Oct-2023 Chuyi Zhou <[email protected]>

cgroup: Prepare for using css_task_iter_*() in BPF

This patch makes some preparations for using css_task_iter_*() in BPF
Program.

1. Flags CSS_TASK_ITER_* are #define-s and it's not easy for bpf pr

cgroup: Prepare for using css_task_iter_*() in BPF

This patch makes some preparations for using css_task_iter_*() in BPF
Program.

1. Flags CSS_TASK_ITER_* are #define-s and it's not easy for bpf prog to
use them. Convert them to enum so bpf prog can take them from vmlinux.h.

2. In the next patch we will add css_task_iter_*() in common kfuncs which
is not safe. Since css_task_iter_*() does spin_unlock_irq() which might
screw up irq flags depending on the context where bpf prog is running.
So we should use irqsave/irqrestore here and the switching is harmless.

Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Chuyi Zhou <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

show more ...


Revision tags: v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6
# d16b3af4 10-Jun-2023 Miaohe Lin <[email protected]>

cgroup: remove unused task_cgroup_path()

task_cgroup_path() is not used anymore. So remove it.

Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>


Revision tags: v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3
# 0a2dc6ac 21-Apr-2023 Yosry Ahmed <[email protected]>

cgroup: remove cgroup_rstat_flush_atomic()

Previous patches removed the only caller of cgroup_rstat_flush_atomic().
Remove the function and simplify the code.

Link: https://lkml.kernel.org/r/20230

cgroup: remove cgroup_rstat_flush_atomic()

Previous patches removed the only caller of cgroup_rstat_flush_atomic().
Remove the function and simplify the code.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.3-rc7, v6.3-rc6, v6.3-rc5
# 8bff9a04 30-Mar-2023 Yosry Ahmed <[email protected]>

cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"

Patch series "memcg: avoid flushing stats atomically where possible", v3.

rstat flushing is an expensive operation that scales with the numbe

cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"

Patch series "memcg: avoid flushing stats atomically where possible", v3.

rstat flushing is an expensive operation that scales with the number of
cpus and the number of cgroups in the system. The purpose of this series
is to minimize the contexts where we flush stats atomically.

Patches 1 and 2 are cleanups requested during reviews of prior versions of
this series.

Patch 3 makes sure we never try to flush from within an irq context.

Patches 4 to 7 introduce separate variants of mem_cgroup_flush_stats() for
atomic and non-atomic flushing, and make sure we only flush the stats
atomically when necessary.

Patch 8 is a slightly tangential optimization that limits the work done by
rstat flushing in some scenarios.


This patch (of 8):

cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as
"irqs are disabled throughout", which is what the current implementation
does (currently under discussion [1]), but is not the intention. The
intention is that this function is safe to call from atomic contexts.
Name it as such.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1
# 4a7ba45b 08-Dec-2022 Tejun Heo <[email protected]>

memcg: fix possible use-after-free in memcg_write_event_control()

memcg_write_event_control() accesses the dentry->d_name of the specified
control fd to route the write call. As a cgroup interface

memcg: fix possible use-after-free in memcg_write_event_control()

memcg_write_event_control() accesses the dentry->d_name of the specified
control fd to route the write call. As a cgroup interface file can't be
renamed, it's safe to access d_name as long as the specified file is a
regular cgroup file. Also, as these cgroup interface files can't be
removed before the directory, it's safe to access the parent too.

Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a
call to __file_cft() which verified that the specified file is a regular
cgroupfs file before further accesses. The cftype pointer returned from
__file_cft() was no longer necessary and the commit inadvertently dropped
the file type check with it allowing any file to slip through. With the
invarients broken, the d_name and parent accesses can now race against
renames and removals of arbitrary files and cause use-after-free's.

Fix the bug by resurrecting the file type check in __file_cft(). Now that
cgroupfs is implemented through kernfs, checking the file operations needs
to go through a layer of indirection. Instead, let's check the superblock
and dentry type.

Link: https://lkml.kernel.org/r/Y5FRm/[email protected]
Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft")
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Jann Horn <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: <[email protected]> [3.14+]
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# fbf83212 08-Dec-2022 Tejun Heo <[email protected]>

memcg: Fix possible use-after-free in memcg_write_event_control()

memcg_write_event_control() accesses the dentry->d_name of the specified
control fd to route the write call. As a cgroup interface

memcg: Fix possible use-after-free in memcg_write_event_control()

memcg_write_event_control() accesses the dentry->d_name of the specified
control fd to route the write call. As a cgroup interface file can't be
renamed, it's safe to access d_name as long as the specified file is a
regular cgroup file. Also, as these cgroup interface files can't be
removed before the directory, it's safe to access the parent too.

Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a
call to __file_cft() which verified that the specified file is a regular
cgroupfs file before further accesses. The cftype pointer returned from
__file_cft() was no longer necessary and the commit inadvertently
dropped the file type check with it allowing any file to slip through.
With the invarients broken, the d_name and parent accesses can now race
against renames and removals of arbitrary files and cause
use-after-free's.

Fix the bug by resurrecting the file type check in __file_cft(). Now
that cgroupfs is implemented through kernfs, checking the file
operations needs to go through a layer of indirection. Instead, let's
check the superblock and dentry type.

Signed-off-by: Tejun Heo <[email protected]>
Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft")
Cc: [email protected] # v3.14+
Reported-by: Jann Horn <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


Revision tags: v6.1-rc8
# c62256dd 30-Nov-2022 Jens Axboe <[email protected]>

Revert "blk-cgroup: Flush stats at blkgs destruction path"

This reverts commit dae590a6c96c799434e0ff8156ef29b88c257e60.

We've had a few reports on this causing a crash at boot time, because
of a r

Revert "blk-cgroup: Flush stats at blkgs destruction path"

This reverts commit dae590a6c96c799434e0ff8156ef29b88c257e60.

We've had a few reports on this causing a crash at boot time, because
of a reference issue. While this problem seemginly did exist before
the patch and needs solving separately, this patch makes it a lot
easier to trigger.

Link: https://lore.kernel.org/linux-block/CA+QYu4oxiRKC6hJ7F27whXy-PRBx=Tvb+-7TQTONN8qTtV3aDA@mail.gmail.com/
Link: https://lore.kernel.org/linux-block/[email protected]/
Signed-off-by: Jens Axboe <[email protected]>

show more ...


Revision tags: v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4
# dae590a6 05-Nov-2022 Waiman Long <[email protected]>

blk-cgroup: Flush stats at blkgs destruction path

As noted by Michal, the blkg_iostat_set's in the lockless list
hold reference to blkg's to protect against their removal. Those
blkg's hold referenc

blk-cgroup: Flush stats at blkgs destruction path

As noted by Michal, the blkg_iostat_set's in the lockless list
hold reference to blkg's to protect against their removal. Those
blkg's hold reference to blkcg. When a cgroup is being destroyed,
cgroup_rstat_flush() is only called at css_release_work_fn() which is
called when the blkcg reference count reaches 0. This circular dependency
will prevent blkcg from being freed until some other events cause
cgroup_rstat_flush() to be called to flush out the pending blkcg stats.

To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush()
function to flush stats for a given css and cpu and call it at the blkgs
destruction path, blkcg_destroy_blkgs(), whenever there are still some
pending stats to be flushed. This will ensure that blkcg reference
count can reach 0 ASAP.

Signed-off-by: Waiman Long <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

show more ...


# 79a7f41f 31-Oct-2022 Tejun Heo <[email protected]>

cgroup: cgroup refcnt functions should be exported when CONFIG_DEBUG_CGROUP_REF

6ab428604f72 ("cgroup: Implement DEBUG_CGROUP_REF") added a config option
which forces cgroup refcnt functions to be n

cgroup: cgroup refcnt functions should be exported when CONFIG_DEBUG_CGROUP_REF

6ab428604f72 ("cgroup: Implement DEBUG_CGROUP_REF") added a config option
which forces cgroup refcnt functions to be not inlined so that they can be
kprobed for debugging. However, it forgot export them when the config is
enabled breaking modules which make use of css reference counting.

Fix it by adding CGROUP_REF_EXPORT() macro to cgroup_refcnt.h which is
defined to EXPORT_SYMBOL_GPL when CONFIG_DEBUG_CGROUP_REF is set.

Signed-off-by: Tejun Heo <[email protected]>
Fixes: 6ab428604f72 ("cgroup: Implement DEBUG_CGROUP_REF")

show more ...


Revision tags: v6.1-rc3
# 6ab42860 28-Oct-2022 Tejun Heo <[email protected]>

cgroup: Implement DEBUG_CGROUP_REF

It's really difficult to debug when cgroup or css refs leak. Let's add a
debug option to force the refcnt function to not be inlined so that they can
be kprobed fo

cgroup: Implement DEBUG_CGROUP_REF

It's really difficult to debug when cgroup or css refs leak. Let's add a
debug option to force the refcnt function to not be inlined so that they can
be kprobed for debugging.

Signed-off-by: Tejun Heo <[email protected]>

show more ...


Revision tags: v6.1-rc2, v6.1-rc1
# a6d1ce59 11-Oct-2022 Yosry Ahmed <[email protected]>

cgroup: add cgroup_v1v2_get_from_[fd/file]()

Add cgroup_v1v2_get_from_fd() and cgroup_v1v2_get_from_file() that
support both cgroup1 and cgroup2.

Signed-off-by: Yosry Ahmed <[email protected]>

cgroup: add cgroup_v1v2_get_from_[fd/file]()

Add cgroup_v1v2_get_from_fd() and cgroup_v1v2_get_from_file() that
support both cgroup1 and cgroup2.

Signed-off-by: Yosry Ahmed <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

show more ...


Revision tags: v6.0, v6.0-rc7, v6.0-rc6
# 354ed597 18-Sep-2022 Yu Zhao <[email protected]>

mm: multi-gen LRU: kill switch

Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
0x0001: the multi-gen LRU core
0x0002: walking page table, when arch_

mm: multi-gen LRU: kill switch

Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
0x0001: the multi-gen LRU core
0x0002: walking page table, when arch_has_hw_pte_young() returns
true
0x0004: clearing the accessed bit in non-leaf PMD entries, when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
[yYnN]: apply to all the components above
E.g.,
echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0007
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005

NB: the page table walks happen on the scale of seconds under heavy memory
pressure, in which case the mmap_lock contention is a lesser concern,
compared with the LRU lock contention and the I/O congestion. So far the
only well-known case of the mmap_lock contention happens on Android, due
to Scudo [1] which allocates several thousand VMAs for merely a few
hundred MBs. The SPF and the Maple Tree also have provided their own
assessments [2][3]. However, if walking page tables does worsen the
mmap_lock contention, the kill switch can be used to disable it. In this
case the multi-gen LRU will suffer a minor performance degradation, as
shown previously.

Clearing the accessed bit in non-leaf PMD entries can also be disabled,
since this behavior was not tested on x86 varieties other than Intel and
AMD.

[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/r/[email protected]/
[3] https://lore.kernel.org/r/[email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michael Larabel <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.0-rc5, v6.0-rc4, v6.0-rc3
# 57899a66 25-Aug-2022 Chengming Zhou <[email protected]>

sched/psi: Consolidate cgroup_psi()

cgroup_psi() can't return psi_group for root cgroup, so we have many
open code "psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi".

This patch move cgroup_ps

sched/psi: Consolidate cgroup_psi()

cgroup_psi() can't return psi_group for root cgroup, so we have many
open code "psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi".

This patch move cgroup_psi() definition to <linux/psi.h>, in which
we can return psi_system for root cgroup, so can handle all cgroups.

Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# e2691f6b 28-Aug-2022 Tejun Heo <[email protected]>

cgroup: Implement cgroup_file_show()

Add cgroup_file_show() which allows toggling visibility of a cgroup file
using the new kernfs_show(). This will be used to hide psi interface files
on cgroups wh

cgroup: Implement cgroup_file_show()

Add cgroup_file_show() which allows toggling visibility of a cgroup file
using the new kernfs_show(). This will be used to hide psi interface files
on cgroups where it's disabled.

Cc: Chengming Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Tested-by: Chengming Zhou <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

show more ...


# fa7e439c 26-Aug-2022 Michal Koutný <[email protected]>

cgroup: Homogenize cgroup_get_from_id() return value

Cgroup id is user provided datum hence extend its return domain to
include possible error reason (similar to cgroup_get_from_fd()).

This change

cgroup: Homogenize cgroup_get_from_id() return value

Cgroup id is user provided datum hence extend its return domain to
include possible error reason (similar to cgroup_get_from_fd()).

This change also fixes commit d4ccaf58a847 ("bpf: Introduce cgroup
iter") that would use NULL instead of proper error handling in
d4ccaf58a847 ("bpf: Introduce cgroup iter").

Additionally, neither of: fc_appid_store, bpf_iter_attach_cgroup,
mem_cgroup_get_from_ino (callers of cgroup_get_from_fd) is built without
CONFIG_CGROUPS (depends via CONFIG_BLK_CGROUP, direct, transitive
CONFIG_MEMCG respectively) transitive, so drop the singular definition
not needed with !CONFIG_CGROUPS.

Fixes: d4ccaf58a847 ("bpf: Introduce cgroup iter")
Signed-off-by: Michal Koutný <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

show more ...


Revision tags: v6.0-rc2, v6.0-rc1
# d7ae5818 06-Aug-2022 Hao Jia <[email protected]>

sched/psi: Remove redundant cgroup_psi() when !CONFIG_CGROUPS

cgroup_psi() is only called under CONFIG_CGROUPS.
We don't need cgroup_psi() when !CONFIG_CGROUPS,
so we can remove it in this case.

Si

sched/psi: Remove redundant cgroup_psi() when !CONFIG_CGROUPS

cgroup_psi() is only called under CONFIG_CGROUPS.
We don't need cgroup_psi() when !CONFIG_CGROUPS,
so we can remove it in this case.

Signed-off-by: Hao Jia <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

show more ...


Revision tags: v5.19
# 7f203bc8 29-Jul-2022 Tejun Heo <[email protected]>

cgroup: Replace cgroup->ancestor_ids[] with ->ancestors[]

Every cgroup knows all its ancestors through its ->ancestor_ids[]. There's
no advantage to remembering the IDs instead of the pointers direc

cgroup: Replace cgroup->ancestor_ids[] with ->ancestors[]

Every cgroup knows all its ancestors through its ->ancestor_ids[]. There's
no advantage to remembering the IDs instead of the pointers directly and
this makes the array useless for finding an actual ancestor cgroup forcing
cgroup_ancestor() to iteratively walk up the hierarchy instead. Let's
replace cgroup->ancestor_ids[] with ->ancestors[] and remove the walking-up
from cgroup_ancestor().

While at it, improve comments around cgroup_root->cgrp_ancestor_storage.

This patch shouldn't cause user-visible behavior differences.

v2: Update cgroup_ancestor() to use ->ancestors[].

v3: cgroup_root->cgrp_ancestor_storage's type is updated to match
cgroup->ancestors[]. Better comments.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Namhyung Kim <[email protected]>

show more ...


12345678910>>...17