|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2 |
|
| #
f814bdda |
| 23-Jan-2024 |
Jan Kara <[email protected]> |
blk-wbt: Fix detection of dirty-throttled tasks
The detection of dirty-throttled tasks in blk-wbt has been subtly broken since its beginning in 2016. Namely if we are doing cgroup writeback and the
blk-wbt: Fix detection of dirty-throttled tasks
The detection of dirty-throttled tasks in blk-wbt has been subtly broken since its beginning in 2016. Namely if we are doing cgroup writeback and the throttled task is not in the root cgroup, balance_dirty_pages() will set dirty_sleep for the non-root bdi_writeback structure. However blk-wbt checks dirty_sleep only in the root cgroup bdi_writeback structure. Thus detection of recently throttled tasks is not working in this case (we noticed this when we switched to cgroup v2 and suddently writeback was slow).
Since blk-wbt has no easy way to get to proper bdi_writeback and furthermore its intention has always been to work on the whole device rather than on individual cgroups, just move the dirty_sleep timestamp from bdi_writeback to backing_dev_info. That fixes the checking for recently throttled task and saves memory for everybody as a bonus.
CC: [email protected] Fixes: b57d74aff9ab ("writeback: track if we're sleeping on progress in balance_dirty_pages()") Signed-off-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: fixup indentation errors] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8 |
|
| #
fa7d574b |
| 19-Jul-2022 |
Xiu Jianfeng <[email protected]> |
bdi: remove enum wb_congested_state
enum wb_congested_state and the member 'congested' in bdi_writeback are useless since commit a88f2096d5a2 ("remove congestion tracking framework"), so remove it.
bdi: remove enum wb_congested_state
enum wb_congested_state and the member 'congested' in bdi_writeback are useless since commit a88f2096d5a2 ("remove congestion tracking framework"), so remove it.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Xiu Jianfeng <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: NeilBrown <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1 |
|
| #
a88f2096 |
| 22-Mar-2022 |
NeilBrown <[email protected]> |
remove congestion tracking framework
This framework is no longer used - so discard it.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: NeilBro
remove congestion tracking framework
This framework is no longer used - so discard it.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: NeilBrown <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Lars Ellenberg <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Paolo Valente <[email protected]> Cc: Philipp Reisner <[email protected]> Cc: Ryusuke Konishi <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1 |
|
| #
efee1713 |
| 05-Nov-2021 |
Christoph Hellwig <[email protected]> |
mm: simplify bdi refcounting
Move grabbing and releasing the bdi refcount out of the common wb_init/wb_exit helpers into code that is only used for the non-default memcg driven bdi_writeback structu
mm: simplify bdi refcounting
Move grabbing and releasing the bdi refcount out of the common wb_init/wb_exit helpers into code that is only used for the non-default memcg driven bdi_writeback structures.
[[email protected]: add comment] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix typo]
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Vignesh Raghavendra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1 |
|
| #
45a2966f |
| 02-Sep-2021 |
Jan Kara <[email protected]> |
writeback: fix bandwidth estimate for spiky workload
Michael Stapelberg has reported that for workload with short big spikes of writes (GCC linker seem to trigger this frequently) the write throughp
writeback: fix bandwidth estimate for spiky workload
Michael Stapelberg has reported that for workload with short big spikes of writes (GCC linker seem to trigger this frequently) the write throughput is heavily underestimated and tends to steadily sink until it reaches zero. This has rather bad impact on writeback throttling (causing stalls). The problem is that writeback throughput estimate gets updated at most once per 200 ms. One update happens early after we submit pages for writeback (at that point writeout of only small fraction of pages is completed and thus observed throughput is tiny). Next update happens only during the next write spike (updates happen only from inode writeback and dirty throttling code) and if that is more than 1s after previous spike, we decide system was idle and just ignore whatever was written until this moment.
Fix the problem by making sure writeback throughput estimate is also updated shortly after writeback completes to get reasonable estimate of throughput for spiky workloads.
[[email protected]: avoid division by 0 in wb_update_dirty_ratelimit()]
Link: https://lore.kernel.org/lkml/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reported-by: Michael Stapelberg <[email protected]> Tested-by: Michael Stapelberg <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
| #
633a2abb |
| 02-Sep-2021 |
Jan Kara <[email protected]> |
writeback: track number of inodes under writeback
Patch series "writeback: Fix bandwidth estimates", v4.
Fix estimate of writeback throughput when device is not fully busy doing writeback. Michael
writeback: track number of inodes under writeback
Patch series "writeback: Fix bandwidth estimates", v4.
Fix estimate of writeback throughput when device is not fully busy doing writeback. Michael Stapelberg has reported that such workload (e.g. generated by linking) tends to push estimated throughput down to 0 and as a result writeback on the device is practically stalled.
The first three patches fix the reported issue, the remaining two patches are unrelated cleanups of problems I've noticed when reading the code.
This patch (of 4):
Track number of inodes under writeback for each bdi_writeback structure. We will use this to decide whether wb does any IO and so we can estimate its writeback throughput. In principle we could use number of pages under writeback (WB_WRITEBACK counter) for this however normal percpu counter reads are too inaccurate for our purposes and summing the counter is too expensive.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: Michael Stapelberg <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1 |
|
| #
c22d70a1 |
| 29-Jun-2021 |
Roman Gushchin <[email protected]> |
writeback, cgroup: release dying cgwbs by switching attached inodes
Asynchronously try to release dying cgwbs by switching attached inodes to the nearest living ancestor wb. It helps to get rid of
writeback, cgroup: release dying cgwbs by switching attached inodes
Asynchronously try to release dying cgwbs by switching attached inodes to the nearest living ancestor wb. It helps to get rid of per-cgroup writeback structures themselves and of pinned memory and block cgroups, which are significantly larger structures (mostly due to large per-cpu statistics data). This prevents memory waste and helps to avoid different scalability problems caused by large piles of dying cgroups.
Reuse the existing mechanism of inode switching used for foreign inode detection. To speed things up batch up to 115 inode switching in a single operation (the maximum number is selected so that the resulting struct inode_switch_wbs_context can fit into 1024 bytes). Because every switching consists of two steps divided by an RCU grace period, it would be too slow without batching. Please note that the whole batch counts as a single operation (when increasing/decreasing isw_nr_in_flight). This allows to keep umounting working (flush the switching queue), however prevents cleanups from consuming the whole switching quota and effectively blocking the frn switching.
A cgwb cleanup operation can fail due to different reasons (e.g. not enough memory, the cgwb has an in-flight/pending io, an attached inode in a wrong state, etc). In this case the next scheduled cleanup will make a new attempt. An attempt is made each time a new cgwb is offlined (in other words a memcg and/or a blkcg is deleted by a user). In the future an additional attempt scheduled by a timer can be implemented.
[[email protected]: replace open-coded "115" with arithmetic] Link: https://lkml.kernel.org/r/YMEcSBcq/[email protected] [[email protected]: add smp_mb() to inode_prepare_wbs_switch()] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix documentation] Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Dennis Zhou <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
| #
f5fbe6b7 |
| 29-Jun-2021 |
Roman Gushchin <[email protected]> |
writeback, cgroup: support switching multiple inodes at once
Currently only a single inode can be switched to another writeback structure at once. That means to switch an inode a separate inode_swi
writeback, cgroup: support switching multiple inodes at once
Currently only a single inode can be switched to another writeback structure at once. That means to switch an inode a separate inode_switch_wbs_context structure must be allocated, and a separate rcu callback and work must be scheduled.
It's fine for the existing ad-hoc switching, which is not happening that often, but sub-optimal for massive switching required in order to release a writeback structure. To prepare for it, let's add a support for switching multiple inodes at once.
Instead of containing a single inode pointer, inode_switch_wbs_context will contain a NULL-terminated array of inode pointers. inode_do_switch_wbs() will be called for each inode.
To optimize the locking bdi->wb_switch_rwsem, old_wb's and new_wb's list_locks will be acquired and released only once altogether for all inodes. wb_wakeup() will be also be called only once. Instead of calling wb_put(old_wb) after each successful switch, wb_put_many() is introduced and used.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Tejun Heo <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Dennis Zhou <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
| #
f3b6a6df |
| 29-Jun-2021 |
Roman Gushchin <[email protected]> |
writeback, cgroup: keep list of inodes attached to bdi_writeback
Currently there is no way to iterate over inodes attached to a specific cgwb structure. It limits the ability to efficiently reclaim
writeback, cgroup: keep list of inodes attached to bdi_writeback
Currently there is no way to iterate over inodes attached to a specific cgwb structure. It limits the ability to efficiently reclaim the writeback structure itself and associated memory and block cgroup structures without scanning all inodes belonging to a sb, which can be prohibitively expensive.
While dirty/in-active-writeback an inode belongs to one of the bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time. Once cleaned up, it's removed from all io lists. So the inode->i_io_list can be reused to maintain the list of inodes, attached to a bdi_writeback structure.
This patch introduces a new wb->b_attached list, which contains all inodes which were dirty at least once and are attached to the given cgwb. Inodes attached to the root bdi_writeback structures are never placed on such list. The following patch will use this list to try to release cgwbs structures more efficiently.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Suggested-by: Jan Kara <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Dennis Zhou <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1, v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1, v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5, v5.8-rc4 |
|
| #
21cf8661 |
| 01-Jul-2020 |
Christoph Hellwig <[email protected]> |
writeback: remove bdi->congested_fn
Except for pktdvd, the only places setting congested bits are file systems that allocate their own backing_dev_info structures. And pktdvd is a deprecated driver
writeback: remove bdi->congested_fn
Except for pktdvd, the only places setting congested bits are file systems that allocate their own backing_dev_info structures. And pktdvd is a deprecated driver that isn't useful in stack setup either. So remove the dead congested_fn stacking infrastructure.
Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Song Liu <[email protected]> Acked-by: David Sterba <[email protected]> [axboe: fixup unused variables in bcache/request.c] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
8c911f3d |
| 01-Jul-2020 |
Christoph Hellwig <[email protected]> |
writeback: remove struct bdi_writeback_congested
We never set any congested bits in the group writeback instances of it. And for the simpler bdi-wide case a simple scalar field is all that that is n
writeback: remove struct bdi_writeback_congested
We never set any congested bits in the group writeback instances of it. And for the simpler bdi-wide case a simple scalar field is all that that is needed.
Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
492d76b2 |
| 01-Jul-2020 |
Christoph Hellwig <[email protected]> |
writeback: remove {set,clear}_wb_congested
Just merge them into their only callers.
Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v5.8-rc3, v5.8-rc2, v5.8-rc1, v5.7, v5.7-rc7, v5.7-rc6, v5.7-rc5 |
|
| #
1cd925d5 |
| 04-May-2020 |
Christoph Hellwig <[email protected]> |
bdi: remove the name field in struct backing_dev_info
The name is only printed for a not registered bdi in writeback. Use the device name there as is more useful anyway for the unlike case that the
bdi: remove the name field in struct backing_dev_info
The name is only printed for a not registered bdi in writeback. Use the device name there as is more useful anyway for the unlike case that the warning triggers.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
6bd87eec |
| 04-May-2020 |
Christoph Hellwig <[email protected]> |
bdi: add a ->dev_name field to struct backing_dev_info
Cache a copy of the name for the life time of the backing_dev_info structure so that we can reference it even after unregistering.
Fixes: 68f2
bdi: add a ->dev_name field to struct backing_dev_info
Cache a copy of the name for the life time of the backing_dev_info structure so that we can reference it even after unregistering.
Fixes: 68f23b89067f ("memcg: fix a crash in wb_workfn when a device disappears") Reported-by: Yufen Yu <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v5.7-rc4, v5.7-rc3, v5.7-rc2 |
|
| #
c4b4c2a7 |
| 13-Apr-2020 |
Zhiqiang Liu <[email protected]> |
buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.
free_more_memory func has been completely removed in commit bc48f001de12 ("buffer: eliminate the need to call free_more_memory() i
buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.
free_more_memory func has been completely removed in commit bc48f001de12 ("buffer: eliminate the need to call free_more_memory() in __getblk_slow()")
So comment and `WB_REASON_FREE_MORE_MEM` reason about free_more_memory are no longer needed.
Fixes: bc48f001de12 ("buffer: eliminate the need to call free_more_memory() in __getblk_slow()") Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Zhiqiang Liu <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v5.7-rc1, v5.6, v5.6-rc7, v5.6-rc6, v5.6-rc5, v5.6-rc4, v5.6-rc3, v5.6-rc2, v5.6-rc1, v5.5, v5.5-rc7, v5.5-rc6, v5.5-rc5, v5.5-rc4, v5.5-rc3, v5.5-rc2, v5.5-rc1, v5.4, v5.4-rc8, v5.4-rc7, v5.4-rc6, v5.4-rc5, v5.4-rc4, v5.4-rc3, v5.4-rc2, v5.4-rc1, v5.3, v5.3-rc8, v5.3-rc7 |
|
| #
97b27821 |
| 26-Aug-2019 |
Tejun Heo <[email protected]> |
writeback, memcg: Implement foreign dirty flushing
There's an inherent mismatch between memcg and writeback. The former trackes ownership per-page while the latter per-inode. This was a deliberate
writeback, memcg: Implement foreign dirty flushing
There's an inherent mismatch between memcg and writeback. The former trackes ownership per-page while the latter per-inode. This was a deliberate design decision because honoring per-page ownership in the writeback path is complicated, may lead to higher CPU and IO overheads and deemed unnecessary given that write-sharing an inode across different cgroups isn't a common use-case.
Combined with inode majority-writer ownership switching, this works well enough in most cases but there are some pathological cases. For example, let's say there are two cgroups A and B which keep writing to different but confined parts of the same inode. B owns the inode and A's memory is limited far below B's. A's dirty ratio can rise enough to trigger balance_dirty_pages() sleeps but B's can be low enough to avoid triggering background writeback. A will be slowed down without a way to make writeback of the dirty pages happen.
This patch implements foreign dirty recording and foreign mechanism so that when a memcg encounters a condition as above it can trigger flushes on bdi_writebacks which can clean its pages. Please see the comment on top of mem_cgroup_track_foreign_dirty_slowpath() for details.
A reproducer follows.
write-range.c::
#include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <sys/types.h>
static const char *usage = "write-range FILE START SIZE\n";
int main(int argc, char **argv) { int fd; unsigned long start, size, end, pos; char *endp; char buf[4096];
if (argc < 4) { fprintf(stderr, usage); return 1; }
fd = open(argv[1], O_WRONLY); if (fd < 0) { perror("open"); return 1; }
start = strtoul(argv[2], &endp, 0); if (*endp != '\0') { fprintf(stderr, usage); return 1; }
size = strtoul(argv[3], &endp, 0); if (*endp != '\0') { fprintf(stderr, usage); return 1; }
end = start + size;
while (1) { for (pos = start; pos < end; ) { long bread, bwritten = 0;
if (lseek(fd, pos, SEEK_SET) < 0) { perror("lseek"); return 1; }
bread = read(0, buf, sizeof(buf) < end - pos ? sizeof(buf) : end - pos); if (bread < 0) { perror("read"); return 1; } if (bread == 0) return 0;
while (bwritten < bread) { long this;
this = write(fd, buf + bwritten, bread - bwritten); if (this < 0) { perror("write"); return 1; }
bwritten += this; pos += bwritten; } } } }
repro.sh::
#!/bin/bash
set -e set -x
sysctl -w vm.dirty_expire_centisecs=300000 sysctl -w vm.dirty_writeback_centisecs=300000 sysctl -w vm.dirtytime_expire_seconds=300000 echo 3 > /proc/sys/vm/drop_caches
TEST=/sys/fs/cgroup/test A=$TEST/A B=$TEST/B
mkdir -p $A $B echo "+memory +io" > $TEST/cgroup.subtree_control echo $((1<<30)) > $A/memory.high echo $((32<<30)) > $B/memory.high
rm -f testfile touch testfile fallocate -l 4G testfile
echo "Starting B"
(echo $BASHPID > $B/cgroup.procs pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &
echo "Waiting 10s to ensure B claims the testfile inode" sleep 5 sync sleep 5 sync echo "Starting A"
(echo $BASHPID > $A/cgroup.procs pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))
v2: Added comments explaining why the specific intervals are being used.
v3: Use 0 @nr when calling cgroup_writeback_by_id() to use best-effort flushing while avoding possible livelocks.
v4: Use get_jiffies_64() and time_before/after64() instead of raw jiffies_64 and arthimetic comparisons as suggested by Jan.
Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
34f8fe50 |
| 26-Aug-2019 |
Tejun Heo <[email protected]> |
bdi: Add bdi->id
There currently is no way to universally identify and lookup a bdi without holding a reference and pointer to it. This patch adds an non-recycling bdi->id and implements bdi_get_by
bdi: Add bdi->id
There currently is no way to universally identify and lookup a bdi without holding a reference and pointer to it. This patch adds an non-recycling bdi->id and implements bdi_get_by_id() which looks up bdis by their ids. This will be used by memcg foreign inode flushing.
I left bdi_list alone for simplicity and because while rb_tree does support rcu assignment it doesn't seem to guarantee lossless walk when walk is racing aginst tree rebalance operations.
Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
5b9cce4c |
| 26-Aug-2019 |
Tejun Heo <[email protected]> |
writeback: Generalize and expose wb_completion
wb_completion is used to track writeback completions. We want to use it from memcg side for foreign inode flushes. This patch updates it to remember
writeback: Generalize and expose wb_completion
wb_completion is used to track writeback completions. We want to use it from memcg side for foreign inode flushes. This patch updates it to remember the target waitq instead of assuming bdi->wb_waitq and expose it outside of fs-writeback.c.
Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v5.3-rc6, v5.3-rc5, v5.3-rc4, v5.3-rc3, v5.3-rc2, v5.3-rc1, v5.2, v5.2-rc7, v5.2-rc6, v5.2-rc5, v5.2-rc4, v5.2-rc3, v5.2-rc2, v5.2-rc1, v5.1, v5.1-rc7, v5.1-rc6, v5.1-rc5, v5.1-rc4, v5.1-rc3, v5.1-rc2, v5.1-rc1, v5.0, v5.0-rc8, v5.0-rc7, v5.0-rc6, v5.0-rc5, v5.0-rc4 |
|
| #
2d146b92 |
| 22-Jan-2019 |
Greg Kroah-Hartman <[email protected]> |
backing-dev: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code lo
backing-dev: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this.
And as the return value does not matter at all, no need to save the dentry in struct backing_dev_info, so delete it.
Cc: Andrew Morton <[email protected]> Cc: Anders Roxell <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Michal Hocko <[email protected]> Cc: [email protected] Reviewed-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
show more ...
|
|
Revision tags: v5.0-rc3, v5.0-rc2, v5.0-rc1, v4.20, v4.20-rc7, v4.20-rc6, v4.20-rc5, v4.20-rc4, v4.20-rc3, v4.20-rc2, v4.20-rc1, v4.19, v4.19-rc8, v4.19-rc7, v4.19-rc6, v4.19-rc5, v4.19-rc4, v4.19-rc3, v4.19-rc2, v4.19-rc1, v4.18, v4.18-rc8, v4.18-rc7, v4.18-rc6, v4.18-rc5, v4.18-rc4, v4.18-rc3, v4.18-rc2, v4.18-rc1, v4.17, v4.17-rc7, v4.17-rc6, v4.17-rc5, v4.17-rc4, v4.17-rc3, v4.17-rc2, v4.17-rc1, v4.16, v4.16-rc7, v4.16-rc6, v4.16-rc5, v4.16-rc4, v4.16-rc3, v4.16-rc2, v4.16-rc1, v4.15, v4.15-rc9, v4.15-rc8, v4.15-rc7, v4.15-rc6, v4.15-rc5, v4.15-rc4 |
|
| #
7fc5854f |
| 12-Dec-2017 |
Tejun Heo <[email protected]> |
writeback: synchronize sync(2) against cgroup writeback membership switches
sync_inodes_sb() can race against cgwb (cgroup writeback) membership switches and fail to writeback some inodes. For exam
writeback: synchronize sync(2) against cgroup writeback membership switches
sync_inodes_sb() can race against cgwb (cgroup writeback) membership switches and fail to writeback some inodes. For example, if an inode switches to another wb while sync_inodes_sb() is in progress, the new wb might not be visible to bdi_split_work_to_wbs() at all or the inode might jump from a wb which hasn't issued writebacks yet to one which already has.
This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb switch path against sync_inodes_sb() so that sync_inodes_sb() is guaranteed to see all the target wbs and inodes can't jump wbs to escape syncing.
v2: Fixed misplaced rwsem init. Spotted by Jiufei.
Signed-off-by: Tejun Heo <[email protected]> Reported-by: Jiufei Xue <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Acked-by: Jan Kara <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
368686a9 |
| 28-Dec-2018 |
Anders Roxell <[email protected]> |
writeback: don't decrement wb->refcnt if !wb->bdi
This happened while running in qemu-system-aarch64, the AMBA PL011 UART driver when enabling CONFIG_DEBUG_TEST_DRIVER_REMOVE. arch_initcall(pl011_in
writeback: don't decrement wb->refcnt if !wb->bdi
This happened while running in qemu-system-aarch64, the AMBA PL011 UART driver when enabling CONFIG_DEBUG_TEST_DRIVER_REMOVE. arch_initcall(pl011_init) came before subsys_initcall(default_bdi_init), devtmpfs' handle_remove() crashes because the reference count is a NULL pointer only because wb->bdi hasn't been initialized yet.
Rework so that wb_put have an extra check if wb->bdi before decrement wb->refcnt and also add a WARN_ON_ONCE to get a warning if it happens again in other drivers.
Link: http://lkml.kernel.org/r/[email protected] Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks") Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Anders Roxell <[email protected]> Co-developed-by: Arnd Bergmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
| #
347a28b5 |
| 30-Oct-2018 |
Anders Roxell <[email protected]> |
writeback: don't decrement wb->refcnt if !wb->bdi
This happened while running in qemu-system-aarch64, the AMBA PL011 UART driver when enabling CONFIG_DEBUG_TEST_DRIVER_REMOVE. arch_initcall(pl011_in
writeback: don't decrement wb->refcnt if !wb->bdi
This happened while running in qemu-system-aarch64, the AMBA PL011 UART driver when enabling CONFIG_DEBUG_TEST_DRIVER_REMOVE. arch_initcall(pl011_init) came before subsys_initcall(default_bdi_init), devtmpfs' handle_remove() crashes because the reference count is a NULL pointer only because wb->bdi hasn't been initialized yet.
Rework so that wb_put have an extra check if wb->bdi before decrement wb->refcnt and also add a WARN_ON_ONCE to get a warning if it happens again in other drivers.
Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks") Co-developed-by: Arnd Bergmann <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Anders Roxell <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
show more ...
|
| #
e58dd0de |
| 22-Aug-2018 |
Sebastian Andrzej Siewior <[email protected]> |
bdi: use refcount_t for reference counting instead atomic_t
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This permits a
bdi: use refcount_t for reference counting instead atomic_t
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This permits avoiding accidental refcounter overflows that might lead to use-after-free situations.
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
| #
3ee7e869 |
| 18-Jun-2018 |
Jan Kara <[email protected]> |
bdi: Fix another oops in wb_workfn()
syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was WB_shutting_down after wb
bdi: Fix another oops in wb_workfn()
syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was WB_shutting_down after wb->bdi->dev became NULL. This indicates that unregister_bdi() failed to call wb_shutdown() on one of wb objects.
The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus drops bdi's reference to wb structures before going through the list of wbs again and calling wb_shutdown() on each of them. This way the loop iterating through all wbs can easily miss a wb if that wb has already passed through cgwb_remove_from_bdi_list() called from wb_shutdown() from cgwb_release_workfn() and as a result fully shutdown bdi although wb_workfn() for this wb structure is still running. In fact there are also other ways cgwb_bdi_unregister() can race with cgwb_release_workfn() leading e.g. to use-after-free issues:
CPU1 CPU2 cgwb_bdi_unregister() cgwb_kill(*slot);
cgwb_release() queue_work(cgwb_release_wq, &wb->release_work); cgwb_release_workfn() wb = list_first_entry(&bdi->wb_list, ...) spin_unlock_irq(&cgwb_lock); wb_shutdown(wb); ... kfree_rcu(wb, rcu); wb_shutdown(wb); -> oops use-after-free
We solve these issues by synchronizing writeback structure shutdown from cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That way we also no longer need synchronization using WB_shutting_down as the mutex provides it for CONFIG_CGROUP_WRITEBACK case and without CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from bdi_unregister().
Reported-by: syzbot <[email protected]> Acked-by: Tejun Heo <[email protected]> Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
2e898e4c |
| 20-Apr-2018 |
Greg Thelen <[email protected]> |
writeback: safer lock nesting
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if the page's memcg is undergoing move accounting, which occurs when a process leaves its memcg fo
writeback: safer lock nesting
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if the page's memcg is undergoing move accounting, which occurs when a process leaves its memcg for a new one that has memory.move_charge_at_immigrate set.
unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the given inode is switching writeback domains. Switches occur when enough writes are issued from a new domain.
This existing pattern is thus suspicious: lock_page_memcg(page); unlocked_inode_to_wb_begin(inode, &locked); ... unlocked_inode_to_wb_end(inode, locked); unlock_page_memcg(page);
If both inode switch and process memcg migration are both in-flight then unlocked_inode_to_wb_end() will unconditionally enable interrupts while still holding the lock_page_memcg() irq spinlock. This suggests the possibility of deadlock if an interrupt occurs before unlock_page_memcg().
truncate __cancel_dirty_page lock_page_memcg unlocked_inode_to_wb_begin unlocked_inode_to_wb_end <interrupts mistakenly enabled> <interrupt> end_page_writeback test_clear_page_writeback lock_page_memcg <deadlock> unlock_page_memcg
Due to configuration limitations this deadlock is not currently possible because we don't mix cgroup writeback (a cgroupv2 feature) and memory.move_charge_at_immigrate (a cgroupv1 feature).
If the kernel is hacked to always claim inode switching and memcg moving_account, then this script triggers lockup in less than a minute:
cd /mnt/cgroup/memory mkdir a b echo 1 > a/memory.move_charge_at_immigrate echo 1 > b/memory.move_charge_at_immigrate ( echo $BASHPID > a/cgroup.procs while true; do dd if=/dev/zero of=/mnt/big bs=1M count=256 done ) & while true; do sync done & sleep 1h & SLEEP=$! while true; do echo $SLEEP > a/cgroup.procs echo $SLEEP > b/cgroup.procs done
The deadlock does not seem possible, so it's debatable if there's any reason to modify the kernel. I suggest we should to prevent future surprises. And Wang Long said "this deadlock occurs three times in our environment", so there's more reason to apply this, even to stable. Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch see "[PATCH for-4.4] writeback: safer lock nesting" https://lkml.org/lkml/2018/4/11/146
Wang Long said "this deadlock occurs three times in our environment"
[[email protected]: v4] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: comment tweaks, struct initialization simplification] Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613 Link: http://lkml.kernel.org/r/[email protected] Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates") Signed-off-by: Greg Thelen <[email protected]> Reported-by: Wang Long <[email protected]> Acked-by: Wang Long <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: <[email protected]> [v4.2+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|