|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7 |
|
| #
94eed61d |
| 11-Jan-2025 |
Kaixiong Yu <[email protected]> |
fs: fs-writeback: move sysctl to fs/fs-writeback.c
The dirtytime_expire_interval sysctl belongs to fs/fs-writeback.c, so move it there from kernel/sysctl.c, and remove the now-unneeded extern variable declaration and function declaration from include/linux/writeback.h.
Signed-off-by: Kaixiong Yu <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Joel Granados <[email protected]>
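For context, a minimal sketch of what such a sysctl move typically looks like, assuming the table is registered with register_sysctl_init() from fs/fs-writeback.c; the table contents and the init hook name below are illustrative assumptions, not the literal patch:

```c
/* Sketch only: table defined next to the variable it controls. */
static struct ctl_table vm_fs_writeback_table[] = {
	{
		.procname	= "dirtytime_expire_seconds",
		.data		= &dirtytime_expire_interval,
		.maxlen		= sizeof(dirtytime_expire_interval),
		.mode		= 0644,
		.proc_handler	= dirtytime_interval_handler,
	},
};

/* Hypothetical init hook registering the table under /proc/sys/vm. */
static int __init fs_writeback_sysctl_init(void)
{
	register_sysctl_init("vm", vm_fs_writeback_table);
	return 0;
}
subsys_initcall(fs_writeback_sysctl_init);
```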
|
|
Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12 |
|
| #
8182a8b3 |
| 12-Nov-2024 |
Christoph Hellwig <[email protected]> |
writeback: wbc_attach_fdatawrite_inode out of line
This allows exporting only this high-level interface while keeping wbc_attach_and_unlock_inode private to fs-writeback.c and unexporting __inode_attach_wb.
Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
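As a sketch of the out-of-line pattern (simplified body; the signature is assumed from the description above), the helper stops being a static inline in include/linux/writeback.h and becomes the exported entry point in fs/fs-writeback.c:

```c
/* fs/fs-writeback.c -- sketch, simplified */
void wbc_attach_fdatawrite_inode(struct writeback_control *wbc,
				 struct inode *inode)
{
	spin_lock(&inode->i_lock);
	wbc_attach_and_unlock_inode(wbc, inode);	/* stays private to this file */
}
EXPORT_SYMBOL_GPL(wbc_attach_fdatawrite_inode);
```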
|
| #
4d7485cf |
| 12-Nov-2024 |
Christoph Hellwig <[email protected]> |
writeback: add a __releases annotation to wbc_attach_and_unlock_inode
This shuts up a sparse lock context tracking warning.
Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
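For illustration, this is what such a sparse annotation looks like on a simplified prototype (the function body is omitted):

```c
/* __releases() tells sparse the function exits with inode->i_lock dropped. */
static void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
					struct inode *inode)
	__releases(&inode->i_lock)
{
	/* ... */
	spin_unlock(&inode->i_lock);
}
```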
|
|
Revision tags: v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1 |
|
| #
30dac24e |
| 26-Sep-2024 |
Pankaj Raghav <[email protected]> |
fs/writeback: convert wbc_account_cgroup_owner to take a folio
Most of the callers of wbc_account_cgroup_owner() are converting a folio to page before calling the function. wbc_account_cgroup_owner() is converting the page back to a folio to call mem_cgroup_css_from_folio().
Convert wbc_account_cgroup_owner() to take a folio instead of a page, and convert all callers to pass a folio directly except f2fs.
Convert the page to a folio for all the callers in f2fs, as they were the only callers passing a page to wbc_account_cgroup_owner(). Since f2fs is already in the process of converting to folios, these call sites may soon pass a folio directly as well.
No functional changes. Only compile tested.
Signed-off-by: Pankaj Raghav <[email protected]> Link: https://lore.kernel.org/r/[email protected] Acked-by: David Sterba <[email protected]> Acked-by: Tejun Heo <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
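Roughly, a call site changes as in the sketch below (surrounding code is illustrative; the byte-count argument is assumed unchanged):

```c
/* before: a folio was converted back to a page just to pass it in */
wbc_account_cgroup_owner(wbc, &folio->page, folio_size(folio));

/* after: the folio is passed straight through ... */
wbc_account_cgroup_owner(wbc, folio, folio_size(folio));

/* ... and f2fs-style callers wrap their page with page_folio() */
wbc_account_cgroup_owner(wbc, page_folio(page), PAGE_SIZE);
```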
|
|
Revision tags: v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5 |
|
| #
532980cb |
| 23-Aug-2024 |
Christian Brauner <[email protected]> |
inode: port __I_SYNC to var event
Port the __I_SYNC mechanism to the new wait_var_event mechanism.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
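A generic sketch of the wait_var_event()/wake_up_var() pattern being referred to; the actual wait address and helper names used by the patch differ, and the example_* functions below are hypothetical:

```c
#include <linux/fs.h>
#include <linux/wait_bit.h>

/* Waiter: sleep until I_SYNC is clear. */
static void example_wait_for_sync(struct inode *inode)
{
	wait_var_event(&inode->i_state,
		       !(READ_ONCE(inode->i_state) & I_SYNC));
}

/* Waker: called after clearing I_SYNC under inode->i_lock. */
static void example_wake_sync_waiters(struct inode *inode)
{
	wake_up_var(&inode->i_state);
}
```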
|
|
Revision tags: v6.11-rc4 |
|
| #
57510c58 |
| 13-Aug-2024 |
Mateusz Guzik <[email protected]> |
vfs: drop one lock trip in evict()
Most commonly neither I_LRU_ISOLATING nor I_SYNC is set, but the stock kernel takes a back-to-back relock trip to check for them.
It could probably be avoided altogether, but for now massage things back to just one lock acquire.
Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Zhihao Cheng <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.11-rc3, v6.11-rc2, v6.11-rc1 |
|
| #
c393eaa8 |
| 26-Jul-2024 |
Haifeng Xu <[email protected]> |
fs: don't flush in-flight wb switches for superblocks without cgroup writeback
When deactivating any type of superblock, we had to wait for in-flight wb switches to complete. wb switches are executed in inode_switch_wbs_work_fn(), which needs to acquire wb_switch_rwsem and races against sync_inodes_sb(). If there is a lot of dirty data in the superblock, the waiting time may increase significantly.
Superblocks without cgroup writeback, such as tmpfs, have nothing to do with wb switches, so the flushing can be avoided for them.
Signed-off-by: Haifeng Xu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Suggested-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
78eb4ea2 |
| 24-Jul-2024 |
Joel Granados <[email protected]> |
sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified.
This patch has been generated by the following coccinelle script:
```
virtual patch

@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int write, void *buffer, size_t *lenp, loff_t *ppos);

@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }

@r3@
identifier func;
@@

int func(
- struct ctl_table *
+ const struct ctl_table *
  ,int , void *, size_t *, loff_t *);

@r4@
identifier func, ctl;
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int , void *, size_t *, loff_t *);

@r5@
identifier func, write, buffer, lenp, ppos;
@@

int func(
- struct ctl_table *
+ const struct ctl_table *
  ,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax handlers were adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration.
Co-developed-by: Thomas Weißschuh <[email protected]> Signed-off-by: Thomas Weißschuh <[email protected]> Co-developed-by: Joel Granados <[email protected]> Signed-off-by: Joel Granados <[email protected]>
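For illustration, a handler written against the const-qualified signature the script converges on (the handler name is hypothetical):

```c
#include <linux/sysctl.h>

/* The first argument is now const, so the handler cannot modify the
 * ctl_table; this is what allows the tables to move into .rodata later. */
static int example_dointvec_handler(const struct ctl_table *table, int write,
				    void *buffer, size_t *lenp, loff_t *ppos)
{
	return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
}
```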
|
|
Revision tags: v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7 |
|
| #
6a1ee871 |
| 28-Feb-2024 |
Kemeng Shi <[email protected]> |
fs/writeback: remove unnecessary return in writeback_inodes_sb
writeback_inodes_sb() has no return value; remove the unnecessary return statement from it.
Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
ba679de9 |
| 28-Feb-2024 |
Kemeng Shi <[email protected]> |
fs/writeback: correct comment of __wakeup_flusher_threads_bdi
Commit e8e8a0c6c9bfc ("writeback: move nr_pages == 0 logic to one location") removed the nr_pages parameter of __wakeup_flusher_threads_bdi, which now writes back all dirty pages. Correct the stale comment accordingly.
Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
639924ab |
| 28-Feb-2024 |
Kemeng Shi <[email protected]> |
fs/writeback: only calculate dirtied_before when b_io is empty
dirtied_before is only used when b_io is empty, so only calculate it when b_io is empty.
Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
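A simplified sketch of the resulting shape in wb_writeback(); the details are assumptions, the point is only that the computation moves inside the list_empty() branch:

```c
/* dirtied_before is only consumed by queue_io(), which only runs when
 * b_io is empty, so compute it inside that branch. */
if (list_empty(&wb->b_io)) {
	unsigned long dirtied_before = jiffies;

	if (work->for_kupdate)
		dirtied_before = jiffies -
			msecs_to_jiffies(dirty_expire_interval * 10);
	queue_io(wb, work, dirtied_before);
}
```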
|
| #
2ddc9346 |
| 28-Feb-2024 |
Kemeng Shi <[email protected]> |
fs/writeback: remove unused parameter wb of finish_writeback_work
Remove unused parameter wb of finish_writeback_work.
Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Tim Chen <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
d9210989 |
| 28-Feb-2024 |
Kemeng Shi <[email protected]> |
fs/writeback: bail out if there is no more inodes for IO and queued once
If there are no more inodes for IO left in the io list from the last wb_writeback, we may bail out early even though inodes on the dirty list still need to be written back. Only bail out once the io list has been queued at least once, to avoid missing dirtied inodes.
This is from code reading...
Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> [[email protected]: fold in memory corruption fix from Jan in [1]] Link: https://lore.kernel.org/r/20240405132346.bid7gibby3lxxhez@quack3 [1] Signed-off-by: Christian Brauner <[email protected]>
|
| #
ac0c18f2 |
| 28-Feb-2024 |
Kemeng Shi <[email protected]> |
fs/writeback: avoid to writeback non-expired inode in kupdate writeback
In kupdate writeback, only expired inodes (dirty for longer than dirty_expire_interval) are supposed to be written back. However, kupdate writeback will also write back non-expired inodes left in b_io or b_more_io from the last wb_writeback. As a result, writeback keeps being triggered unexpectedly when we keep dirtying pages, even though dirty memory is under the threshold and the inode is not expired. To be more specific, assume the dirty background threshold is > 1G and dirty_expire_centisecs is > 60s. When running fio -size=1G -invalidate=0 -ioengine=libaio --time_based -runtime=60... (i.e. keep dirtying), writeback is first triggered as follows:

wb_workfn
 wb_do_writeback
  wb_check_background_flush
   /*
    * Wb dirty background threshold starts at 0 if the device was idle and
    * grows as the bandwidth of the wb is updated, so a background
    * writeback is triggered.
    */
   wb_over_bg_thresh
   /*
    * The dirtied inode is written back and added to the b_more_io list
    * after the slice is used up (because we keep dirtying the inode).
    */
   wb_writeback

Writeback is then triggered every dirty_writeback_centisecs as follows:

wb_workfn
 wb_do_writeback
  wb_check_old_data_flush
   /*
    * Write back the inode left in b_io and b_more_io from the last
    * wb_writeback even though it is non-expired; it will be added to
    * b_more_io again as the slice is used up (because we keep dirtying
    * the inode).
    */
   wb_writeback

Fix this by moving non-expired inodes to the dirty list instead of the more_io list for kupdate writeback in requeue_inode().

Test as follows:

/* make it easier to observe the issue */
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 100 > /proc/sys/vm/dirty_writeback_centisecs
/* create an idle device */
mkfs.ext4 -F /dev/vdb
mount /dev/vdb /bdi1/
/* run buffered writes with fio */
fio -name test -filename=/bdi1/file -size=800M -ioengine=libaio -bs=4K \
    -iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0
Fio result before fix (run three tests): 1360MB/s 1329MB/s 1455MB/s
Fio result after fix (run three tests): 1737MB/s 1729MB/s 1789MB/s
Writeback of non-expired inodes is gone as expected. Observe this with the writeback_start and writeback_written trace events as follows:

echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_start/enable
echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_written/enable
cat /sys/kernel/tracing/trace_pipe
Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1 |
|
| #
12f7900c |
| 18-Jan-2024 |
Kemeng Shi <[email protected]> |
writeback: move wb_wakeup_delayed definition to fs-writeback.c
wb_wakeup_delayed is only used in fs-writeback.c. Move it to fs-writeback.c after the definition of wb_wakeup and make it static.
Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4 |
|
| #
c9c4ff12 |
| 27-Nov-2023 |
David Howells <[email protected]> |
netfs: Move pinning-for-writeback from fscache to netfs
Move the resource pinning-for-writeback from fscache code to netfslib code. This is used to keep a cache backing object pinned whilst we have dirty pages on the netfs inode in the pagecache such that VM writeback will be able to reach it.
Whilst we're at it, switch the parameters of netfs_unpin_writeback() to match ->write_inode() so that it can be used for that directly.
Note that this mechanism could be more generically useful than that for network filesystems. Quite often they have to keep around other resources (e.g. authentication tokens or network connections) until the writeback is complete.
Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
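Matching the ->write_inode() parameters means the helper can be wired into super_operations directly; a sketch assuming the usual (inode, wbc) signature, with a hypothetical ops name:

```c
static const struct super_operations example_netfs_super_ops = {
	/* ... */
	.write_inode	= netfs_unpin_writeback,	/* now matches ->write_inode() */
};
```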
|
|
Revision tags: v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6 |
|
| #
6654408a |
| 14-Oct-2023 |
Jingbo Xu <[email protected]> |
writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
The cgwb cleanup routine will try to release the dying cgwb by switching the attached inodes. It fetches the attached inodes from the wb->b_attached list, omitting the fact that inodes with only dirty timestamps reside in the wb->b_dirty_time list, which is the case when lazytime is enabled. This causes enormous zombie memory cgroups when lazytime is enabled, as inodes with dirty timestamps cannot be switched to a live cgwb for a long time.
It is reasonable not to switch cgwb for inodes with dirty data, as otherwise it may break the bandwidth restrictions. However, since the writeback of inode metadata is not accounted for, let's also switch inodes with dirty timestamps to avoid zombie memory and block cgroups when lazytime is enabled.
Fixes: c22d70a162d3 ("writeback, cgroup: release dying cgwbs by switching attached inodes") Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Jingbo Xu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Acked-by: Tejun Heo <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2 |
|
| #
be049c3a |
| 16-Sep-2023 |
Chunhai Guo <[email protected]> |
fs-writeback: do not requeue a clean inode having skipped pages
When writing back an inode and performing an fsync on it concurrently, a deadlock issue may arise as shown below. In each writeback iteration, a clean inode is requeued to the wb->b_dirty queue due to non-zero pages_skipped, without anything actually being written. This causes an infinite loop and prevents the plug from being flushed, resulting in a deadlock. We now avoid requeuing the clean inode to prevent this issue.
wb_writeback                            fsync (inode-Y)
blk_start_plug(&plug)
for (;;) {
iter i-1: some reqs with page-X added
          into plug->mq_list
          // f2fs node page-X with PG_writeback
                                        filemap_fdatawrite
                                        __filemap_fdatawrite_range
                                        // write inode-Y with sync_mode WB_SYNC_ALL
                                        do_writepages
                                        f2fs_write_data_pages
                                        __f2fs_write_data_pages
                                        // wb_sync_req[DATA]++ for WB_SYNC_ALL
                                        f2fs_write_cache_pages
                                        f2fs_write_single_data_page
                                        f2fs_do_write_data_page
                                        f2fs_outplace_write_data
                                        f2fs_update_data_blkaddr
                                        f2fs_wait_on_page_writeback
                                        wait_on_page_writeback
                                        // wait for f2fs node page-X
iter i:   progress = __writeback_inodes_wb(wb, work)
          . writeback_sb_inodes
          . __writeback_single_inode
          //  write inode-Y with sync_mode WB_SYNC_NONE
          . . do_writepages
          . . f2fs_write_data_pages
          . . . __f2fs_write_data_pages
          //    skip writepages due to (wb_sync_req[DATA]>0)
          . . . wbc->pages_skipped += get_dirty_pages(inode)
          //    wbc->pages_skipped = 1
          . if (!(inode->i_state & I_DIRTY_ALL))
          //  i_state = I_SYNC | I_SYNC_QUEUED
          .   total_wrote++;  // total_wrote = 1
          . requeue_inode
          //  requeue inode-Y to wb->b_dirty queue due to non-zero pages_skipped
          if (progress)  // progress = 1
              continue;
iter i+1: queue_io
          // similar process with iter i, infinite for-loop !
}
blk_finish_plug(&plug)  // flush plug won't be called
Signed-off-by: Chunhai Guo <[email protected]> Reviewed-by: Jan Kara <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.6-rc1, v6.5, v6.5-rc7 |
|
| #
d8ce82ef |
| 18-Aug-2023 |
Christian Brauner <[email protected]> |
super: make locking naming consistent
Make the naming consistent with the earlier introduced super_lock_{read,write}() helpers.
Reviewed-by: Jan Kara <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3 |
|
| #
2816ea2a |
| 21-Apr-2023 |
Yosry Ahmed <[email protected]> |
writeback: move wb_over_bg_thresh() call outside lock section
Patch series "cgroup: eliminate atomic rstat flushing", v5.
A previous patch series [1] changed most atomic rstat flushing contexts to become non-atomic. This was done to avoid an expensive operation that scales with # cgroups and # cpus to happen with irqs disabled and scheduling not permitted. There were two remaining atomic flushing contexts after that series. This series tries to eliminate them as well, eliminating atomic rstat flushing completely.
The two remaining atomic flushing contexts are: (a) wb_over_bg_thresh()->mem_cgroup_wb_stats() (b) mem_cgroup_threshold()->mem_cgroup_usage()
For (a), flushing needs to be atomic as wb_writeback() calls wb_over_bg_thresh() with a spinlock held. However, it seems like the call to wb_over_bg_thresh() doesn't need to be protected by that spinlock, so this series proposes a refactoring that moves the call outside the lock critical section and makes the stats flushing in mem_cgroup_wb_stats() non-atomic.
For (b), flushing needs to be atomic as mem_cgroup_threshold() is called with irqs disabled. We only flush the stats when calculating the root usage, as it is approximated as the sum of some memcg stats (file, anon, and optionally swap) instead of the conventional page counter. This series proposes changing this calculation to use the global stats instead, eliminating the need for a memcg stat flush.
After these 2 contexts are eliminated, we no longer need mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic(). We can remove them and simplify the code.
[1] https://lore.kernel.org/linux-mm/[email protected]/
This patch (of 5):
wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat flush, which can be expensive on large systems. Currently, wb_writeback() calls wb_over_bg_thresh() within a lock section, so we have to do the rstat flush atomically. On systems with a lot of cpus and/or cgroups, this can cause us to disable irqs for a long time, potentially causing problems.
Move the call to wb_over_bg_thresh() outside the lock section in preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic. The list_empty(&wb->work_list) check should be okay outside the lock section of wb->list_lock as it is protected by a separate lock (wb->work_lock), and wb_over_bg_thresh() doesn't seem to modify any of the wb->b_* lists that wb->list_lock is protecting. Also, the loop already releases and reacquires the lock, so this refactoring looks safe.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yosry Ahmed <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
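A simplified sketch of the reordering in wb_writeback() (loop body abbreviated); the only point is that the threshold check no longer runs under wb->list_lock:

```c
for (;;) {
	/* ... */

	/* Checked before taking wb->list_lock, so the memcg stats flush
	 * inside wb_over_bg_thresh() no longer has to be atomic. */
	if (work->for_background && !wb_over_bg_thresh(wb))
		break;

	spin_lock(&wb->list_lock);
	/* queue and write back inodes ... */
	spin_unlock(&wb->list_lock);
}
```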
|
|
Revision tags: v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5 |
|
| #
3e46c89c |
| 19-Jan-2023 |
Maxim Korotkov <[email protected]> |
writeback: fix call of incorrect macro
The variable 'history' is of type u16, so using the hweight32 macro on it looks like a mistake; the hweight16 macro should be used instead.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 2a81490811d0 ("writeback: implement foreign cgroup inode detection") Signed-off-by: Maxim Korotkov <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
| #
1ba1199e |
| 10-Apr-2023 |
Baokun Li <[email protected]> |
writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
KASAN report null-ptr-deref:

==================================================================
BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
Write of size 8 at addr 0000000000000000 by task sync/943
CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
Call Trace:
 <TASK>
 dump_stack_lvl+0x7f/0xc0
 print_report+0x2ba/0x340
 kasan_report+0xc4/0x120
 kasan_check_range+0x1b7/0x2e0
 __kasan_check_write+0x24/0x40
 bdi_split_work_to_wbs+0x5c5/0x7b0
 sync_inodes_sb+0x195/0x630
 sync_inodes_one_sb+0x3a/0x50
 iterate_supers+0x106/0x1b0
 ksys_sync+0x98/0x160
 [...]
==================================================================
The race that causes the above issue is as follows:
cpu1                                cpu2
-------------------------|-------------------------
inode_switch_wbs
 INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
 queue_rcu_work(isw_wq, &isw->work)
 // queue_work async
 inode_switch_wbs_work_fn
  wb_put_many(old_wb, nr_switched)
   percpu_ref_put_many
    ref->data->release(ref)
     cgwb_release
      queue_work(cgwb_release_wq, &wb->release_work)
      // queue_work async
 &wb->release_work
 cgwb_release_workfn
                                    ksys_sync
                                     iterate_supers
                                      sync_inodes_one_sb
                                       sync_inodes_sb
                                        bdi_split_work_to_wbs
                                         kmalloc(sizeof(*work), GFP_ATOMIC)
                                         // alloc memory failed
  percpu_ref_exit
   ref->data = NULL
   kfree(data)
                                         wb_get(wb)
                                          percpu_ref_get(&wb->refcnt)
                                           percpu_ref_get_many(ref, 1)
                                            atomic_long_add(nr, &ref->data->count)
                                             atomic64_add(i, v)
                                             // trigger null-ptr-deref
bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all wbs. If the allocation of new work fails, the on-stack fallback will be used and the reference count of the current wb is increased afterwards. If cgroup writeback membership switches occur before getting the reference count and the current wb is released as old_wb, then calling wb_get() or wb_put() will trigger the null pointer dereference above.
This issue was introduced in v4.3-rc7 (see fix tag1). Both sync_inodes_sb() and __writeback_inodes_sb_nr() calls to bdi_split_work_to_wbs() can trigger this issue. For scenarios called via sync_inodes_sb(), originally commit 7fc5854f8c6e ("writeback: synchronize sync(2) against cgroup writeback membership switches") reduced the possibility of the issue by adding wb_switch_rwsem, but in v5.14-rc1 (see fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io, thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(), and the issue becomes easily reproducible again.
To solve this problem, percpu_ref_exit() is called under RCU protection to avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(), and skip the current wb if wb_tryget() fails because the wb has already been shutdown.
Link: https://lkml.kernel.org/r/[email protected] Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones") Signed-off-by: Baokun Li <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Hou Tao <[email protected]> Cc: yangerkun <[email protected]> Cc: Zhang Yi <[email protected]> Cc: Jens Axboe <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
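The wb_tryget() half of the fix amounts to the sketch below inside bdi_split_work_to_wbs() (loop details abbreviated and assumed):

```c
/* Take a reference with wb_tryget(); if that fails the wb is already
 * being shut down, so skip it instead of touching freed refcount data. */
list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
	/* ... */
	if (!wb_tryget(wb))
		continue;
	/* ... split work against this wb, dropping the ref with wb_put() later ... */
}
```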
|
| #
75376c6f |
| 16-Jan-2023 |
Matthew Wilcox (Oracle) <[email protected]> |
mm: convert mem_cgroup_css_from_page() to mem_cgroup_css_from_folio()
Only one caller doesn't have a folio, so move the page_folio() call to that one caller from mem_cgroup_css_from_folio().
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
9cfb816b |
| 16-Jan-2023 |
Matthew Wilcox (Oracle) <[email protected]> |
mm/fs: convert inode_attach_wb() to take a folio
Patch series "Writeback folio conversions".
Remove more calls to compound_head() by passing folios around instead of pages.
This patch (of 2):
The only caller of inode_attach_wb() which doesn't pass NULL already has a folio, so convert the whole call-chain to take folios.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
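A sketch of the converted inline wrapper, assuming it keeps its current shape and only the parameter type changes:

```c
/* include/linux/writeback.h -- sketch after the conversion */
static inline void inode_attach_wb(struct inode *inode, struct folio *folio)
{
	if (!inode->i_wb)
		__inode_attach_wb(inode, folio);
}
```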
|
|
Revision tags: v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1 |
|
| #
23e188a1 |
| 10-Dec-2022 |
Miaohe Lin <[email protected]> |
writeback: remove obsolete macro EXPIRE_DIRTY_ATIME
EXPIRE_DIRTY_ATIME is not used anymore. Remove it.
Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|