|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14 |
|
| #
c918f154 |
| 20-Mar-2025 |
Mateusz Guzik <[email protected]> |
fs: call inode_sb_list_add() outside of inode hash lock
As both locks are highly contended during significant inode churn, holding the inode hash lock while waiting for the sb list lock exacerbates the problem.
Why moving it out is safe: the inode at hand still has I_NEW set and anyone who finds it through legitimate means waits for the bit to clear, by which time inode_sb_list_add() is guaranteed to have finished.
This significantly drops hash lock contention for me when stating 20 separate trees in parallel, each with 1000 directories * 1000 files.
However, no speed up was observed as contention increased on the other locks, notably dentry LRU.
Even so, removing the lock ordering will help make this faster later.
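The ordering change can be pictured with a small userspace model. This is only a sketch under stated assumptions: `toy_inode`, `toy_insert()`, `hash_lock`, `sb_list_lock` and the `is_new` flag are hypothetical stand-ins, not the kernel's definitions, and pthread mutexes replace spinlocks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these names are the kernel's. */
struct toy_inode {
    atomic_bool is_new;          /* plays the role of I_NEW */
    struct toy_inode *hash_next; /* hash chain linkage */
    struct toy_inode *sb_next;   /* per-superblock list linkage */
};

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sb_list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct toy_inode *hash_head;
static struct toy_inode *sb_head;

static void toy_sb_list_add(struct toy_inode *inode)
{
    pthread_mutex_lock(&sb_list_lock);
    inode->sb_next = sb_head;
    sb_head = inode;
    pthread_mutex_unlock(&sb_list_lock);
}

/* Publish in the hash while marked new, then take the second lock alone. */
static void toy_insert(struct toy_inode *inode)
{
    atomic_store(&inode->is_new, true);

    pthread_mutex_lock(&hash_lock);
    inode->hash_next = hash_head;
    hash_head = inode;
    pthread_mutex_unlock(&hash_lock);    /* hash lock dropped first */

    toy_sb_list_add(inode);              /* not nested under hash_lock */

    atomic_store(&inode->is_new, false); /* lookups may now proceed */
}

/* A lookup that found the inode must not use it until is_new clears. */
static bool toy_usable(struct toy_inode *inode)
{
    return !atomic_load(&inode->is_new);
}
```

Because the flag is cleared only after `toy_sb_list_add()` returns, any lookup that waits on it sees the inode already on both lists, which is the safety argument the message makes.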
Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
| #
5a607aa9 |
| 19-Mar-2025 |
Mateusz Guzik <[email protected]> |
fs: load the ->i_sb pointer once in inode_sb_list_{add,del}
While this may sound like a pedantic clean up, it does in fact impact code generation -- the patched add routine is slightly smaller.
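A minimal sketch of the pattern, using hypothetical `toy_*` types and a pthread mutex in place of the kernel's structures and spinlock. The likely reason code generation changes is that a compiler generally cannot assume `inode->sb` is unchanged across an opaque call such as a lock operation, so it may reload the field each time; a local copy removes that need.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for the superblock and inode. */
struct toy_sb {
    pthread_mutex_t list_lock;
    int nr_inodes;
};

struct toy_inode {
    struct toy_sb *sb;   /* stands in for ->i_sb */
};

static void toy_sb_list_add(struct toy_inode *inode)
{
    struct toy_sb *sb = inode->sb;   /* single load, reused below */

    pthread_mutex_lock(&sb->list_lock);
    sb->nr_inodes++;
    pthread_mutex_unlock(&sb->list_lock);
}
```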
Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
|
| #
eb7e453a |
| 17-Mar-2025 |
Mateusz Guzik <[email protected]> |
fs: drop the lock trip around I_NEW wake up in evict()
The unhashed state check in __wait_on_freeing_inode() performed with ->i_lock held against remove_hash_inode() also holding the lock makes another lock acquire in evict() completely spurious -- all potential sleepers already dropped the lock before remove_hash_inode() acquired it or they found the inode to be unhashed and aborted.
Note there is no trickery here: the usual cost of both sides taking locks is still being paid, it just stops being paid twice.
Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.14-rc7, v6.14-rc6 |
|
| #
93fd0d46 |
| 07-Mar-2025 |
Jan Kara <[email protected]> |
vfs: Remove invalidate_inodes()
The function can be replaced by evict_inodes. The only difference is that evict_inodes() skips the inodes with positive refcount without touching ->i_lock, but they are equivalent as evict_inodes() repeats the refcount check after having grabbed ->i_lock.
Signed-off-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3 |
|
| #
1479be62 |
| 12-Feb-2025 |
Mateusz Guzik <[email protected]> |
vfs: inline new_inode_pseudo() and de-staticize alloc_inode()
The former is a no-op wrapper with the same argument.
I left it in place to not lose the information who needs it -- one day "pseudo" inodes may start differing from what alloc_inode() returns.
In the meantime no point taking a detour.
Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.14-rc2 |
|
| #
8b17e540 |
| 09-Feb-2025 |
Mateusz Guzik <[email protected]> |
vfs: add initial support for CONFIG_DEBUG_VFS
Small collection of macros taken from mmdebug.h
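As an illustration of the mmdebug.h style of macro, here is a userspace sketch. `TOY_DEBUG_VFS` and `TOY_VFS_BUG_ON` are hypothetical names chosen for this example, not the macros the commit adds. The disabled branch follows the familiar trick of type-checking the condition inside `sizeof` so that it is never evaluated.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* TOY_DEBUG_VFS is a hypothetical switch standing in for CONFIG_DEBUG_VFS. */
#ifdef TOY_DEBUG_VFS
#define TOY_VFS_BUG_ON(cond)                                        \
    do {                                                            \
        if (cond) {                                                 \
            fprintf(stderr, "BUG: %s at %s:%d\n",                   \
                    #cond, __FILE__, __LINE__);                     \
            abort();                                                \
        }                                                           \
    } while (0)
#else
/* Type-check the expression without evaluating it or emitting code. */
#define TOY_VFS_BUG_ON(cond) ((void)sizeof((long)(cond)))
#endif

/* Example user: drop a reference, asserting it was held. */
static int toy_put(int *refcount)
{
    TOY_VFS_BUG_ON(*refcount <= 0);
    return --*refcount;
}
```

With the switch undefined the check costs nothing at run time, yet a typo in the condition still fails to compile.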
Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.14-rc1 |
|
| #
1751f872 |
| 28-Jan-2025 |
Joel Granados <[email protected]> |
treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/infiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers.
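The effect of const-qualifying a table of handlers can be sketched in plain C. `toy_ctl_table`, `toy_handler` and `toy_call()` are hypothetical and much simpler than the real sysctl types. Whether the array actually lands in read-only memory depends on the toolchain, but the compiler rejects writes through it either way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*toy_handler)(int value);

/* Hypothetical, much-reduced stand-in for struct ctl_table. */
struct toy_ctl_table {
    const char *procname;
    toy_handler handler;
};

static int toy_double(int value) { return 2 * value; }
static int toy_negate(int value) { return -value; }

/* const-qualified: an assignment such as toy_table[0].handler = ...
 * no longer compiles, and the array can live in read-only data. */
static const struct toy_ctl_table toy_table[] = {
    { .procname = "double", .handler = toy_double },
    { .procname = "negate", .handler = toy_negate },
};

/* Registration-style consumer that only needs read access. */
static int toy_call(const struct toy_ctl_table *table, size_t n,
                    const char *name, int value)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(table[i].procname, name) == 0)
            return table[i].handler(value);
    return 0;
}
```

A consumer that takes a non-const pointer could not be passed `toy_table` without a cast, which mirrors why the message excludes tables handed to registration functions with non-const arguments.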
Created this by running an spatch followed by a sed command:

Spatch:
    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
      watchdog_hardlockup_sysctl,
      iwcm_ctl_table,
      ucma_ctl_table,
      memory_allocation_profiling_sysctls,
      loadpin_sysctl_table
    };
    @@

    + const
    struct ctl_table table_name [] = { ... };

sed:
    sed --in-place \
      -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
      kernel/utsname_sysctl.c
Reviewed-by: Song Liu <[email protected]>
Acked-by: Steven Rostedt (Google) <[email protected]> # for kernel/trace/
Reviewed-by: Martin K. Petersen <[email protected]> # SCSI
Reviewed-by: Darrick J. Wong <[email protected]> # xfs
Acked-by: Jani Nikula <[email protected]>
Acked-by: Corey Minyard <[email protected]>
Acked-by: Wei Liu <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Reviewed-by: Bill O'Donnell <[email protected]>
Acked-by: Baoquan He <[email protected]>
Acked-by: Ashutosh Dixit <[email protected]>
Acked-by: Anna Schumaker <[email protected]>
Signed-off-by: Joel Granados <[email protected]>
|
|
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12 |
|
| #
9fed2c0f |
| 13-Nov-2024 |
Jeff Layton <[email protected]> |
fs: reduce pointer chasing in is_mgtime() test
The is_mgtime test checks whether the FS_MGTIME flag is set in the fstype. To get there from the inode though, we have to dereference 3 pointers.
Add a new IOP_MGTIME flag, and have inode_init_always() set that flag when the fstype flag is set. Then, make is_mgtime test for IOP_MGTIME instead.
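The trade described here, caching a filesystem-type flag on the inode so the test touches one object instead of following a pointer chain, can be sketched as follows. All `toy_*` names and the flag values are hypothetical; only the shape of the idea is taken from the message.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FS_MGTIME  0x1u   /* flag on the filesystem type */
#define TOY_IOP_MGTIME 0x1u   /* cached copy on the inode */

struct toy_fstype { unsigned int fs_flags; };
struct toy_sb     { struct toy_fstype *type; };
struct toy_inode  { struct toy_sb *sb; unsigned int opflags; };

/* Slow form: inode -> sb -> type -> flags. */
static bool toy_is_mgtime_slow(const struct toy_inode *inode)
{
    return inode->sb->type->fs_flags & TOY_FS_MGTIME;
}

/* Done once at init time, the role inode_init_always() plays in the commit. */
static void toy_inode_init(struct toy_inode *inode, struct toy_sb *sb)
{
    inode->sb = sb;
    inode->opflags = 0;
    if (sb->type->fs_flags & TOY_FS_MGTIME)
        inode->opflags |= TOY_IOP_MGTIME;
}

/* Fast form: one field on the inode itself. */
static bool toy_is_mgtime(const struct toy_inode *inode)
{
    return inode->opflags & TOY_IOP_MGTIME;
}
```

The cached bit is only valid as long as the type flag never changes after initialization, which is an assumption of this sketch.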
Signed-off-by: Jeff Layton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
| #
45c9faf5 |
| 13-Nov-2024 |
Mateusz Guzik <[email protected]> |
vfs: make evict() use smp_mb__after_spinlock instead of smp_mb
It literally directly follows a spin_lock() call.
This whacks an explicit barrier on x86-64.
Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.12-rc7 |
|
| #
da0c0251 |
| 04-Nov-2024 |
Kairui Song <[email protected]> |
mm/list_lru: simplify the list_lru walk callback function
Isolation no longer takes the list_lru global node lock; it uses only the per-cgroup lock. Since that lock lives inside the list_lru_one being walked, there is no longer any need to pass it explicitly.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| #
fb56fdf8 |
| 04-Nov-2024 |
Kairui Song <[email protected]> |
mm/list_lru: split the lock to per-cgroup scope
Currently, every list_lru has a per-node lock that protects adding, deletion, isolation, and reparenting of all list_lru_one instances belonging to this list_lru on this node. This lock contention is heavy when multiple cgroups modify the same list_lru.
This lock can be split into per-cgroup scope to reduce contention.
To achieve this, we need a stable list_lru_one for every cgroup. This commit adds a lock to each list_lru_one and introduces a helper function, lock_list_lru_of_memcg, making it possible to pin the list_lru of a memcg. It then reworks the reparenting process.
Reparenting will switch the list_lru_one instances one by one. By locking each instance and marking it dead using the nr_items counter, reparenting ensures that all items in the corresponding cgroup (on-list or not, because items have a stable cgroup, see below) will see the list_lru_one switch synchronously.
Objcg reparent is also moved after list_lru reparent so items will have a stable mem cgroup until all list_lru_one instances are drained.
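The "lock each instance and mark it dead through the counter" idea can be modelled in userspace roughly as below. This is a sketch under assumptions: the `toy_*` names are hypothetical, `LONG_MIN` as the dead marker is this example's choice of sentinel, and pthread mutexes stand in for the kernel's locks.

```c
#include <assert.h>
#include <limits.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model: one lock per list, LONG_MIN in nr_items marks it dead. */
struct toy_lru_one {
    pthread_mutex_t lock;
    long nr_items;
    struct toy_lru_one *parent;   /* where items go after reparenting */
};

/* Returns the first live list on the way up, locked, or NULL if none. */
static struct toy_lru_one *toy_lock_live(struct toy_lru_one *l)
{
    while (l) {
        pthread_mutex_lock(&l->lock);
        if (l->nr_items != LONG_MIN)
            return l;
        pthread_mutex_unlock(&l->lock);
        l = l->parent;
    }
    return NULL;
}

/* Move the child's count into the parent and mark the child dead. */
static void toy_reparent(struct toy_lru_one *child)
{
    struct toy_lru_one *parent = child->parent;

    pthread_mutex_lock(&child->lock);
    pthread_mutex_lock(&parent->lock);   /* child before parent, always */
    parent->nr_items += child->nr_items;
    child->nr_items = LONG_MIN;
    pthread_mutex_unlock(&parent->lock);
    pthread_mutex_unlock(&child->lock);
}

/* Adds land on whichever list is currently live for this cgroup. */
static void toy_add(struct toy_lru_one *l)
{
    l = toy_lock_live(l);
    if (l) {
        l->nr_items++;
        pthread_mutex_unlock(&l->lock);
    }
}
```

An add that races with reparenting either completes on the child before it is marked dead or observes the marker and retries on the parent, so no item is counted on a dead list.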
The only callers that do not use the *_obj interfaces are direct calls to list_lru_{add,del}. These are only used by zswap, which is also based on objcg, so that is fine.
This also changes the behaviour of the isolation function when LRU_RETRY or LRU_REMOVED_RETRY is returned: because releasing the lock could now unblock reparenting and free the list_lru_one, the isolation function has to return without re-locking the lru.
prepare() {
    mkdir /tmp/test-fs
    modprobe brd rd_nr=1 rd_size=33554432
    mkfs.xfs -f /dev/ram0
    mount -t xfs /dev/ram0 /tmp/test-fs
    for i in $(seq 1 512); do
        mkdir "/tmp/test-fs/$i"
        for j in $(seq 1 10240); do
            echo TEST-CONTENT > "/tmp/test-fs/$i/$j"
        done &
    done; wait
}

do_test() {
    read_worker() {
        sleep 1
        tar -cv "$1" &>/dev/null
    }
    read_in_all() {
        cd "/tmp/test-fs" && ls
        for i in $(seq 1 512); do
            (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
            read_worker "$i" &
        done; wait
    }
    for i in $(seq 1 512); do
        mkdir -p "/sys/fs/cgroup/benchmark/$i"
    done
    echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
    echo 512M > /sys/fs/cgroup/benchmark/memory.max
    echo 3 > /proc/sys/vm/drop_caches
    time read_in_all
}
Above script simulates compression of small files in multiple cgroups with memory pressure. Run prepare() then do_test for 6 times:
Before:
    real 0m7.762s   user 0m11.340s   sys 3m11.224s
    real 0m8.123s   user 0m11.548s   sys 3m2.549s
    real 0m7.736s   user 0m11.515s   sys 3m11.171s
    real 0m8.539s   user 0m11.508s   sys 3m7.618s
    real 0m7.928s   user 0m11.349s   sys 3m13.063s
    real 0m8.105s   user 0m11.128s   sys 3m14.313s

After this commit (about ~15% faster):
    real 0m6.953s   user 0m11.327s   sys 2m42.912s
    real 0m7.453s   user 0m11.343s   sys 2m51.942s
    real 0m6.916s   user 0m11.269s   sys 2m43.957s
    real 0m6.894s   user 0m11.528s   sys 2m45.346s
    real 0m6.911s   user 0m11.095s   sys 2m43.168s
    real 0m6.773s   user 0m11.518s   sys 2m40.774s
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3 |
|
| #
80d3ab22 |
| 08-Oct-2024 |
Andrew Kreimer <[email protected]> |
fs/inode: Fix a typo
Fix a typo in comments: wether -> whether.
Signed-off-by: Andrew Kreimer <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.12-rc2 |
|
| #
c2986387 |
| 04-Oct-2024 |
Andreas Gruenbacher <[email protected]> |
vfs: inode insertion kdoc corrections
Some minor corrections to the inode_insert5 and iget5_locked kernel documentation.
Signed-off-by: Andreas Gruenbacher <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
| #
73a47cf4 |
| 02-Oct-2024 |
Jeff Layton <[email protected]> |
fs: add percpu counters for significant multigrain timestamp events
New percpu counters for counting various stats around multigrain timestamp events, and a new debugfs file for displaying them when CONFIG_DEBUG_FS is enabled:
- number of attempted ctime updates
- number of successful i_ctime_nsec swaps
- number of fine-grained timestamp fetches
- number of floor value swap events
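The general shape of a per-CPU counter, cheap increments on a private slot and a sum over all slots when the statistic is read, can be sketched in userspace as below. The `toy_*` names and the fixed slot count are assumptions of this example; the kernel's percpu counter API is not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>

#define TOY_NR_SLOTS 4   /* stand-in for the number of CPUs */

/* One slot per "CPU"; the alignment keeps slots on separate cache lines. */
struct toy_slot {
    _Alignas(64) atomic_ulong count;
};

/* Hypothetical counter for attempted ctime updates. */
static struct toy_slot toy_ctime_updates[TOY_NR_SLOTS];

/* Hot path: touch only the caller's slot. */
static void toy_counter_inc(struct toy_slot *c, unsigned int slot)
{
    atomic_fetch_add_explicit(&c[slot % TOY_NR_SLOTS].count, 1,
                              memory_order_relaxed);
}

/* Cold path (for example a stats file read): sum every slot. */
static unsigned long toy_counter_sum(struct toy_slot *c)
{
    unsigned long sum = 0;

    for (int i = 0; i < TOY_NR_SLOTS; i++)
        sum += atomic_load_explicit(&c[i].count, memory_order_relaxed);
    return sum;
}
```

The sum is only approximate while updates are in flight, which is usually acceptable for debugging statistics of this kind.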
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Tested-by: Randy Dunlap <[email protected]> # documentation bits
Signed-off-by: Jeff Layton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
|
| #
c86e3c47 |
| 02-Oct-2024 |
Jeff Layton <[email protected]> |
fs: tracepoints around multigrain timestamp events
Add some tracepoints around various multigrain timestamp events.
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Tested-by: Randy Dunlap <[email protected]> # documentation bits
Signed-off-by: Jeff Layton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
|
| #
7f2c86cb |
| 02-Oct-2024 |
Jeff Layton <[email protected]> |
fs: handle delegated timestamps in setattr_copy_mgtime
An update to the inode ctime typically requires the latest clock value possible. The exception to this rule is when there is a nfsd write delegation and the server is proxying timestamps from the client.
When nfsd gets a CB_GETATTR response, update the timestamp value in the inode to the values that the client is tracking. The client doesn't send a ctime value (since that's always determined by the exported filesystem), but it can send a mtime value. In the case where it does, update the ctime to a value commensurate with that instead of the current time.
If ATTR_DELEG is set, then use ia_ctime value instead of setting the timestamp to the current time.
With the addition of delegated timestamps, the server may receive a request to update only the atime, which doesn't involve a ctime update. Trust the ATTR_CTIME flag in the update and only update the ctime when it's set.
Tested-by: Randy Dunlap <[email protected]> # documentation bits
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.12-rc1 |
|
| #
9897713f |
| 26-Sep-2024 |
Michal Hocko <[email protected]> |
bcachefs: do not use PF_MEMALLOC_NORECLAIM
Patch series "remove PF_MEMALLOC_NORECLAIM" v3.
This patch (of 2):
bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new inode to achieve GFP_NOWAIT semantic while holding locks. If this allocation fails it will drop locks and use GFP_NOFS allocation context.
We would like to drop PF_MEMALLOC_NORECLAIM because it is really dangerous to use if the caller doesn't control the full call chain with this flag set. E.g. if any function down the chain needed a GFP_NOFAIL request, PF_MEMALLOC_NORECLAIM would override it and cause an unexpected failure.
While this is not the case in this particular instance, using the scoped gfp semantic is not really needed because we can easily push the allocation context down the chain without too much clutter.
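Passing the allocation context explicitly, instead of relying on a per-task flag that silently affects every callee, looks roughly like this in a userspace model. `toy_gfp`, `toy_alloc_inode()` and `toy_new_inode_locked()` are hypothetical names; `calloc()` stands in for the real allocator and a test knob simulates memory pressure.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical allocation contexts, loosely modelled on GFP_NOWAIT/GFP_NOFS. */
enum toy_gfp { TOY_NOWAIT, TOY_MAY_BLOCK };

static bool toy_nowait_fails;   /* test knob: simulate memory pressure */

/* The context is an explicit argument rather than hidden per-task state. */
static void *toy_alloc_inode(enum toy_gfp gfp)
{
    if (gfp == TOY_NOWAIT && toy_nowait_fails)
        return NULL;
    return calloc(1, 64);
}

static pthread_mutex_t toy_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called with toy_lock held; returns with it held. */
static void *toy_new_inode_locked(void)
{
    void *inode = toy_alloc_inode(TOY_NOWAIT);   /* cheap attempt under lock */

    if (!inode) {
        pthread_mutex_unlock(&toy_lock);         /* drop lock before blocking */
        inode = toy_alloc_inode(TOY_MAY_BLOCK);
        pthread_mutex_lock(&toy_lock);
    }
    return inode;
}
```

Because only the one call that must not block receives `TOY_NOWAIT`, nothing further down the chain has its own allocation requirements overridden.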
[[email protected]: fix kerneldoc warnings]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Reviewed-by: Jan Kara <[email protected]> # For vfs changes
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: James Morris <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Paul Moore <[email protected]>
Cc: Serge E. Hallyn <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| #
4e40eff0 |
| 02-Oct-2024 |
Jeff Layton <[email protected]> |
fs: add infrastructure for multigrain timestamps
The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g. backup applications).
If fine-grained timestamps were always used, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates.
What is needed is a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, allow the update to use a fine-grained timestamp iff it's necessary to make the ctime show a different value.
If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, accept that value. If it isn't, then get a fine-grained timestamp and attempt to stamp the inode ctime with that value. If that races with another concurrent stamp, then abandon the update and take the new value without retrying.
Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems).
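The query-flag scheme can be sketched with C11 atomics. This is a simplified model, not the kernel implementation: the whole timestamp is folded into one 64-bit nanosecond value whose top bit is the flag (the commit uses the top bit of i_ctime_nsec), the floor value is ignored, and every `toy_*` name, including the fake fine-grained clock, is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_QUERIED (UINT64_C(1) << 63)   /* top bit: "someone looked" */

struct toy_inode { _Atomic uint64_t ctime; };   /* ns, plus flag bit */

static uint64_t toy_fine_now;       /* value the "expensive" clock returns */
static unsigned int toy_fine_calls; /* how often it was consulted */

static uint64_t toy_fine_clock(void)
{
    toy_fine_calls++;
    return toy_fine_now;
}

/* stat()-like read: fetch the value and mark it as queried. */
static uint64_t toy_getattr(struct toy_inode *inode)
{
    return atomic_fetch_or(&inode->ctime, TOY_QUERIED) & ~TOY_QUERIED;
}

/* Update after a change; coarse is the cheap clock reading. */
static uint64_t toy_update_ctime(struct toy_inode *inode, uint64_t coarse)
{
    uint64_t cur = atomic_load(&inode->ctime);
    uint64_t now = coarse;

    /* Pay for the fine-grained clock only if the value was queried and
     * the coarse time would not produce a visibly later timestamp. */
    if ((cur & TOY_QUERIED) && coarse <= (cur & ~TOY_QUERIED))
        now = toy_fine_clock();

    /* Storing now also clears the flag. On a lost race, take the
     * winner's value without retrying. */
    if (atomic_compare_exchange_strong(&inode->ctime, &cur, now))
        return now;
    return cur & ~TOY_QUERIED;   /* cur now holds the winner's value */
}
```

In this model an inode that nobody stats never consults the fine-grained clock, which is the cost saving the message is after.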
Tested-by: Randy Dunlap <[email protected]> # documentation bits
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3 |
|
| #
88d2ae0e |
| 08-Aug-2024 |
Kent Overstreet <[email protected]> |
inode: make __iget() a static inline
bcachefs is switching to an rhashtable for vfs inodes instead of the standard inode.c hashtable, so we need this exported, or - a static inline makes more sense for a single atomic_inc().
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
7f7b8506 |
| 26-Aug-2024 |
Li Zhijian <[email protected]> |
fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name
A crash was observed while hot-removing a memory device that a user was accessing through hugetlb. The call trace follows:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 14045 at arch/x86/mm/fault.c:1278 do_user_addr_fault+0x2a0/0x790
Modules linked in: kmem device_dax cxl_mem cxl_pmem cxl_port cxl_pci dax_hmem dax_pmem nd_pmem cxl_acpi nd_btt cxl_core crc32c_intel nvme virtiofs fuse nvme_core nfit libnvdimm dm_multipath scsi_dh_rdac scsi_dh_emc s mirror dm_region_hash dm_log dm_mod
CPU: 1 PID: 14045 Comm: daxctl Not tainted 6.10.0-rc2-lizhijian+ #492
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:do_user_addr_fault+0x2a0/0x790
Code: 48 8b 00 a8 04 0f 84 b5 fe ff ff e9 1c ff ff ff 4c 89 e9 4c 89 e2 be 01 00 00 00 bf 02 00 00 00 e8 b5 ef 24 00 e9 42 fe ff ff <0f> 0b 48 83 c4 08 4c 89 ea 48 89 ee 4c 89 e7 5b 5d 41 5c 41 5d 41
RSP: 0000:ffffc90000a575f0 EFLAGS: 00010046
RAX: ffff88800c303600 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000001000 RSI: ffffffff82504162 RDI: ffffffff824b2c36
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90000a57658
R13: 0000000000001000 R14: ffff88800bc2e040 R15: 0000000000000000
FS: 00007f51cb57d880(0000) GS:ffff88807fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000001000 CR3: 00000000072e2004 CR4: 00000000001706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? __warn+0x8d/0x190
 ? do_user_addr_fault+0x2a0/0x790
 ? report_bug+0x1c3/0x1d0
 ? handle_bug+0x3c/0x70
 ? exc_invalid_op+0x14/0x70
 ? asm_exc_invalid_op+0x16/0x20
 ? do_user_addr_fault+0x2a0/0x790
 ? exc_page_fault+0x31/0x200
 exc_page_fault+0x68/0x200
<...snip...>
BUG: unable to handle page fault for address: 0000000000001000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 800000000ad92067 P4D 800000000ad92067 PUD 7677067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
---[ end trace 0000000000000000 ]---
BUG: unable to handle page fault for address: 0000000000001000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 800000000ad92067 P4D 800000000ad92067 PUD 7677067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 14045 Comm: daxctl Kdump: loaded Tainted: G W 6.10.0-rc2-lizhijian+ #492
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:dentry_name+0x1f4/0x440
<...snip...>
 ? dentry_name+0x2fa/0x440
 vsnprintf+0x1f3/0x4f0
 vprintk_store+0x23a/0x540
 vprintk_emit+0x6d/0x330
 _printk+0x58/0x80
 dump_mapping+0x10b/0x1a0
 ? __pfx_free_object_rcu+0x10/0x10
 __dump_page+0x26b/0x3e0
 ? vprintk_emit+0xe0/0x330
 ? _printk+0x58/0x80
 ? dump_page+0x17/0x50
 dump_page+0x17/0x50
 do_migrate_range+0x2f7/0x7f0
 ? do_migrate_range+0x42/0x7f0
 ? offline_pages+0x2f4/0x8c0
 offline_pages+0x60a/0x8c0
 memory_subsys_offline+0x9f/0x1c0
 ? lockdep_hardirqs_on+0x77/0x100
 ? _raw_spin_unlock_irqrestore+0x38/0x60
 device_offline+0xe3/0x110
 state_store+0x6e/0xc0
 kernfs_fop_write_iter+0x143/0x200
 vfs_write+0x39f/0x560
 ksys_write+0x65/0xf0
 do_syscall_64+0x62/0x130
Although dump_mapping() already performs some sanity checks before the print facility parses '%pd', it is still possible to run into an invalid dentry.d_name.name.
Since dump_mapping() only needs to dump the filename, retrieve it directly in a safer way to prevent an unnecessary crash.
Note that the filename could be unreliable whether it is retrieved with '%pd' or with strncpy_from_kernel_nofault().
Signed-off-by: Li Zhijian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
| #
f469e6e6 |
| 23-Aug-2024 |
Christian Brauner <[email protected]> |
inode: port __I_LRU_ISOLATING to var event
Port the __I_LRU_ISOLATING mechanism to use the new var event mechanism.
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
| #
88b1afbf |
| 23-Aug-2024 |
Julian Sun <[email protected]> |
vfs: fix race between evice_inodes() and find_inode()&iput()
Hi, all
Recently I noticed a bug[1] in btrfs; after digging into it, I believe it is a race in the VFS.
Let's assume there is an inode (e.g. ino 261) with i_count 1 on which iput() is called, and a concurrent thread calling generic_shutdown_super().
cpu0:                                   cpu1:
iput() // i_count is 1
  ->spin_lock(inode)
  ->dec i_count to 0
  ->iput_final()                        generic_shutdown_super()
    ->__inode_add_lru()                   ->evict_inodes()
      // cause some reason[2]               ->if (atomic_read(inode->i_count)) continue;
      // return before                      // inode 261 passed the above check
      // list_lru_add_obj()                 // and then schedule out
    ->spin_unlock()
// note here: the inode 261
// was still at sb list and hash list,
// and I_FREEING|I_WILL_FREE was not been set

btrfs_iget()
  // after some function calls
  ->find_inode()
    // found the above inode 261
    ->spin_lock(inode)
    // check I_FREEING|I_WILL_FREE
    // and passed
    ->__iget()
    ->spin_unlock(inode)                    // schedule back
                                            ->spin_lock(inode)
                                            // check (I_NEW|I_FREEING|I_WILL_FREE) flags,
                                            // passed and set I_FREEING
iput()                                      ->spin_unlock(inode)
  ->spin_lock(inode)                        ->evict()
  // dec i_count to 0
  ->iput_final()
    ->spin_unlock()
    ->evict()
Now we have two threads simultaneously evicting the same inode, which may trigger the BUG(inode->i_state & I_CLEAR) statement in both clear_inode() and iput().
To fix the bug, recheck inode->i_count after taking i_lock. The first, unlocked check is kept because it is valid in most scenarios and avoids the overhead of spin_lock().
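The check-then-recheck pattern can be sketched in userspace as follows. The `toy_*` names are hypothetical, a pthread mutex stands in for ->i_lock, and the flag handling is reduced to a single "freeing" bit.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_FREEING 0x1u

struct toy_inode {
    pthread_mutex_t lock;    /* stands in for ->i_lock */
    atomic_int count;        /* stands in for ->i_count */
    unsigned int state;      /* protected by lock */
};

/* Returns true if the caller now owns eviction of this inode. */
static bool toy_try_claim_for_evict(struct toy_inode *inode)
{
    /* Cheap unlocked filter: skips busy inodes without taking the lock. */
    if (atomic_load(&inode->count))
        return false;

    pthread_mutex_lock(&inode->lock);
    /* Recheck: a lookup may have taken a reference since the first test. */
    if (atomic_load(&inode->count) || (inode->state & TOY_FREEING)) {
        pthread_mutex_unlock(&inode->lock);
        return false;
    }
    inode->state |= TOY_FREEING;
    pthread_mutex_unlock(&inode->lock);
    return true;
}
```

In this model, as long as lookups take their reference under the same lock and refuse inodes already marked freeing, only one path can end up evicting a given inode.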
If there is any misunderstanding, please let me know, thanks.
[1]: https://lore.kernel.org/linux-btrfs/[email protected]/
[2]: The reason might be 1. SB_ACTIVE was removed or 2. mapping_shrinkable() return false when I reproduced the bug.
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=67ba3c42bcbb4665d3ad
CC: [email protected]
Fixes: 63997e98a3be ("split invalidate_inodes()")
Signed-off-by: Julian Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
| #
0fe340a9 |
| 23-Aug-2024 |
Christian Brauner <[email protected]> |
inode: port __I_NEW to var event
Port the __I_NEW mechanism to use the new var event mechanism.
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
| #
da18ecbf |
| 23-Aug-2024 |
Christian Brauner <[email protected]> |
fs: add i_state helpers
The i_state member is an unsigned long so that it can be used with the wait bit infrastructure which expects unsigned long. This wastes 4 bytes which we're unlikely to ever use. Switch to using the var event wait mechanism using the address of the bit. Thanks to Linus for the address idea.
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
| #
57510c58 |
| 13-Aug-2024 |
Mateusz Guzik <[email protected]> |
vfs: drop one lock trip in evict()
Most commonly neither I_LRU_ISOLATING nor I_SYNC are set, but the stock kernel takes a back-to-back relock trip to check for them.
It probably can be avoided altogether, but for now massage things back to just one lock acquire.
Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Zhihao Cheng <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|