|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3 |
|
| #
777d0961 |
| 17-Apr-2025 |
Christoph Hellwig <[email protected]> |
fs: move the bdex_statx call to vfs_getattr_nosec
Currently bdex_statx is only called from the very high-level vfs_statx_path function, and thus bypassing it for in-kernel calls to vfs_getattr or vf
fs: move the bdex_statx call to vfs_getattr_nosec
Currently bdex_statx is only called from the very high-level vfs_statx_path function, and thus bypassing it for in-kernel calls to vfs_getattr or vfs_getattr_nosec.
This breaks querying the block ѕize of the underlying device in the loop driver and also is a pitfall for any other new kernel caller.
Move the call into the lowest level helper to ensure all callers get the right results.
Fixes: 2d985f8c6b91 ("vfs: support STATX_DIOALIGN on block devices") Fixes: f4774e92aab8 ("loop: take the file system minimum dio alignment into account") Reported-by: "Darrick J. Wong" <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13 |
|
| #
0fac3ed4 |
| 19-Jan-2025 |
Su Hui <[email protected]> |
fs/stat.c: avoid harmless garbage value problem in vfs_statx_path()
Clang static checker(scan-build) warning: fs/stat.c:287:21: warning: The left expression of the compound assignment is an uninitia
fs/stat.c: avoid harmless garbage value problem in vfs_statx_path()
Clang static checker(scan-build) warning: fs/stat.c:287:21: warning: The left expression of the compound assignment is an uninitialized value. The computed value will also be garbage. 287 | stat->result_mask |= STATX_MNT_ID_UNIQUE; | ~~~~~~~~~~~~~~~~~ ^ fs/stat.c:290:21: warning: The left expression of the compound assignment is an uninitialized value. The computed value will also be garbage. 290 | stat->result_mask |= STATX_MNT_ID;
When vfs_getattr() failed because of security_inode_getattr(), 'stat' is uninitialized. In this case, there is a harmless garbage problem in vfs_statx_path(). It's better to return error directly when vfs_getattr() failed, avoiding garbage value and more clearly.
Signed-off-by: Su Hui <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc7 |
|
| #
7ed6cbe0 |
| 09-Jan-2025 |
Christoph Hellwig <[email protected]> |
fs: add STATX_DIO_READ_ALIGN
Add a separate dio read align field, as many out of place write file systems can easily do reads aligned to the device sector size, but require bigger alignment for writ
fs: add STATX_DIO_READ_ALIGN
Add a separate dio read align field, as many out of place write file systems can easily do reads aligned to the device sector size, but require bigger alignment for writes.
This is usually papered over by falling back to buffered I/O for smaller writes and doing read-modify-write cycles, but performance for this sucks, so applications benefit from knowing the actual write alignment.
Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: John Garry <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6 |
|
| #
95f567f8 |
| 01-Nov-2024 |
Stefan Berger <[email protected]> |
fs: Simplify getattr interface function checking AT_GETATTR_NOSEC flag
Commit 8a924db2d7b5 ("fs: Pass AT_GETATTR_NOSEC flag to getattr interface function")' introduced the AT_GETATTR_NOSEC flag to e
fs: Simplify getattr interface function checking AT_GETATTR_NOSEC flag
Commit 8a924db2d7b5 ("fs: Pass AT_GETATTR_NOSEC flag to getattr interface function")' introduced the AT_GETATTR_NOSEC flag to ensure that the call paths only call vfs_getattr_nosec if it is set instead of vfs_getattr. Now, simplify the getattr interface functions of filesystems where the flag AT_GETATTR_NOSEC is checked.
There is only a single caller of inode_operations getattr function and it is located in fs/stat.c in vfs_getattr_nosec. The caller there is the only one from which the AT_GETATTR_NOSEC flag is passed from.
Two filesystems are checking this flag in .getattr and the flag is always passed to them unconditionally from only vfs_getattr_nosec:
- ecryptfs: Simplify by always calling vfs_getattr_nosec in ecryptfs_getattr. From there the flag is passed to no other function and this function is not called otherwise.
- overlayfs: Simplify by always calling vfs_getattr_nosec in ovl_getattr. From there the flag is passed to no other function and this function is not called otherwise.
The query_flags in vfs_getattr_nosec will mask-out AT_GETATTR_NOSEC from any caller using AT_STATX_SYNC_TYPE as mask so that the flag is not important inside this function. Also, since no filesystem is checking the flag anymore, remove the flag entirely now, including the BUG_ON check that never triggered.
The net change of the changes here combined with the original commit is that ecryptfs and overlayfs do not call vfs_getattr but only vfs_getattr_nosec.
Fixes: 8a924db2d7b5 ("fs: Pass AT_GETATTR_NOSEC flag to getattr interface function") Reported-by: Al Viro <[email protected]> Closes: https://lore.kernel.org/linux-fsdevel/20241101011724.GN1350452@ZenIV/T/#u Cc: Tyler Hicks <[email protected]> Cc: [email protected] Cc: Miklos Szeredi <[email protected]> Cc: Amir Goldstein <[email protected]> Cc: [email protected] Cc: Christian Brauner <[email protected]> Cc: [email protected] Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Stefan Berger <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc5, v6.12-rc4 |
|
| #
0dd4fb73 |
| 20-Oct-2024 |
Al Viro <[email protected]> |
fs/stat.c: switch to CLASS(fd_raw)
... and use fd_empty() consistently
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
|
|
Revision tags: v6.12-rc3, v6.12-rc2, v6.12-rc1 |
|
| #
88a20626 |
| 20-Sep-2024 |
Al Viro <[email protected]> |
kill getname_statx_lookup_flags()
LOOKUP_EMPTY is ignored by the only remaining user, and without that 'getname_' prefix makes no sense.
Remove LOOKUP_EMPTY part, rename to statx_lookup_flags() and
kill getname_statx_lookup_flags()
LOOKUP_EMPTY is ignored by the only remaining user, and without that 'getname_' prefix makes no sense.
Remove LOOKUP_EMPTY part, rename to statx_lookup_flags() and make static. It most likely is _not_ statx() specific, either, but that's the next step.
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
| #
e896474f |
| 08-Oct-2024 |
Al Viro <[email protected]> |
getname_maybe_null() - the third variant of pathname copy-in
Semantics used by statx(2) (and later *xattrat(2)): without AT_EMPTY_PATH it's standard getname() (i.e. ERR_PTR(-ENOENT) on empty string,
getname_maybe_null() - the third variant of pathname copy-in
Semantics used by statx(2) (and later *xattrat(2)): without AT_EMPTY_PATH it's standard getname() (i.e. ERR_PTR(-ENOENT) on empty string, ERR_PTR(-EFAULT) on NULL), with AT_EMPTY_PATH both empty string and NULL are accepted.
Calling conventions: getname_maybe_null(user_pointer, flags) returns * pointer to struct filename when non-empty string had been successfully read * ERR_PTR(...) on error * NULL if an empty string or NULL pointer had been given with AT_EMPTY_PATH in the flags argument.
It tries to avoid allocation in the last case; it's not always able to do so, in which case the temporary struct filename instance is freed and NULL returned anyway.
Fast path is inlined.
Signed-off-by: Al Viro <[email protected]>
show more ...
|
| #
c86e3c47 |
| 02-Oct-2024 |
Jeff Layton <[email protected]> |
fs: tracepoints around multigrain timestamp events
Add some tracepoints around various multigrain timestamp events.
Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Darrick J. Wong <djw
fs: tracepoints around multigrain timestamp events
Add some tracepoints around various multigrain timestamp events.
Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Tested-by: Randy Dunlap <[email protected]> # documentation bits Signed-off-by: Jeff Layton <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
4e40eff0 |
| 02-Oct-2024 |
Jeff Layton <[email protected]> |
fs: add infrastructure for multigrain timestamps
The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to o
fs: add infrastructure for multigrain timestamps
The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications).
If fine-grained timestamps were always used, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates.
What is needed is a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, allow the update to use a fine-grained timestamp iff it's necessary to make the ctime show a different value.
If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, accept that value. If it isn't, then get a fine-grained timestamp and attempt to stamp the inode ctime with that value. If that races with another concurrent stamp, then abandon the update and take the new value without retrying.
Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems).
Tested-by: Randy Dunlap <[email protected]> # documentation bits Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Jeff Layton <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2 |
|
| #
1da91ea8 |
| 31-May-2024 |
Al Viro <[email protected]> |
introduce fd_file(), convert all accessors to it.
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are ve
introduce fd_file(), convert all accessors to it.
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict]
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
| #
0ef625bb |
| 25-Jun-2024 |
Mateusz Guzik <[email protected]> |
vfs: support statx(..., NULL, AT_EMPTY_PATH, ...)
The newly used helper also checks for empty ("") paths.
NULL paths with any flag value other than AT_EMPTY_PATH go the usual route and end up with
vfs: support statx(..., NULL, AT_EMPTY_PATH, ...)
The newly used helper also checks for empty ("") paths.
NULL paths with any flag value other than AT_EMPTY_PATH go the usual route and end up with -EFAULT to retain compatibility (Rust is abusing calls of the sort to detect availability of statx).
This avoids path lookup code, lockref management, memory allocation and in case of NULL path userspace memory access (which can be quite expensive with SMAP on x86_64).
Benchmarked with statx(..., AT_EMPTY_PATH, ...) running on Sapphire Rapids, with the "" path for the first two cases and NULL for the last one.
Results in ops/s: stock: 4231237 pre-check: 5944063 (+40%) NULL path: 6601619 (+11%/+56%)
Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Tested-by: Xi Ruoyao <[email protected]> [brauner: use path_mounted() and other tweaks] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc1, v6.9, v6.9-rc7 |
|
| #
27a2d0cb |
| 30-Apr-2024 |
Christian Brauner <[email protected]> |
stat: use vfs_empty_path() helper
Use the newly added helper for this.
Signed-off-by: Christian Brauner <[email protected]>
|
| #
9abcfbd2 |
| 20-Jun-2024 |
Prasad Singamsetty <[email protected]> |
block: Add atomic write support for statx
Extend statx system call to return additional info for atomic write support support if the specified file is a block device.
Reviewed-by: Martin K. Peterse
block: Add atomic write support for statx
Extend statx system call to return additional info for atomic write support support if the specified file is a block device.
Reviewed-by: Martin K. Petersen <[email protected]> Signed-off-by: Prasad Singamsetty <[email protected]> Signed-off-by: John Garry <[email protected]> Reviewed-by: Keith Busch <[email protected]> Acked-by: Darrick J. Wong <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
0f9ca80f |
| 20-Jun-2024 |
Prasad Singamsetty <[email protected]> |
fs: Add initial atomic write support info to statx
Extend statx system call to return additional info for atomic write support support for a file.
Helper function generic_fill_statx_atomic_writes()
fs: Add initial atomic write support info to statx
Extend statx system call to return additional info for atomic write support support for a file.
Helper function generic_fill_statx_atomic_writes() can be used by FSes to fill in the relevant statx fields. For now atomic_write_segments_max will always be 1, otherwise some rules would need to be imposed on iovec length and alignment, which we don't want now.
Signed-off-by: Prasad Singamsetty <[email protected]> jpg: relocate bdev support to another patch Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Signed-off-by: John Garry <[email protected]> Acked-by: Darrick J. Wong <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
dff60734 |
| 04-Jun-2024 |
Mateusz Guzik <[email protected]> |
vfs: retire user_path_at_empty and drop empty arg from getname_flags
No users after do_readlinkat started doing the job on its own.
Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lo
vfs: retire user_path_at_empty and drop empty arg from getname_flags
No users after do_readlinkat started doing the job on its own.
Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
969ce92d |
| 04-Jun-2024 |
Mateusz Guzik <[email protected]> |
vfs: stop using user_path_at_empty in do_readlinkat
It is the only consumer and it saddles getname_flags with an argument set to NULL by everyone else.
Instead the routine can do the empty check on
vfs: stop using user_path_at_empty in do_readlinkat
It is the only consumer and it saddles getname_flags with an argument set to NULL by everyone else.
Instead the routine can do the empty check on its own.
Then user_path_at_empty can get retired and getname_flags can lose the argument.
Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8 |
|
| #
2a82bb02 |
| 08-Mar-2024 |
Kent Overstreet <[email protected]> |
statx: stx_subvol
Add a new statx field for (sub)volume identifiers, as implemented by btrfs and bcachefs.
This includes bcachefs support; we'll definitely want btrfs support as well.
Link: https:
statx: stx_subvol
Add a new statx field for (sub)volume identifiers, as implemented by btrfs and bcachefs.
This includes bcachefs support; we'll definitely want btrfs support as well.
Link: https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/ Signed-off-by: Kent Overstreet <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Kent Overstreet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Thumshirn <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6 |
|
| #
376870aa |
| 15-Dec-2023 |
Alexander Mikhalitsyn <[email protected]> |
fs: fix doc comment typo fs tree wide
Do the replacement: s/simply passs @nop_mnt_idmap/simply pass @nop_mnt_idmap/ in the fs/ tree.
Found by chance while working on support for idmapped mounts in
fs: fix doc comment typo fs tree wide
Do the replacement: s/simply passs @nop_mnt_idmap/simply pass @nop_mnt_idmap/ in the fs/ tree.
Found by chance while working on support for idmapped mounts in fuse.
Cc: Jan Kara <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Signed-off-by: Alexander Mikhalitsyn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6 |
|
| #
98d2b430 |
| 25-Oct-2023 |
Miklos Szeredi <[email protected]> |
add unique mount ID
If a mount is released then its mnt_id can immediately be reused. This is bad news for user interfaces that want to uniquely identify a mount.
Implementing a unique mount ID is
add unique mount ID
If a mount is released then its mnt_id can immediately be reused. This is bad news for user interfaces that want to uniquely identify a mount.
Implementing a unique mount ID is trivial (use a 64bit counter). Unfortunately userspace assumes 32bit size and would overflow after the counter reaches 2^32.
Introduce a new 64bit ID alongside the old one. Initialize the counter to 2^32, this guarantees that the old and new IDs are never mixed up.
Signed-off-by: Miklos Szeredi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Ian Kent <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc7, v6.6-rc6, v6.6-rc5 |
|
| #
8a924db2 |
| 02-Oct-2023 |
Stefan Berger <[email protected]> |
fs: Pass AT_GETATTR_NOSEC flag to getattr interface function
When vfs_getattr_nosec() calls a filesystem's getattr interface function then the 'nosec' should propagate into this function so that vfs
fs: Pass AT_GETATTR_NOSEC flag to getattr interface function
When vfs_getattr_nosec() calls a filesystem's getattr interface function then the 'nosec' should propagate into this function so that vfs_getattr_nosec() can again be called from the filesystem's gettattr rather than vfs_getattr(). The latter would add unnecessary security checks that the initial vfs_getattr_nosec() call wanted to avoid. Therefore, introduce the getattr flag GETATTR_NOSEC and allow to pass with the new getattr_flags parameter to the getattr interface function. In overlayfs and ecryptfs use this flag to determine which one of the two functions to call.
In a recent code change introduced to IMA vfs_getattr_nosec() ended up calling vfs_getattr() in overlayfs, which in turn called security_inode_getattr() on an exiting process that did not have current->fs set anymore, which then caused a kernel NULL pointer dereference. With this change the call to security_inode_getattr() can be avoided, thus avoiding the NULL pointer dereference.
Reported-by: <[email protected]> Fixes: db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get the i_version") Cc: Alexander Viro <[email protected]> Cc: <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Amir Goldstein <[email protected]> Cc: Tyler Hicks <[email protected]> Cc: Mimi Zohar <[email protected]> Suggested-by: Christian Brauner <[email protected]> Co-developed-by: Amir Goldstein <[email protected]> Signed-off-by: Stefan Berger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Amir Goldstein <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
16a94965 |
| 04-Oct-2023 |
Jeff Layton <[email protected]> |
fs: convert core infrastructure to new timestamp accessors
Convert the core vfs code to use the new timestamp accessor functions.
Signed-off-by: Jeff Layton <[email protected]> Link: https://lore.
fs: convert core infrastructure to new timestamp accessors
Convert the core vfs code to use the new timestamp accessor functions.
Signed-off-by: Jeff Layton <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc4, v6.6-rc3 |
|
| #
647aa768 |
| 20-Sep-2023 |
Christian Brauner <[email protected]> |
Revert "fs: add infrastructure for multigrain timestamps"
This reverts commit ffb6cf19e06334062744b7e3493f71e500964f8e.
Users reported regressions due to enabling multi-grained timestamps unconditi
Revert "fs: add infrastructure for multigrain timestamps"
This reverts commit ffb6cf19e06334062744b7e3493f71e500964f8e.
Users reported regressions due to enabling multi-grained timestamps unconditionally. As no clear consensus on a solution has come up and the discussion has gone back to the drawing board revert the infrastructure changes for. If it isn't code that's here to stay, make it go away.
Message-ID: <20230920-keine-eile-c9755b5825db@brauner> Acked-by: Jan Kara <[email protected]> Acked-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc2, v6.6-rc1 |
|
| #
42aadec8 |
| 03-Sep-2023 |
Linus Torvalds <[email protected]> |
stat: remove no-longer-used helper macros
The choose_32_64() macros were added to deal with an odd inconsistency between the 32-bit and 64-bit layout of 'struct stat' way back when in commit a52dd97
stat: remove no-longer-used helper macros
The choose_32_64() macros were added to deal with an odd inconsistency between the 32-bit and 64-bit layout of 'struct stat' way back when in commit a52dd971f947 ("vfs: de-crapify "cp_new_stat()" function").
Then a decade later Mikulas noticed that said inconsistency had been a mistake in the early x86-64 port, and shouldn't have existed in the first place. So commit 932aba1e1690 ("stat: fix inconsistency between struct stat and struct compat_stat") removed the uses of the helpers.
But the helpers remained around, unused.
Get rid of them.
Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
| #
9013c51c |
| 03-Sep-2023 |
Linus Torvalds <[email protected]> |
vfs: mostly undo glibc turning 'fstat()' into 'fstatat(AT_EMPTY_PATH)'
Mateusz reports that glibc turns 'fstat()' calls into 'fstatat()', and that seems to have been going on for quite a long time d
vfs: mostly undo glibc turning 'fstat()' into 'fstatat(AT_EMPTY_PATH)'
Mateusz reports that glibc turns 'fstat()' calls into 'fstatat()', and that seems to have been going on for quite a long time due to glibc having tried to simplify its stat logic into just one point.
This turns out to cause completely unnecessary overhead, where we then go off and allocate the kernel side pathname, and actually look up the empty path. Sure, our path lookup is quite optimized, but it still causes a fair bit of allocation overhead and a couple of completely unnecessary rounds of lockref accesses etc.
This is all hopefully getting fixed in user space, and there is a patch floating around for just having glibc use the native fstat() system call. But even with the current situation we can at least improve on things by catching the situation and short-circuiting it.
Note that this is still measurably slower than just a plain 'fstat()', since just checking that the filename is actually empty is somewhat expensive due to inevitable user space access overhead from the kernel (ie verifying pointers, and SMAP on x86). But it's still quite a bit faster than actually looking up the path for real.
To quote numers from Mateusz: "Sapphire Rapids, will-it-scale, ops/s
stock fstat 5088199 patched fstat 7625244 (+49%) real fstat 8540383 (+67% / +12%)"
where that 'stock fstat' is the glibc translation of fstat into fstatat() with an empty path, the 'patched fstat' is with this short circuiting of the path lookup, and the 'real fstat' is the actual native fstat() system call with none of this overhead.
Link: https://lore.kernel.org/lkml/20230903204858.lv7i3kqvw6eamhgz@f/ Reported-by: Mateusz Guzik <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v6.5, v6.5-rc7, v6.5-rc6 |
|
| #
ffb6cf19 |
| 07-Aug-2023 |
Jeff Layton <[email protected]> |
fs: add infrastructure for multigrain timestamps
The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optim
fs: add infrastructure for multigrain timestamps
The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications).
If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates.
What we need is a way to only use fine-grained timestamps when they are being actively queried.
POSIX generally mandates that when the the mtime changes, the ctime must also change. The kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used.
Use the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one.
Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps.
Later patches will convert individual filesystems to use the new infrastructure.
Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|