|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14 |
|
| #
f70681e9 |
| 20-Mar-2025 |
Yongjian Sun <[email protected]> |
libfs: Fix duplicate directory entry in offset_dir_lookup
There is an issue in the kernel:
In tmpfs, when using the "ls" command to list the contents of a directory with a large number of files, glibc performs the getdents call in multiple rounds. If a concurrent unlink occurs between these getdents calls, it may lead to duplicate directory entries in the ls output. One possible reproduction scenario is as follows:
Create 1026 files and execute ls and rm concurrently:
for i in {1..1026}; do
    echo "This is file $i" > /tmp/dir/file$i
done

ls /tmp/dir                              rm /tmp/dir/file4
    ->getdents(file1026-file5)
                                             ->unlink(file4)
    ->getdents(file5,file3,file2,file1)
The second getdents call is expected to return file3 through file1, but instead it returns an extra file5.
The root cause of this problem is in the offset_dir_lookup function. It uses mas_find to determine the starting position for the current getdents call. Since mas_find locates the first position that is greater than or equal to mas->index, when file4 is deleted, it ends up returning file5.
It can be fixed by replacing mas_find with mas_find_rev, which finds the first position that is less than or equal to mas->index.
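The difference between the two lookup directions can be illustrated in user space. This is a minimal sketch, not the kernel code: a sorted array stands in for the maple tree, and `find_forward()` and `find_reverse()` are hypothetical helpers that only mimic the behaviour the commit message attributes to mas_find and mas_find_rev.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the maple tree: a sorted array of the
 * offsets that are still live in the directory.
 */

/* Mimics the described mas_find: first stored offset >= index, or -1. */
long find_forward(const long *offsets, size_t n, long index)
{
	for (size_t i = 0; i < n; i++)
		if (offsets[i] >= index)
			return offsets[i];
	return -1;
}

/* Mimics the described mas_find_rev: last stored offset <= index, or -1. */
long find_reverse(const long *offsets, size_t n, long index)
{
	for (size_t i = n; i > 0; i--)
		if (offsets[i - 1] <= index)
			return offsets[i - 1];
	return -1;
}
```

If the live offsets are {1, 2, 3, 5, 6} because the entry at offset 4 was unlinked, a forward search for 4 lands on 5 (an entry that was already returned), while a reverse search lands on 3 (the next entry that still has to be emitted).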
Fixes: b9b588f22a0c ("libfs: Use d_children list to iterate simple_offset directories") Signed-off-by: Yongjian Sun <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Chuck Lever <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.14-rc7, v6.14-rc6 |
|
| #
45135229 |
| 05-Mar-2025 |
Christian Brauner <[email protected]> |
pidfs: record exit code and cgroupid at exit
Record the exit code and cgroupid in release_task() and stash in struct pidfs_exit_info so it can be retrieved even after the task has been reaped.
Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-5-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5 |
|
| #
4c43ab1b |
| 23-Dec-2024 |
Al Viro <[email protected]> |
generic_ci_d_compare(): use shortname_storage
... and check the "name might be unstable" predicate the right way.
Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Gabriel Krisman Bertazi <[email protected]> Signed-off-by: Al Viro <[email protected]>
|
| #
b9b588f2 |
| 28-Dec-2024 |
Chuck Lever <[email protected]> |
libfs: Use d_children list to iterate simple_offset directories
The mtree mechanism has been effective at creating directory offsets that are stable over multiple opendir instances. However, it has not been able to handle the subtleties of renames that are concurrent with readdir.
Instead of using the mtree to emit entries in the order of their offset values, use it only to map incoming ctx->pos to a starting entry. Then use the directory's d_children list, which is already maintained properly by the dcache, to find the next child to emit.
One of the sneaky things about this is that when the mtree-allocated offset value wraps (which is very rare), looking up ctx->pos++ is not going to find the next entry; it will return NULL. Instead, by following the d_children list, the offset values can appear in any order but all of the entries in the directory will be visited eventually.
Note also that the readdir() is guaranteed to reach the tail of this list. Entries are added only at the head of d_children, and readdir walks from its current position in that list towards its tail.
Signed-off-by: Chuck Lever <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
|
| #
68a3a650 |
| 28-Dec-2024 |
Chuck Lever <[email protected]> |
libfs: Replace simple_offset end-of-directory detection
According to getdents(3), the d_off field in each returned directory entry points to the next entry in the directory. The d_off field in the last returned entry in the readdir buffer must contain a valid offset value, but if it points to an actual directory entry, then readdir/getdents can loop.
This patch introduces a specific fixed offset value that is placed in the d_off field of the last entry in a directory. Some user space applications assume that the EOD offset value is larger than the offsets of real directory entries, so the largest valid offset value is reserved for this purpose. This new value is never allocated by simple_offset_add().
When ->iterate_dir() returns, getdents{64} inserts the ctx->pos value into the d_off field of the last valid entry in the readdir buffer. When it hits EOD, offset_readdir() sets ctx->pos to the EOD offset value so the last entry is updated to point to the EOD marker.
When trying to read the entry at the EOD offset, offset_readdir() terminates immediately.
It is worth noting that using a Maple tree for directory offset value allocation does not guarantee a 63-bit range of values -- on platforms where "long" is a 32-bit type, the directory offset value range is still 0..(2^31 - 1). For broad compatibility with 32-bit user space, the largest tmpfs directory cookie value is now S32_MAX.
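A minimal user-space sketch of the end-of-directory marker described above. The constant names, `struct dir_ctx`, and `readdir_sketch()` are assumptions made for illustration; the real offset_readdir() walks dentries rather than a sorted array.

```c
#include <stddef.h>
#include <stdint.h>

enum {
	DIR_OFFSET_EOD = INT32_MAX,	/* reserved end-of-directory cookie */
	DIR_OFFSET_MAX = INT32_MAX - 1,	/* largest cookie an entry may get */
};

/* Hypothetical stand-in for the readdir context (only the position). */
struct dir_ctx {
	long pos;
};

/*
 * Emit up to cap offsets from a sorted array, starting at ctx->pos.
 * Returns the number of entries emitted.
 */
size_t readdir_sketch(struct dir_ctx *ctx, const long *offsets, size_t n,
		      long *out, size_t cap)
{
	size_t emitted = 0;

	if (ctx->pos == DIR_OFFSET_EOD)
		return 0;	/* reading at the EOD cookie ends immediately */

	for (size_t i = 0; i < n; i++) {
		if (offsets[i] < ctx->pos)
			continue;	/* returned by an earlier call */
		if (emitted == cap) {
			ctx->pos = offsets[i];	/* resume here next time */
			return emitted;
		}
		out[emitted++] = offsets[i];
	}

	ctx->pos = DIR_OFFSET_EOD;	/* nothing left: park at the marker */
	return emitted;
}
```

Because the marker is never handed to a real entry, a position equal to `DIR_OFFSET_EOD` can only mean "past the last entry", so a repeated call cannot loop back onto an existing entry.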
Fixes: 796432efab1e ("libfs: getdents() should return 0 after reaching EOD") Signed-off-by: Chuck Lever <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
|
| #
b662d858 |
| 28-Dec-2024 |
Chuck Lever <[email protected]> |
Revert "libfs: fix infinite directory reads for offset dir"
The current directory offset allocator (based on mtree_alloc_cyclic) stores the next offset value to return in octx->next_offset. This mechanism typically returns values that increase monotonically over time. Eventually, though, the newly allocated offset value wraps back to a low number (say, 2) which is smaller than other already-allocated offset values.
Yu Kuai <[email protected]> reports that, after commit 64a7ce76fb90 ("libfs: fix infinite directory reads for offset dir"), if a directory's offset allocator wraps, existing entries are no longer visible via readdir/getdents because offset_readdir() stops listing entries once an entry's offset is larger than octx->next_offset. These entries vanish persistently -- they can be looked up, but will never again appear in readdir(3) output.
The reason for this is that the commit treats directory offsets as monotonically increasing integer values rather than opaque cookies, and introduces this comparison:
if (dentry2offset(dentry) >= last_index) {
On 64-bit platforms, the directory offset value upper bound is 2^63 - 1. Directory offsets will monotonically increase for millions of years without wrapping.
On 32-bit platforms, however, LONG_MAX is 2^31 - 1. The allocator can wrap after only a few weeks (at worst).
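To see why the comparison quoted above hides entries once the allocator wraps, here is a small user-space sketch. `count_visible()` is a hypothetical helper that applies only the `>= last_index` cut-off to a list of offsets; it is not the kernel's offset_readdir().

```c
#include <stdbool.h>
#include <stddef.h>

/* The reverted check in isolation: entries at or past last_index are skipped. */
bool visible_with_compare(long entry_offset, long last_index)
{
	return entry_offset < last_index;
}

/* Count how many directory entries a listing would still show. */
size_t count_visible(const long *offsets, size_t n, long last_index)
{
	size_t shown = 0;

	for (size_t i = 0; i < n; i++)
		if (visible_with_compare(offsets[i], last_index))
			shown++;
	return shown;
}
```

With entries at offsets {2147483000, 2147483001, 2} and a wrapped next offset of 3, only the entry at offset 2 passes the check; the two older entries still exist but are never listed. As a rough illustration of the "few weeks" estimate, under an assumed rate of 1000 offset allocations per second, 2^31 offsets last 2147483648 / 1000 seconds, which is about 25 days.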
Revert commit 64a7ce76fb90 ("libfs: fix infinite directory reads for offset dir") to prepare for a fix that can work properly on 32-bit systems and might apply to recent LTS kernels where shmem employs the simple_offset mechanism.
Reported-by: Yu Kuai <[email protected]> Signed-off-by: Chuck Lever <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Yang Erkun <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
d7bde4f2 |
| 28-Dec-2024 |
Chuck Lever <[email protected]> |
Revert "libfs: Add simple_offset_empty()"
simple_empty() and simple_offset_empty() perform the same task. The latter's use as a canary to find bugs has not found any new issues. A subsequent patch will remove the use of the mtree for iterating directory contents, so revert back to using a similar mechanism for determining whether a directory is indeed empty.
Only one such mechanism is ever needed.
Signed-off-by: Chuck Lever <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Yang Erkun <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
903dc9c4 |
| 28-Dec-2024 |
Chuck Lever <[email protected]> |
libfs: Return ENOSPC when the directory offset range is exhausted
Testing shows that the EBUSY error return from mtree_alloc_cyclic() leaks into user space. The ERRORS section of "man creat(2)" says:
> EBUSY  O_EXCL was specified in flags and pathname refers
>        to a block device that is in use by the system
>        (e.g., it is mounted).
ENOSPC is closer to what applications expect in this situation.
Note that the normal range of simple directory offset values is 2..2^63, so hitting this error is going to be rare to impossible.
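The error translation can be sketched as a small wrapper. `translate_alloc_error()` is a hypothetical name; the sketch only assumes that the allocator reports failure as a negative errno value, as the message describes for mtree_alloc_cyclic().

```c
#include <errno.h>

/*
 * Map an allocator result (0 on success, negative errno on failure) to
 * what user space should see: EBUSY from an exhausted offset range
 * becomes ENOSPC, every other value passes through unchanged.
 */
int translate_alloc_error(int ret)
{
	if (ret == -EBUSY)
		return -ENOSPC;
	return ret;
}
```

Keeping the mapping in one place means other failures, such as -ENOMEM, are still reported as they are.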
Fixes: 6faddda69f62 ("libfs: Add directory operations for stable offsets") Cc: [email protected] # v6.9+ Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Yang Erkun <[email protected]> Signed-off-by: Chuck Lever <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1 |
|
| #
d2ab36bb |
| 29-Nov-2024 |
Erin Shepherd <[email protected]> |
pseudofs: add support for export_ops
Pseudo-filesystems might reasonably wish to implement the export ops (particularly for name_to_handle_at/open_by_handle_at); plumb this through pseudo_fs_context
Reviewed-by: Amir Goldstein <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Erin Shepherd <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.12, v6.12-rc7, v6.12-rc6 |
|
| #
6c056ae4 |
| 03-Nov-2024 |
Al Viro <[email protected]> |
libfs: kill empty_dir_getattr()
It's used only to initialize ->getattr in one inode_operations instance (empty_dir_inode_operations) and its behaviour had always been equivalent to what we get with NULL ->getattr.
Just remove that initializer, along with empty_dir_getattr() itself. While we are at it, the same instance has ->permission initialized to generic_permission, which is what NULL ->permission ends up doing. Again, no point keeping it.
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
|
|
Revision tags: v6.12-rc5 |
|
| #
58e55efd |
| 21-Oct-2024 |
André Almeida <[email protected]> |
tmpfs: Add casefold lookup support
Enable casefold lookup in tmpfs, based on the encoding defined by userspace. That means that instead of comparing a file name byte by byte, it compares a case-insensitive equivalent of the Unicode string.
* Dcache handling
Case-insensitive dentries need special handling. First of all, we currently invalidate every negative casefold dentry. That happens because the VFS code currently has no proper support for them, given that it could incorrectly reuse a previous filename for a new file that has a casefold match. For instance, this could happen:
$ mkdir DIR
$ rm -r DIR
$ mkdir dir
$ ls DIR/
This would be perceived as an inconsistency from the userspace point of view because, even though we match files in a case-insensitive manner, we still honor whatever the initial filename is.
Along with that, tmpfs stores only the first equivalent name dentry used in the dcache, preventing duplications of dentries in the dcache. The d_compare() version for casefold files uses a normalized string, so the filename under lookup will be compared to another normalized string for the existing file, achieving a casefolded lookup.
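The idea of comparing normalized names can be sketched in user space. This is an ASCII-only illustration under stated assumptions: the kernel's casefold comparison works on Unicode through its own Unicode subsystem, and `ci_compare_ascii()` is a hypothetical helper that only borrows the "0 means match" return convention.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Compare two names case-insensitively (ASCII only).
 * Returns 0 when they match and 1 when they differ.
 */
int ci_compare_ascii(const char *a, size_t alen, const char *b, size_t blen)
{
	if (alen != blen)
		return 1;

	for (size_t i = 0; i < alen; i++) {
		/* Fold both bytes before comparing them. */
		if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
			return 1;
	}
	return 0;
}
```

With this comparison a lookup of "DIR" matches an existing entry named "dir", which is why only the first equivalent name needs to be kept in the dcache.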
* Enabling casefold via mount options
Most filesystems store their data on disk, so the casefold option needs to be enabled when building a filesystem on a device (via mkfs). However, as tmpfs is a RAM-backed filesystem, there's no disk information and thus no mkfs to store information about casefold.
For tmpfs, create casefold options for mounting. Userspace can then enable casefold support for a mount point using:
$ mount -t tmpfs -o casefold=utf8-12.1.0 fs_name mount_dir/
Userspace must specify which Unicode standard version it is targeting. The available options depend on what the kernel Unicode subsystem supports.
And for strict encoding:
$ mount -t tmpfs -o casefold=utf8-12.1.0,strict_encoding fs_name mount_dir/
Strict encoding means that tmpfs will refuse to create invalid UTF-8 sequences. When this option is not enabled, any invalid sequence is treated as an opaque byte sequence that ignores the encoding and therefore cannot be looked up in a case-insensitive way.
* Check for casefold dirs on simple_lookup()
In simple_lookup(), do not create dentries for casefold directories. Currently, VFS does not support case-insensitive negative dentries and can create inconsistencies in the filesystem. Prevent such dentries from being created in the first place.
Reviewed-by: Gabriel Krisman Bertazi <[email protected]> Reviewed-by: Gabriel Krisman Bertazi <[email protected]> Signed-off-by: André Almeida <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
|
| #
458532c8 |
| 21-Oct-2024 |
André Almeida <[email protected]> |
libfs: Export generic_ci_ dentry functions
Export generic_ci_ dentry functions so they can be used by case-insensitive filesystems that need something more custom than the default one set by `struct generic_ci_dentry_ops`.
Reviewed-by: Gabriel Krisman Bertazi <[email protected]> Signed-off-by: André Almeida <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7 |
|
| #
4e32c25b |
| 06-Sep-2024 |
Christian Brauner <[email protected]> |
libfs: fix get_stashed_dentry()
get_stashed_dentry() tries to optimistically retrieve a stashed dentry from a provided location. It needs to hold the rcu lock before it dereferences the stashed location to prevent UAF issues. Use rcu_dereference() instead of READ_ONCE(): it's effectively equivalent, with some lockdep bells and whistles, and it communicates clearly that this expects rcu protection.
Link: https://lore.kernel.org/r/20240906-vfs-hotfix-5959800ffa68@brauner Fixes: 07fd7c329839 ("libfs: add path_from_stashed()") Reported-by: [email protected] Fixes: [email protected] Reported-by: [email protected] Fixes: [email protected] Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|
|
Revision tags: v6.11-rc6, v6.11-rc5, v6.11-rc4 |
|
| #
b381fbbc |
| 15-Aug-2024 |
Mateusz Guzik <[email protected]> |
vfs: elide smp_mb in iversion handling in the common case
According to bpftrace on these routines most calls result in cmpxchg, which already provides the same guarantee.
In inode_maybe_inc_iversion elision is possible because even if the wrong value was read due to the now missing smp_mb fence, the issue is going to correct itself after cmpxchg. If it appears cmpxchg won't be issued, the fence + reload are there, bringing back the previous behavior.
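The pattern can be approximated in user space with C11 atomics. This is a sketch under stated assumptions, not the kernel function: `maybe_inc_version()`, `VERSION_QUERIED`, and `VERSION_INCREMENT` are hypothetical names, the low bit is assumed to be a "someone read the counter" flag, and a C11 fence stands in for smp_mb.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define VERSION_QUERIED   ((uint64_t)1)	/* low bit: counter was queried */
#define VERSION_INCREMENT ((uint64_t)2)	/* step that skips the flag bit */

/* Bump the counter if forced or if it was queried; report whether it changed. */
bool maybe_inc_version(_Atomic uint64_t *version, bool force)
{
	uint64_t cur = atomic_load_explicit(version, memory_order_relaxed);
	uint64_t next;

	/*
	 * Only the path that would return without a cmpxchg pays for the
	 * full fence and a reload; the common path skips both.
	 */
	if (!force && !(cur & VERSION_QUERIED)) {
		atomic_thread_fence(memory_order_seq_cst);
		cur = atomic_load_explicit(version, memory_order_relaxed);
	}

	do {
		if (!force && !(cur & VERSION_QUERIED))
			return false;	/* nobody looked: nothing to do */
		next = (cur & ~VERSION_QUERIED) + VERSION_INCREMENT;
		/* On failure cur is refreshed with the value actually stored. */
	} while (!atomic_compare_exchange_weak(version, &cur, next));

	return true;
}
```

A stale first read is harmless on the cmpxchg path because the compare-exchange fails and hands back the current value, which is the self-correction the message refers to.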
Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.11-rc3, v6.11-rc2 |
|
| #
64a7ce76 |
| 31-Jul-2024 |
yangerkun <[email protected]> |
libfs: fix infinite directory reads for offset dir
After tmpfs switched its directory operations from simple_dir_operations to simple_offset_dir_operations, every rename inserts the new dentry into the destination directory's maple tree (&SHMEM_I(inode)->dir_offsets->mt) with a free key starting at octx->next_offset, and then sets next_offset to that free key + 1. Combined with renames happening at the same time, this leads to an infinite readdir, which fails generic/736 in xfstests (details below).
1. create 5000 files (1 2 3...) under one dir
2. call readdir (man 3 readdir) once, and get one entry
3. rename(entry, "TEMPFILE"), then rename("TEMPFILE", entry)
4. loop 2~3, until readdir returns nothing or we loop too many times (tmpfs breaks the test with the second condition)
Fix it with the same logic as commit 9b378f6ad48cf ("btrfs: fix infinite directory reads"): record last_index when the directory is opened, and do not emit any entry whose index >= last_index. The file->private_data now used by the offset dir can be used directly for this, and last_index is also updated when we llseek the directory file.
Fixes: a2e459555c5f ("shmem: stable directory offsets") Signed-off-by: yangerkun <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Chuck Lever <[email protected]> [brauner: only update last_index after seek when offset is zero like Jan suggested] Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.11-rc1 |
|
| #
1da86618 |
| 15-Jul-2024 |
Matthew Wilcox (Oracle) <[email protected]> |
fs: Convert aops->write_begin to take a folio
Convert all callers from working on a page to working on one page of a folio (support for working on an entire folio can come later). Removes a lot of folio->page->folio conversions.
Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.10 |
|
| #
a225800f |
| 10-Jul-2024 |
Matthew Wilcox (Oracle) <[email protected]> |
fs: Convert aops->write_end to take a folio
Most callers have a folio, and most implementations operate on a folio, so remove the conversion from folio->page->folio to fit through this interface.
Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3 |
|
| #
6a79a4e1 |
| 06-Jun-2024 |
Gabriel Krisman Bertazi <[email protected]> |
libfs: Introduce case-insensitive string comparison helper
generic_ci_match can be used by case-insensitive filesystems to compare strings under lookup with dirents in a case-insensitive way. This function is currently reimplemented by each filesystem supporting casefolding, so this reduces code duplication in filesystem-specific code.
[[email protected]: rework to first test the exact match, cleanup and add error message]
Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Signed-off-by: Eugen Hristev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Gabriel Krisman Bertazi <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5 |
|
| #
ad191eb6 |
| 15-Apr-2024 |
Chuck Lever <[email protected]> |
shmem: Fix shmem_rename2()
When renaming onto an existing directory entry, user space expects the replacement entry to have the same directory offset as the original one.
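A minimal sketch of that expectation, with hypothetical types and names (`struct entry`, `rename_offset()`); an offset of 0 is assumed here to mean "no offset assigned", and the counter stands in for the real allocator.

```c
#include <stddef.h>

/* Hypothetical directory entry: a name plus its directory offset. */
struct entry {
	const char *name;
	long offset;	/* 0 means "no offset assigned" */
};

/*
 * Decide which offset the moved entry gets. When the rename replaces an
 * existing target, the moved entry takes over the target's offset;
 * otherwise a fresh one is taken from *next_offset.
 */
long rename_offset(struct entry *moved, struct entry *target, long *next_offset)
{
	long reused = target ? target->offset : 0;

	if (reused) {
		target->offset = 0;	/* the overwritten entry gives it up */
		moved->offset = reused;	/* replacement keeps the same position */
	} else {
		moved->offset = (*next_offset)++;
	}
	return moved->offset;
}
```

A reader that has already iterated past the target's position therefore does not see the replacement appear at a new, later offset.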
Link: https://gitlab.alpinelinux.org/alpine/aports/-/issues/15966 Fixes: a2e459555c5f ("shmem: stable directory offsets") Signed-off-by: Chuck Lever <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
|
| #
5a1a25be |
| 15-Apr-2024 |
Chuck Lever <[email protected]> |
libfs: Add simple_offset_rename() API
I'm about to fix a tmpfs rename bug that requires the use of internal simple_offset helpers that are not available in mm/shmem.c
Signed-off-by: Chuck Lever <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
|
| #
23cdd0ee |
| 15-Apr-2024 |
Chuck Lever <[email protected]> |
libfs: Fix simple_offset_rename_exchange()
User space expects the replacement (old) directory entry to have the same directory offset after the rename.
Suggested-by: Christian Brauner <[email protected]> Fixes: a2e459555c5f ("shmem: stable directory offsets") Signed-off-by: Chuck Lever <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1 |
|
| #
9d9539db |
| 12-Mar-2024 |
Christian Brauner <[email protected]> |
pidfs: remove config option
As Linus suggested this enables pidfs unconditionally. A key property to retain is the ability to compare pidfds by inode number (cf. [1]). That's extremely helpful just as comparing namespace file descriptors by inode number is. They are used in a variety of scenarios where they need to be compared, e.g., when receiving a pidfd via SO_PEERPIDFD from a socket to trivially authenticate the sender and various other use-cases.
For 64bit systems this is pretty trivial to do. For 32bit it's slightly more annoying as we discussed but we simply add a dumb ida based allocator that gets used on 32bit. This gives the same guarantees about inode numbers on 64bit without any overflow risk. Practically, we'll never run into overflow issues because we're constrained by the number of processes that can exist on 32bit and by the number of open files that can exist on a 32bit system. On 64bit none of this matters and things are very simple.
If 32bit also needs the uniqueness guarantee they can simply parse the contents of /proc/<pid>/fd/<nr>. The uniqueness guarantees have a variety of use-cases. One of the most obvious ones is that they will make pidfiles (or "pidfdfiles", I guess) reliable as the unique identifier can be placed into there that won't be recycled. Also a frequent request.
Note, I took the chance and simplified path_from_stashed() even further. Instead of passing the inode number explicitly to path_from_stashed() we let the filesystem handle that internally. So path_from_stashed() ends up even simpler than it is now. This is also a good solution allowing the cleanup code to be clean and consistent between 32bit and 64bit. The cleanup path in prepare_anon_dentry() is also switched around so we put the inode before the dentry allocation. This means we only have to call the cleanup handler for the filesystem's inode data once and can rely on ->evict_inode() otherwise.
Aside from having to have a bit of extra code for 32bit it actually ends up a nice cleanup for path_from_stashed() imho.
Tested on both 32 and 64bit including error injection.
Link: https://github.com/systemd/systemd/pull/31713 [1] Link: https://lore.kernel.org/r/20240312-dingo-sehnlich-b3ecc35c6de7@brauner Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|
|
Revision tags: v6.8, v6.8-rc7 |
|
| #
e9c5263c |
| 01-Mar-2024 |
Christian Brauner <[email protected]> |
libfs: improve path_from_stashed()
Right now we pass a bunch of info that is fs specific which doesn't make a lot of sense and it bleeds fs specific details into the generic helper. nsfs and pidfs have slightly different needs when initializing inodes. Add simple operations that are stashed in sb->s_fs_info that both can implement. This also allows us to get rid of cleaning up references in the caller. All in all path_from_stashed() becomes way simpler.
Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.8-rc6 |
|
| #
2558e3b2 |
| 21-Feb-2024 |
Christian Brauner <[email protected]> |
libfs: add stashed_dentry_prune()
Both pidfs and nsfs use a memory location to stash a dentry for reuse by concurrent openers. Right now two custom dentry->d_prune::{ns,pidfs}_prune_dentry() methods are needed that do the same thing. The only thing that differs is that they need to get to the memory location to store or retrieve the dentry from differently. Fix that by remembering the stashing location for the dentry in dentry->d_fsdata which allows us to retrieve it in dentry->d_prune. That in turn makes it possible to add a common helper that pidfs and nsfs can both use.
Link: https://lore.kernel.org/r/CAHk-=wg8cHY=i3m6RnXQ2Y2W8psicKWQEZq1=94ivUiviM-0OA@mail.gmail.com Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.8-rc5 |
|
| #
159a0d9f |
| 18-Feb-2024 |
Christian Brauner <[email protected]> |
libfs: improve path_from_stashed() helper
In earlier patches we moved both nsfs and pidfs to path_from_stashed(). The helper currently tries to add and stash a new dentry if a reusable dentry couldn't be found and returns EAGAIN if it lost the race to stash the dentry. The caller can use EAGAIN to retry.
The helper and the two filesystems can be written in a way that makes returning EAGAIN unnecessary. To do this we need to change the dentry->d_prune() implementation of nsfs and pidfs to not simply replace the stashed dentry with NULL but to use a cmpxchg() and only replace their own dentry.
path_from_stashed() can then be changed to not just stash a new dentry when no dentry is currently stashed but also when an already dead dentry is stashed. If another task managed to install a dentry in the meantime it can simply be reused. Pack that into a loop and call it a day.
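The stash-and-reuse loop can be sketched in user space with C11 atomics. Everything here is an assumption made for illustration: `struct dentry_sketch` with a plain `dead` flag replaces real dentry lifetime handling, and `prune_stashed()` and `stash_or_reuse()` are hypothetical names, not the kernel helpers.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical dentry: only tracks whether it is already dead. */
struct dentry_sketch {
	bool dead;
};

/* Prune side: clear the slot only if it still holds our own dentry. */
void prune_stashed(struct dentry_sketch *_Atomic *slot,
		   struct dentry_sketch *self)
{
	struct dentry_sketch *expected = self;

	atomic_compare_exchange_strong(slot, &expected,
				       (struct dentry_sketch *)NULL);
}

/* Opener side: reuse a live stashed dentry, otherwise install ours. */
struct dentry_sketch *stash_or_reuse(struct dentry_sketch *_Atomic *slot,
				     struct dentry_sketch *fresh)
{
	struct dentry_sketch *cur = atomic_load(slot);

	for (;;) {
		if (cur && !cur->dead)
			return cur;	/* a live dentry is stashed: reuse it */

		/* Slot is empty or holds a dead dentry: try to install ours. */
		if (atomic_compare_exchange_strong(slot, &cur, fresh))
			return fresh;

		/* Lost the race: cur now holds the winner, so check it again. */
	}
}
```

Because the prune side never clears a slot that holds somebody else's dentry, the opener never needs to report a lost race to its caller; it retries inside the loop.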
Suggested-by: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/CAHk-=wgtLF5Z5=15-LKAczWm=-tUjHO+Bpf7WjBG+UU3s=fEQw@mail.gmail.com Signed-off-by: Christian Brauner <[email protected]>
|