History log of /linux-6.15/fs/libfs.c (Results 1 – 25 of 238)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14
# f70681e9 20-Mar-2025 Yongjian Sun <[email protected]>

libfs: Fix duplicate directory entry in offset_dir_lookup

There is an issue in the kernel:

In tmpfs, when using the "ls" command to list the contents
of a directory with a large number of files, gl

libfs: Fix duplicate directory entry in offset_dir_lookup

There is an issue in the kernel:

In tmpfs, when using the "ls" command to list the contents
of a directory with a large number of files, glibc performs
the getdents call in multiple rounds. If a concurrent unlink
occurs between these getdents calls, it may lead to duplicate
directory entries in the ls output. One possible reproduction
scenario is as follows:

Create 1026 files and execute ls and rm concurrently:

for i in {1..1026}; do
echo "This is file $i" > /tmp/dir/file$i
done

ls /tmp/dir rm /tmp/dir/file4
->getdents(file1026-file5)
->unlink(file4)

->getdents(file5,file3,file2,file1)

It is expected that the second getdents call to return file3
through file1, but instead it returns an extra file5.

The root cause of this problem is in the offset_dir_lookup
function. It uses mas_find to determine the starting position
for the current getdents call. Since mas_find locates the first
position that is greater than or equal to mas->index, when file4
is deleted, it ends up returning file5.

It can be fixed by replacing mas_find with mas_find_rev, which
finds the first position that is less than or equal to mas->index.

Fixes: b9b588f22a0c ("libfs: Use d_children list to iterate simple_offset directories")
Signed-off-by: Yongjian Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Chuck Lever <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.14-rc7, v6.14-rc6
# 45135229 05-Mar-2025 Christian Brauner <[email protected]>

pidfs: record exit code and cgroupid at exit

Record the exit code and cgroupid in release_task() and stash in struct
pidfs_exit_info so it can be retrieved even after the task has been
reaped.

Link

pidfs: record exit code and cgroupid at exit

Record the exit code and cgroupid in release_task() and stash in struct
pidfs_exit_info so it can be retrieved even after the task has been
reaped.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-5-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5
# 4c43ab1b 23-Dec-2024 Al Viro <[email protected]>

generic_ci_d_compare(): use shortname_storage

... and check the "name might be unstable" predicate
the right way.

Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Gabriel Krisman Bertazi

generic_ci_d_compare(): use shortname_storage

... and check the "name might be unstable" predicate
the right way.

Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Gabriel Krisman Bertazi <[email protected]>
Signed-off-by: Al Viro <[email protected]>

show more ...


# b9b588f2 28-Dec-2024 Chuck Lever <[email protected]>

libfs: Use d_children list to iterate simple_offset directories

The mtree mechanism has been effective at creating directory offsets
that are stable over multiple opendir instances. However, it has

libfs: Use d_children list to iterate simple_offset directories

The mtree mechanism has been effective at creating directory offsets
that are stable over multiple opendir instances. However, it has not
been able to handle the subtleties of renames that are concurrent
with readdir.

Instead of using the mtree to emit entries in the order of their
offset values, use it only to map incoming ctx->pos to a starting
entry. Then use the directory's d_children list, which is already
maintained properly by the dcache, to find the next child to emit.

One of the sneaky things about this is that when the mtree-allocated
offset value wraps (which is very rare), looking up ctx->pos++ is
not going to find the next entry; it will return NULL. Instead, by
following the d_children list, the offset values can appear in any
order but all of the entries in the directory will be visited
eventually.

Note also that the readdir() is guaranteed to reach the tail of this
list. Entries are added only at the head of d_children, and readdir
walks from its current position in that list towards its tail.

Signed-off-by: Chuck Lever <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


# 68a3a650 28-Dec-2024 Chuck Lever <[email protected]>

libfs: Replace simple_offset end-of-directory detection

According to getdents(3), the d_off field in each returned directory
entry points to the next entry in the directory. The d_off field in
the l

libfs: Replace simple_offset end-of-directory detection

According to getdents(3), the d_off field in each returned directory
entry points to the next entry in the directory. The d_off field in
the last returned entry in the readdir buffer must contain a valid
offset value, but if it points to an actual directory entry, then
readdir/getdents can loop.

This patch introduces a specific fixed offset value that is placed
in the d_off field of the last entry in a directory. Some user space
applications assume that the EOD offset value is larger than the
offsets of real directory entries, so the largest valid offset value
is reserved for this purpose. This new value is never allocated by
simple_offset_add().

When ->iterate_dir() returns, getdents{64} inserts the ctx->pos
value into the d_off field of the last valid entry in the readdir
buffer. When it hits EOD, offset_readdir() sets ctx->pos to the EOD
offset value so the last entry is updated to point to the EOD marker.

When trying to read the entry at the EOD offset, offset_readdir()
terminates immediately.

It is worth noting that using a Maple tree for directory offset
value allocation does not guarantee a 63-bit range of values --
on platforms where "long" is a 32-bit type, the directory offset
value range is still 0..(2^31 - 1). For broad compatibility with
32-bit user space, the largest tmpfs directory cookie value is now
S32_MAX.

Fixes: 796432efab1e ("libfs: getdents() should return 0 after reaching EOD")
Signed-off-by: Chuck Lever <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


# b662d858 28-Dec-2024 Chuck Lever <[email protected]>

Revert "libfs: fix infinite directory reads for offset dir"

The current directory offset allocator (based on mtree_alloc_cyclic)
stores the next offset value to return in octx->next_offset. This
mec

Revert "libfs: fix infinite directory reads for offset dir"

The current directory offset allocator (based on mtree_alloc_cyclic)
stores the next offset value to return in octx->next_offset. This
mechanism typically returns values that increase monotonically over
time. Eventually, though, the newly allocated offset value wraps
back to a low number (say, 2) which is smaller than other already-
allocated offset values.

Yu Kuai <[email protected]> reports that, after commit 64a7ce76fb90
("libfs: fix infinite directory reads for offset dir"), if a
directory's offset allocator wraps, existing entries are no longer
visible via readdir/getdents because offset_readdir() stops listing
entries once an entry's offset is larger than octx->next_offset.
These entries vanish persistently -- they can be looked up, but will
never again appear in readdir(3) output.

The reason for this is that the commit treats directory offsets as
monotonically increasing integer values rather than opaque cookies,
and introduces this comparison:

if (dentry2offset(dentry) >= last_index) {

On 64-bit platforms, the directory offset value upper bound is
2^63 - 1. Directory offsets will monotonically increase for millions
of years without wrapping.

On 32-bit platforms, however, LONG_MAX is 2^31 - 1. The allocator
can wrap after only a few weeks (at worst).

Revert commit 64a7ce76fb90 ("libfs: fix infinite directory reads for
offset dir") to prepare for a fix that can work properly on 32-bit
systems and might apply to recent LTS kernels where shmem employs
the simple_offset mechanism.

Reported-by: Yu Kuai <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Yang Erkun <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


# d7bde4f2 28-Dec-2024 Chuck Lever <[email protected]>

Revert "libfs: Add simple_offset_empty()"

simple_empty() and simple_offset_empty() perform the same task.
The latter's use as a canary to find bugs has not found any new
issues. A subsequent patch w

Revert "libfs: Add simple_offset_empty()"

simple_empty() and simple_offset_empty() perform the same task.
The latter's use as a canary to find bugs has not found any new
issues. A subsequent patch will remove the use of the mtree for
iterating directory contents, so revert back to using a similar
mechanism for determining whether a directory is indeed empty.

Only one such mechanism is ever needed.

Signed-off-by: Chuck Lever <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Yang Erkun <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


# 903dc9c4 28-Dec-2024 Chuck Lever <[email protected]>

libfs: Return ENOSPC when the directory offset range is exhausted

Testing shows that the EBUSY error return from mtree_alloc_cyclic()
leaks into user space. The ERRORS section of "man creat(2)" says

libfs: Return ENOSPC when the directory offset range is exhausted

Testing shows that the EBUSY error return from mtree_alloc_cyclic()
leaks into user space. The ERRORS section of "man creat(2)" says:

> EBUSY O_EXCL was specified in flags and pathname refers
> to a block device that is in use by the system
> (e.g., it is mounted).

ENOSPC is closer to what applications expect in this situation.

Note that the normal range of simple directory offset values is
2..2^63, so hitting this error is going to be rare to impossible.

Fixes: 6faddda69f62 ("libfs: Add directory operations for stable offsets")
Cc: [email protected] # v6.9+
Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Yang Erkun <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1
# d2ab36bb 29-Nov-2024 Erin Shepherd <[email protected]>

pseudofs: add support for export_ops

Pseudo-filesystems might reasonably wish to implement the export ops
(particularly for name_to_handle_at/open_by_handle_at); plumb this
through pseudo_fs_context

pseudofs: add support for export_ops

Pseudo-filesystems might reasonably wish to implement the export ops
(particularly for name_to_handle_at/open_by_handle_at); plumb this
through pseudo_fs_context

Reviewed-by: Amir Goldstein <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Erin Shepherd <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.12, v6.12-rc7, v6.12-rc6
# 6c056ae4 03-Nov-2024 Al Viro <[email protected]>

libfs: kill empty_dir_getattr()

It's used only to initialize ->getattr in one inode_operations instance
(empty_dir_inode_operations) and its behaviour had always been equivalent
to what we get with

libfs: kill empty_dir_getattr()

It's used only to initialize ->getattr in one inode_operations instance
(empty_dir_inode_operations) and its behaviour had always been equivalent
to what we get with NULL ->getattr.

Just remove that initializer, along with empty_dir_getattr() itself.
While we are at it, the same instance has ->permission initialized to
generic_permission, which is what NULL ->permission ends up doing.
Again, no point keeping it.

Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>

show more ...


Revision tags: v6.12-rc5
# 58e55efd 21-Oct-2024 André Almeida <[email protected]>

tmpfs: Add casefold lookup support

Enable casefold lookup in tmpfs, based on the encoding defined by
userspace. That means that instead of comparing byte per byte a file
name, it compares to a case-

tmpfs: Add casefold lookup support

Enable casefold lookup in tmpfs, based on the encoding defined by
userspace. That means that instead of comparing byte per byte a file
name, it compares to a case-insensitive equivalent of the Unicode
string.

* Dcache handling

There's a special need when dealing with case-insensitive dentries.
First of all, we currently invalidated every negative casefold dentries.
That happens because currently VFS code has no proper support to deal
with that, giving that it could incorrectly reuse a previous filename
for a new file that has a casefold match. For instance, this could
happen:

$ mkdir DIR
$ rm -r DIR
$ mkdir dir
$ ls
DIR/

And would be perceived as inconsistency from userspace point of view,
because even that we match files in a case-insensitive manner, we still
honor whatever is the initial filename.

Along with that, tmpfs stores only the first equivalent name dentry used
in the dcache, preventing duplications of dentries in the dcache. The
d_compare() version for casefold files uses a normalized string, so the
filename under lookup will be compared to another normalized string for
the existing file, achieving a casefolded lookup.

* Enabling casefold via mount options

Most filesystems have their data stored in disk, so casefold option need
to be enabled when building a filesystem on a device (via mkfs).
However, as tmpfs is a RAM backed filesystem, there's no disk
information and thus no mkfs to store information about casefold.

For tmpfs, create casefold options for mounting. Userspace can then
enable casefold support for a mount point using:

$ mount -t tmpfs -o casefold=utf8-12.1.0 fs_name mount_dir/

Userspace must set what Unicode standard is aiming to. The available
options depends on what the kernel Unicode subsystem supports.

And for strict encoding:

$ mount -t tmpfs -o casefold=utf8-12.1.0,strict_encoding fs_name mount_dir/

Strict encoding means that tmpfs will refuse to create invalid UTF-8
sequences. When this option is not enabled, any invalid sequence will be
treated as an opaque byte sequence, ignoring the encoding thus not being
able to be looked up in a case-insensitive way.

* Check for casefold dirs on simple_lookup()

On simple_lookup(), do not create dentries for casefold directories.
Currently, VFS does not support case-insensitive negative dentries and
can create inconsistencies in the filesystem. Prevent such dentries to
being created in the first place.

Reviewed-by: Gabriel Krisman Bertazi <[email protected]>
Reviewed-by: Gabriel Krisman Bertazi <[email protected]>
Signed-off-by: André Almeida <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


# 458532c8 21-Oct-2024 André Almeida <[email protected]>

libfs: Export generic_ci_ dentry functions

Export generic_ci_ dentry functions so they can be used by
case-insensitive filesystems that need something more custom than the
default one set by `struct

libfs: Export generic_ci_ dentry functions

Export generic_ci_ dentry functions so they can be used by
case-insensitive filesystems that need something more custom than the
default one set by `struct generic_ci_dentry_ops`.

Reviewed-by: Gabriel Krisman Bertazi <[email protected]>
Signed-off-by: André Almeida <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7
# 4e32c25b 06-Sep-2024 Christian Brauner <[email protected]>

libfs: fix get_stashed_dentry()

get_stashed_dentry() tries to optimistically retrieve a stashed dentry
from a provided location. It needs to ensure to hold rcu lock before it
dereference the stashe

libfs: fix get_stashed_dentry()

get_stashed_dentry() tries to optimistically retrieve a stashed dentry
from a provided location. It needs to ensure to hold rcu lock before it
dereference the stashed location to prevent UAF issues. Use
rcu_dereference() instead of READ_ONCE() it's effectively equivalent
with some lockdep bells and whistles and it communicates clearly that
this expects rcu protection.

Link: https://lore.kernel.org/r/20240906-vfs-hotfix-5959800ffa68@brauner
Fixes: 07fd7c329839 ("libfs: add path_from_stashed()")
Reported-by: [email protected]
Fixes: [email protected]
Reported-by: [email protected]
Fixes: [email protected]
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


Revision tags: v6.11-rc6, v6.11-rc5, v6.11-rc4
# b381fbbc 15-Aug-2024 Mateusz Guzik <[email protected]>

vfs: elide smp_mb in iversion handling in the common case

According to bpftrace on these routines most calls result in cmpxchg,
which already provides the same guarantee.

In inode_maybe_inc_iversio

vfs: elide smp_mb in iversion handling in the common case

According to bpftrace on these routines most calls result in cmpxchg,
which already provides the same guarantee.

In inode_maybe_inc_iversion elision is possible because even if the
wrong value was read due to now missing smp_mb fence, the issue is going
to correct itself after cmpxchg. If it appears cmpxchg wont be issued,
the fence + reload are there bringing back previous behavior.

Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.11-rc3, v6.11-rc2
# 64a7ce76 31-Jul-2024 yangerkun <[email protected]>

libfs: fix infinite directory reads for offset dir

After we switch tmpfs dir operations from simple_dir_operations to
simple_offset_dir_operations, every rename happened will fill new dentry
to dest

libfs: fix infinite directory reads for offset dir

After we switch tmpfs dir operations from simple_dir_operations to
simple_offset_dir_operations, every rename happened will fill new dentry
to dest dir's maple tree(&SHMEM_I(inode)->dir_offsets->mt) with a free
key starting with octx->newx_offset, and then set newx_offset equals to
free key + 1. This will lead to infinite readdir combine with rename
happened at the same time, which fail generic/736 in xfstests(detail show
as below).

1. create 5000 files(1 2 3...) under one dir
2. call readdir(man 3 readdir) once, and get one entry
3. rename(entry, "TEMPFILE"), then rename("TEMPFILE", entry)
4. loop 2~3, until readdir return nothing or we loop too many
times(tmpfs break test with the second condition)

We choose the same logic what commit 9b378f6ad48cf ("btrfs: fix infinite
directory reads") to fix it, record the last_index when we open dir, and
do not emit the entry which index >= last_index. The file->private_data
now used in offset dir can use directly to do this, and we also update
the last_index when we llseek the dir file.

Fixes: a2e459555c5f ("shmem: stable directory offsets")
Signed-off-by: yangerkun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Chuck Lever <[email protected]>
[brauner: only update last_index after seek when offset is zero like Jan suggested]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.11-rc1
# 1da86618 15-Jul-2024 Matthew Wilcox (Oracle) <[email protected]>

fs: Convert aops->write_begin to take a folio

Convert all callers from working on a page to working on one page
of a folio (support for working on an entire folio can come later).
Removes a lot of f

fs: Convert aops->write_begin to take a folio

Convert all callers from working on a page to working on one page
of a folio (support for working on an entire folio can come later).
Removes a lot of folio->page->folio conversions.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.10
# a225800f 10-Jul-2024 Matthew Wilcox (Oracle) <[email protected]>

fs: Convert aops->write_end to take a folio

Most callers have a folio, and most implementations operate on a folio,
so remove the conversion from folio->page->folio to fit through this
interface.

R

fs: Convert aops->write_end to take a folio

Most callers have a folio, and most implementations operate on a folio,
so remove the conversion from folio->page->folio to fit through this
interface.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3
# 6a79a4e1 06-Jun-2024 Gabriel Krisman Bertazi <[email protected]>

libfs: Introduce case-insensitive string comparison helper

generic_ci_match can be used by case-insensitive filesystems to compare
strings under lookup with dirents in a case-insensitive way. This

libfs: Introduce case-insensitive string comparison helper

generic_ci_match can be used by case-insensitive filesystems to compare
strings under lookup with dirents in a case-insensitive way. This
function is currently reimplemented by each filesystem supporting
casefolding, so this reduces code duplication in filesystem-specific
code.

[[email protected]: rework to first test the exact match, cleanup
and add error message]

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Signed-off-by: Eugen Hristev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Gabriel Krisman Bertazi <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5
# ad191eb6 15-Apr-2024 Chuck Lever <[email protected]>

shmem: Fix shmem_rename2()

When renaming onto an existing directory entry, user space expects
the replacement entry to have the same directory offset as the
original one.

Link: https://gitlab.alpin

shmem: Fix shmem_rename2()

When renaming onto an existing directory entry, user space expects
the replacement entry to have the same directory offset as the
original one.

Link: https://gitlab.alpinelinux.org/alpine/aports/-/issues/15966
Fixes: a2e459555c5f ("shmem: stable directory offsets")
Signed-off-by: Chuck Lever <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


# 5a1a25be 15-Apr-2024 Chuck Lever <[email protected]>

libfs: Add simple_offset_rename() API

I'm about to fix a tmpfs rename bug that requires the use of
internal simple_offset helpers that are not available in mm/shmem.c

Signed-off-by: Chuck Lever <ch

libfs: Add simple_offset_rename() API

I'm about to fix a tmpfs rename bug that requires the use of
internal simple_offset helpers that are not available in mm/shmem.c

Signed-off-by: Chuck Lever <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


# 23cdd0ee 15-Apr-2024 Chuck Lever <[email protected]>

libfs: Fix simple_offset_rename_exchange()

User space expects the replacement (old) directory entry to have
the same directory offset after the rename.

Suggested-by: Christian Brauner <brauner@kern

libfs: Fix simple_offset_rename_exchange()

User space expects the replacement (old) directory entry to have
the same directory offset after the rename.

Suggested-by: Christian Brauner <[email protected]>
Fixes: a2e459555c5f ("shmem: stable directory offsets")
Signed-off-by: Chuck Lever <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1
# 9d9539db 12-Mar-2024 Christian Brauner <[email protected]>

pidfs: remove config option

As Linus suggested this enables pidfs unconditionally. A key property to
retain is the ability to compare pidfds by inode number (cf. [1]).
That's extremely helpful just

pidfs: remove config option

As Linus suggested this enables pidfs unconditionally. A key property to
retain is the ability to compare pidfds by inode number (cf. [1]).
That's extremely helpful just as comparing namespace file descriptors by
inode number is. They are used in a variety of scenarios where they need
to be compared, e.g., when receiving a pidfd via SO_PEERPIDFD from a
socket to trivially authenticate a the sender and various other
use-cases.

For 64bit systems this is pretty trivial to do. For 32bit it's slightly
more annoying as we discussed but we simply add a dumb ida based
allocator that gets used on 32bit. This gives the same guarantees about
inode numbers on 64bit without any overflow risk. Practically, we'll
never run into overflow issues because we're constrained by the number
of processes that can exist on 32bit and by the number of open files
that can exist on a 32bit system. On 64bit none of this matters and
things are very simple.

If 32bit also needs the uniqueness guarantee they can simply parse the
contents of /proc/<pid>/fd/<nr>. The uniqueness guarantees have a
variety of use-cases. One of the most obvious ones is that they will
make pidfiles (or "pidfdfiles", I guess) reliable as the unique
identifier can be placed into there that won't be reycled. Also a
frequent request.

Note, I took the chance and simplified path_from_stashed() even further.
Instead of passing the inode number explicitly to path_from_stashed() we
let the filesystem handle that internally. So path_from_stashed() ends
up even simpler than it is now. This is also a good solution allowing
the cleanup code to be clean and consistent between 32bit and 64bit. The
cleanup path in prepare_anon_dentry() is also switched around so we put
the inode before the dentry allocation. This means we only have to call
the cleanup handler for the filesystem's inode data once and can rely
->evict_inode() otherwise.

Aside from having to have a bit of extra code for 32bit it actually ends
up a nice cleanup for path_from_stashed() imho.

Tested on both 32 and 64bit including error injection.

Link: https://github.com/systemd/systemd/pull/31713 [1]
Link: https://lore.kernel.org/r/20240312-dingo-sehnlich-b3ecc35c6de7@brauner
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


Revision tags: v6.8, v6.8-rc7
# e9c5263c 01-Mar-2024 Christian Brauner <[email protected]>

libfs: improve path_from_stashed()

Right now we pass a bunch of info that is fs specific which doesn't make
a lot of sense and it bleeds fs sepcific details into the generic
helper. nsfs and pidfs h

libfs: improve path_from_stashed()

Right now we pass a bunch of info that is fs specific which doesn't make
a lot of sense and it bleeds fs sepcific details into the generic
helper. nsfs and pidfs have slightly different needs when initializing
inodes. Add simple operations that are stashed in sb->s_fs_info that
both can implement. This also allows us to get rid of cleaning up
references in the caller. All in all path_from_stashed() becomes way
simpler.

Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.8-rc6
# 2558e3b2 21-Feb-2024 Christian Brauner <[email protected]>

libfs: add stashed_dentry_prune()

Both pidfs and nsfs use a memory location to stash a dentry for reuse by
concurrent openers. Right now two custom
dentry->d_prune::{ns,pidfs}_prune_dentry() methods

libfs: add stashed_dentry_prune()

Both pidfs and nsfs use a memory location to stash a dentry for reuse by
concurrent openers. Right now two custom
dentry->d_prune::{ns,pidfs}_prune_dentry() methods are needed that do
the same thing. The only thing that differs is that they need to get to
the memory location to store or retrieve the dentry from differently.
Fix that by remember the stashing location for the dentry in
dentry->d_fsdata which allows us to retrieve it in dentry->d_prune. That
in turn makes it possible to add a common helper that pidfs and nsfs can
both use.

Link: https://lore.kernel.org/r/CAHk-=wg8cHY=i3m6RnXQ2Y2W8psicKWQEZq1=94ivUiviM-0OA@mail.gmail.com
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.8-rc5
# 159a0d9f 18-Feb-2024 Christian Brauner <[email protected]>

libfs: improve path_from_stashed() helper

In earlier patches we moved both nsfs and pidfs to path_from_stashed().
The helper currently tries to add and stash a new dentry if a reusable
dentry couldn

libfs: improve path_from_stashed() helper

In earlier patches we moved both nsfs and pidfs to path_from_stashed().
The helper currently tries to add and stash a new dentry if a reusable
dentry couldn't be found and returns EAGAIN if it lost the race to stash
the dentry. The caller can use EAGAIN to retry.

The helper and the two filesystems be written in a way that makes
returning EAGAIN unnecessary. To do this we need to change the
dentry->d_prune() implementation of nsfs and pidfs to not simply replace
the stashed dentry with NULL but to use a cmpxchg() and only replace
their own dentry.

Then path_from_stashed() can then be changed to not just stash a new
dentry when no dentry is currently stashed but also when an already dead
dentry is stashed. If another task managed to install a dentry in the
meantime it can simply be reused. Pack that into a loop and call it a
day.

Suggested-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/CAHk-=wgtLF5Z5=15-LKAczWm=-tUjHO+Bpf7WjBG+UU3s=fEQw@mail.gmail.com
Signed-off-by: Christian Brauner <[email protected]>

show more ...


12345678910