History log of /linux-6.15/fs/file.c (Results 1 – 25 of 194)
Revision  Date  Author  Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3
# d1f7256a 18-Apr-2025 Mateusz Guzik <[email protected]>

fs: fall back to file_ref_put() for non-last reference

This reduces the slowdown in the face of multiple callers issuing close
on what turns out not to be the last reference.

Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/[email protected]
Reviewed-by: Jan Kara <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
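
As a rough illustration, here is a minimal userspace model of the
close-optimized put with its new fallback (hypothetical names and
constants; the real code lives in file_ref_put_close(), and a fuller
model of the generic put appears under the "fs: add file_ref" entry
below):

#include <stdatomic.h>
#include <stdbool.h>
#include <limits.h>

#define REF_ONEREF 0L             /* one held reference is stored as 0 */
#define REF_DEAD   (LONG_MIN / 2) /* middle of the dead zone */

/* generic decrement-based put; modeled under "fs: add file_ref" below */
bool ref_put(atomic_long *cnt);

/* close(2) path: assume we hold the last reference and try to go
 * straight to DEAD with a single cmpxchg; when other holders exist
 * (the cmpxchg fails), fall back to the generic put instead of
 * retrying the cmpxchg */
static bool ref_put_close(atomic_long *cnt)
{
        long expected = REF_ONEREF;

        if (atomic_compare_exchange_strong(cnt, &expected, REF_DEAD))
                return true;    /* last reference: one atomic total */
        return ref_put(cnt);    /* non-last reference: plain decrement */
}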



Revision tags: v6.15-rc2, v6.15-rc1, v6.14
# 4dec4f91 20-Mar-2025 Mateusz Guzik <[email protected]>

fs: sort out fd allocation vs dup2 race commentary, take 2

fd_install() has a questionable comment above it.

While it correctly points out a possible race against dup2(), it states:
> We need to detect this and fput() the struct file we are about to
> overwrite in this case.
>
> It should never happen - if we allow dup2() do it, _really_ bad things
> will follow.

I have difficulty parsing the above. The first sentence would suggest
fd_install() tries to detect and recover from the race (it does not),
the next one claims the race needs to be dealt with (it is, by dup2()).

Given that fd_install() does not suffer the burden, this patch removes
the above and instead expands on the race in dup2() commentary.

While here tidy up the docs around fd_install().

Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
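
For context, the race handling in do_dup2() has roughly the following
shape (a simplified sketch, not the verbatim kernel code): a descriptor
that is reserved in the bitmap but has no file installed yet belongs to
an in-flight open, so dup2() backs off rather than installing over it.

/* in do_dup2(), under ->file_lock (sketch) */
tofree = fdt->fd[fd];
if (!tofree && fd_is_open(fd, fdt))
        goto Ebusy;  /* fd reserved by alloc_fd(), fd_install() pending */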



# 5370b43e 19-Mar-2025 Mateusz Guzik <[email protected]>

fs: reduce work in fdget_pos()

1. predict the file was found
2. explicitly compare the ref to "one", ignoring the dead zone

The latter arguably improves the behavior to begin with. Suppose the
count turned bad -- the previously used ref routine would check for it
and return 0, indicating the count does not necessitate taking
->f_pos_lock. But there may very well be several users.

I.e., not paying for special-casing the dead zone improves semantics.

While here spell out each condition in a dedicated if statement. This
has no effect on generated code.

Sizes are as follows (in bytes; gcc 13, x86-64):
stock: 321
likely(): 298
likely()+ref: 280

Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
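
A sketch of the two tweaks (paraphrased, not the exact kernel diff;
FMODE_ATOMIC_POS, f_pos_lock and FILE_REF_ONEREF are real names):

/* 1. predict that the fd lookup succeeded */
if (likely(file)) {
        if (file->f_mode & FMODE_ATOMIC_POS) {
                /*
                 * 2. take ->f_pos_lock unless the count is exactly
                 * "one"; a count in the dead zone compares unequal
                 * and thus also takes the lock, which is the safer
                 * behavior when there may be several users.
                 */
                if (file_ref_read(&file->f_ref) != FILE_REF_ONEREF)
                        mutex_lock(&file->f_pos_lock);
        }
}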



Revision tags: v6.14-rc7
# f381640e 13-Mar-2025 Mateusz Guzik <[email protected]>

fs: consistently deref the files table with rcu_dereference_raw()

... except when the table is known to be only used by one thread.

A file pointer can get installed at any moment despite the ->file_lock
being held, ever since the following commit:
8a81252b774b53e6 ("fs/file.c: don't acquire files->file_lock in fd_install()")

Accesses subject to such a race can in principle suffer load tearing.

While here redo the comment in dup_fd -- it only covered a race against
files showing up, still assuming fd_install() takes the lock.

Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
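
The access pattern being made consistent, sketched (simplified from
dup_fd(); rcu_dereference_raw() and rcu_assign_pointer() are the real
primitives):

for (i = 0; i < open_files; i++) {
        /* slot may be concurrently written by a lockless fd_install() */
        struct file *f = rcu_dereference_raw(old_fdt->fd[i]);

        if (f)
                get_file(f);
        rcu_assign_pointer(new_fdt->fd[i], f);
}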



# dc530c44 12-Mar-2025 Mateusz Guzik <[email protected]>

fs: use debug-only asserts around fd allocation and install

This also restores the check which got removed in 52732bb9abc9ee5b
("fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()")
for performance reasons -- those reasons no longer apply to a debug-only
variant.

Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
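
The pattern in a self-contained form (illustrative names, not the
kernel's macros): the assert is a real check in debug builds and
compiles away entirely otherwise, so production configurations pay
nothing for it.

#include <assert.h>
#include <stddef.h>

#ifdef DEBUG_FDTABLE    /* illustrative config knob */
#define FD_BUG_ON(cond) assert(!(cond))
#else
#define FD_BUG_ON(cond) do { (void)sizeof(cond); } while (0)
#endif

static void install_slot(void **slots, unsigned int fd, void *file)
{
        /* the slot being filled must still be empty */
        FD_BUG_ON(slots[fd] != NULL);
        slots[fd] = file;
}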



Revision tags: v6.14-rc6
# e8358845 05-Mar-2025 Mateusz Guzik <[email protected]>

file: add fput and file_ref_put routines optimized for use when closing a fd

The vast majority of the time, closing a file descriptor also operates
on the last reference, where regular fput() usage results in 2 atomics.
This can be changed to suffer only 1.

See commentary above file_ref_put_close() for more information.

Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>



Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2
# da06e3c5 07-Feb-2025 Christian Brauner <[email protected]>

fs: don't needlessly acquire f_lock

Before 2011 there was no meaningful synchronization between
read/readdir/write/seek. Only in commit
ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
was synchronization added for SEEK_CUR, by taking f_lock around
vfs_setpos().

Then in 2014 full synchronization between read/readdir/write/seek was
added in commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX")
by introducing f_pos_lock for regular files with FMODE_ATOMIC_POS and
for directories. At that point taking f_lock became unnecessary for such
files.

So only acquire f_lock for SEEK_CUR if this isn't a file that would have
acquired f_pos_lock if necessary.

Link: https://lore.kernel.org/r/20250207-daten-mahlzeit-99d2079864fb@brauner
Signed-off-by: Christian Brauner <[email protected]>
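
The resulting condition, sketched (paraphrased from the llseek path;
FMODE_ATOMIC_POS, f_lock and vfs_setpos() are real names, the exact
guard in the tree may differ):

case SEEK_CUR:
        if (file->f_mode & FMODE_ATOMIC_POS) {
                /* ->f_pos_lock already serializes relative seeks */
                offset = vfs_setpos(file, file->f_pos + offset, maxsize);
        } else {
                /* legacy files still need f_lock for SEEK_CUR */
                spin_lock(&file->f_lock);
                offset = vfs_setpos(file, file->f_pos + offset, maxsize);
                spin_unlock(&file->f_lock);
        }
        break;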



Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2
# ec052fae 05-Dec-2024 Mateusz Guzik <[email protected]>

fs: sort out a stale comment about races between fd alloc and dup2

It claims the issue is only relevant for shared descriptor tables, which
is of no concern for POSIX (but then, is POSIX of concern to anyone
today?) -- a claim I presume predates standardized threading.

The comment also mentions the following systems:
- OpenBSD installing a larval file -- they moved away from it, file is
installed late and EBUSY is returned on conflict
- FreeBSD returning EBADF -- reworked to install the file early like
OpenBSD used to do
- NetBSD "deadlocks in amusing ways" -- their solution looks
Solaris-inspired (not a compliment) and I would not be particularly
surprised if it indeed deadlocked, in amusing ways or otherwise

I don't believe mentioning any of these adds anything and the statement
about the issue not being POSIX-relevant is outdated.

dup2 description in POSIX still does not mention the problem.

Just shorten the comment and be done with it.

Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>



# 2b2fc0be 17-Dec-2024 Zhang Kunbo <[email protected]>

fs: fix missing declaration of init_files

fs/file.c should include include/linux/init_task.h for the
declaration of init_files. This fixes the sparse warning:

fs/file.c:501:21: warning: symbol 'init_files' was not declared. Should it be static?

Signed-off-by: Zhang Kunbo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>



Revision tags: v6.13-rc1, v6.12
# a48bdf80 16-Nov-2024 Mateusz Guzik <[email protected]>

fs: delay sysctl_nr_open check in expand_files()

Suppose a thread sharing the table started a resize, while
sysctl_nr_open got lowered to a value which prohibits it. This is still
going to go through with and without the patch, which is fine.

Further suppose another thread shows up to do a matching expansion while
resize_in_progress == true. It is going to error out since it performs
the sysctl_nr_open check *before* finding out if there is an expansion
in progress. But the aforementioned thread is going to succeed, so the
error is spurious (and it would not happen if the thread showed up a
little bit later).

Checking the sysctl *after* we know there are no pending updates sorts
it out.

While here annotate the thing as unlikely.

Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
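
The reordering, in outline (simplified control flow; the names follow
fs/file.c but the excerpt is paraphrased):

repeat:
        fdt = files_fdtable(files);
        if (nr < fdt->max_fds)
                return 0;       /* table already large enough */

        /* wait out a pending resize *before* judging the sysctl */
        if (unlikely(files->resize_in_progress)) {
                spin_unlock(&files->file_lock);
                wait_event(files->resize_wait, !files->resize_in_progress);
                spin_lock(&files->file_lock);
                goto repeat;
        }

        /* only now can a failed limit check be trusted */
        if (unlikely(nr >= sysctl_nr_open))
                return -EMFILE;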



Revision tags: v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3
# 90ee6ed7 07-Oct-2024 Christian Brauner <[email protected]>

fs: port files to file_ref

Port files to rely on file_ref reference to improve scaling and gain
overflow protection.

- We continue to WARN during get_file() in case a file that is already
marked dead is revived, as get_file() is only valid if the caller
already holds a reference to the file. This behavior hasn't changed;
just the check has.

- The semantics for epoll and ttm's dmabuf usage have changed. Both
epoll and ttm synchronize with __fput() to prevent the underlying file
from being freed.

(1) epoll

Explaining epoll is straightforward using a simple diagram.
Essentially, the mutex of the epoll instance needs to be taken in both
__fput() and around epi_fget() preventing the file from being freed
while it is polled or preventing the file from being resurrected.

CPU1                                        CPU2
fput(file)
-> __fput(file)
   -> eventpoll_release(file)
      -> eventpoll_release_file(file)
                                            mutex_lock(&ep->mtx)
                                            epi_item_poll()
                                            -> epi_fget()
                                               -> file_ref_get(file)
                                            mutex_unlock(&ep->mtx)
         mutex_lock(&ep->mtx);
         __ep_remove()
         mutex_unlock(&ep->mtx);
-> kmem_cache_free(file)

(2) ttm dmabuf

This explanation is a bit more involved. A regular dmabuf file stashes
the dmabuf in file->private_data and the file in dmabuf->file:

file->private_data = dmabuf;
dmabuf->file = file;

The generic release method of a dmabuf file handles file specific
things:

f_op->release::dma_buf_file_release()

while the generic dentry release method of a dmabuf handles dmabuf
freeing including driver specific things:

dentry->d_release::dma_buf_release()

During ttm dmabuf initialization in ttm_object_device_init() the ttm
driver copies the provided struct dma_buf_ops into a private location:

struct ttm_object_device {
        spinlock_t object_lock;
        struct dma_buf_ops ops;
        void (*dmabuf_release)(struct dma_buf *dma_buf);
        struct idr idr;
};

ttm_object_device_init(const struct dma_buf_ops *ops)
{
        // copy original dma_buf_ops in private location
        tdev->ops = *ops;

        // stash the release method of the original struct dma_buf_ops
        tdev->dmabuf_release = tdev->ops.release;

        // override the release method in the copy of the struct dma_buf_ops
        // with ttm's own dmabuf release method
        tdev->ops.release = ttm_prime_dmabuf_release;
}

When a new dmabuf is created the struct dma_buf_ops with the overridden
release method set to ttm_prime_dmabuf_release is passed in exp_info.ops:

DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
exp_info.ops = &tdev->ops;
exp_info.size = prime->size;
exp_info.flags = flags;
exp_info.priv = prime;

The call to dma_buf_export() then sets

mutex_lock_interruptible(&prime->mutex);
dma_buf = dma_buf_export(&exp_info)
{
        dmabuf->ops = exp_info->ops;
}
mutex_unlock(&prime->mutex);

which creates a new dmabuf file and then installs a file descriptor for
it in the caller's file descriptor table:

ret = dma_buf_fd(dma_buf, flags);

When that dmabuf file is closed we now get:

fput(file)
-> __fput(file)
   -> f_op->release::dma_buf_file_release()
   -> dput()
      -> d_op->d_release::dma_buf_release()
         -> dmabuf->ops->release::ttm_prime_dmabuf_release()
            mutex_lock(&prime->mutex);
            if (prime->dma_buf == dma_buf)
                    prime->dma_buf = NULL;
            mutex_unlock(&prime->mutex);

Where we can see that prime->dma_buf is set to NULL. So when we have
the following diagram:

CPU1                                        CPU2
fput(file)
-> __fput(file)
   -> f_op->release::dma_buf_file_release()
   -> dput()
      -> d_op->d_release::dma_buf_release()
         -> dmabuf->ops->release::ttm_prime_dmabuf_release()
                                            ttm_prime_handle_to_fd()
                                            mutex_lock_interruptible(&prime->mutex)
                                            dma_buf = prime->dma_buf
                                            dma_buf && get_dma_buf_unless_doomed(dma_buf)
                                            -> file_ref_get(dma_buf->file)
                                            mutex_unlock(&prime->mutex);

            mutex_lock(&prime->mutex);
            if (prime->dma_buf == dma_buf)
                    prime->dma_buf = NULL;
            mutex_unlock(&prime->mutex);
-> kmem_cache_free(file)

The logic of the mechanism is the same as for epoll: sync with
__fput() preventing the file from being freed. Here the
synchronization happens through the ttm instance's prime->mutex.
Basically, the lifetime of the dma_buf and the file are tightly
coupled.

Both (1) and (2) used to call atomic_inc_not_zero() to check whether
the file has already been marked dead and then refuse to revive it.

This is only safe because both (1) and (2) sync with __fput() and thus
prevent kmem_cache_free() on the file being called and thus prevent
the file from being immediately recycled due to SLAB_TYPESAFE_BY_RCU.

Both (1) and (2) have been ported from atomic_inc_not_zero() to
file_ref_get(). That means a file that is already in the process of
being marked as FILE_REF_DEAD:

file_ref_put()
  cnt = atomic_long_dec_return()
  -> __file_ref_put(cnt)
     if (cnt == FILE_REF_NOREF)
             atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)

can be revived again:

CPU1                                        CPU2
file_ref_put()
cnt = atomic_long_dec_return()
-> __file_ref_put(cnt)
   if (cnt == FILE_REF_NOREF)
                                            file_ref_get()
                                            // Brings reference back to FILE_REF_ONEREF
                                            atomic_long_add_negative()
   atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)

This is fine and inherent to the file_ref_get()/file_ref_put()
semantics. For both (1) and (2) this is safe because __fput() is
prevented from making progress if file_ref_get() fails due to the
aforementioned synchronization mechanisms.

Two cases need to be considered that affect both (1) epoll and (2) ttm
dmabuf:

(i) fput()'s file_ref_put() marks the file as FILE_REF_NOREF, but
before that fput() can mark the file as FILE_REF_DEAD someone
manages to sneak in a file_ref_get() and brings the refcount back
from FILE_REF_NOREF to FILE_REF_ONEREF. In that case the original
fput() doesn't call __fput(). For epoll the poll will finish and
for ttm dmabuf the file can be used again. For ttm dmabuf this is
actually an advantage because it avoids immediately allocating
a new dmabuf object.

CPU1                                        CPU2
file_ref_put()
cnt = atomic_long_dec_return()
-> __file_ref_put(cnt)
   if (cnt == FILE_REF_NOREF)
                                            file_ref_get()
                                            // Brings reference back to FILE_REF_ONEREF
                                            atomic_long_add_negative()
   atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)

(ii) fput()'s file_ref_put() marks the file FILE_REF_NOREF and
also succeeds in actually marking it FILE_REF_DEAD, and then calls
into __fput() to free the file.

When either (1) or (2) call file_ref_get() they fail, as
atomic_long_add_negative() will return true.

At the same time, both (1) and (2) call file_ref_get() under
mutexes that __fput() must also acquire, preventing
kmem_cache_free() from freeing the file.

So while this might be treated as a change in semantics for (1) and
(2), it really isn't. If it should end up causing issues, this can be
fixed by adding a helper that does something like:

long cnt = atomic_long_read(&ref->refcnt);
do {
        if (cnt < 0)
                return false;
} while (!atomic_long_try_cmpxchg(&ref->refcnt, &cnt, cnt + 1));
return true;

which would block FILE_REF_NOREF to FILE_REF_ONEREF transitions.

- Jann correctly pointed out that kmem_cache_zalloc() cannot be used
anymore once files have been ported to file_ref_t.

The kmem_cache_zalloc() call will memset() the whole struct file to
zero when it is reallocated. This will also set file->f_ref to zero
which means that a concurrent file_ref_get() can return true:

CPU1                                        CPU2
__get_file_rcu()
rcu_dereference_raw()
                                            close()
                                            [frees file]
                                            alloc_empty_file()
                                            kmem_cache_zalloc()
                                            [reallocates same file]
                                            memset(..., 0, ...)
file_ref_get()
[increments 0->1, returns true]
                                            init_file()
                                            file_ref_init(..., 1)
                                            [sets to 0]
rcu_dereference_raw()
fput()
file_ref_put()
[decrements 0->FILE_REF_NOREF, frees file]
                                            [UAF]

causing a concurrent __get_file_rcu() call to acquire a reference to
the file that is about to be reallocated and immediately freeing it
on realizing that it has been recycled. This causes a UAF for the
task that reallocated/recycled the file.

This is prevented by switching from kmem_cache_zalloc() to
kmem_cache_alloc() and initializing the fields manually, with
file->f_ref initialized last.

Note that a memset() also isn't guaranteed to atomically update an
unsigned long so it's theoretically possible to see torn and
therefore bogus counter values.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
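
A userspace model of the init-order fix (illustrative struct; the kernel
change is in init_file()): initialize every field first and publish the
reference count last, with release semantics, so a concurrent
file_ref_get() can never see a half-initialized object that looks alive.

#include <stdatomic.h>
#include <stddef.h>

struct file_model {
        int          f_flags;
        void        *private_data;
        atomic_long  f_ref;             /* must be initialized last */
};

static void init_file_model(struct file_model *f, int flags)
{
        /* no memset(): zeroing would also reset f_ref non-atomically,
         * briefly making a recycled object acquirable by a racer */
        f->f_flags = flags;
        f->private_data = NULL;

        /* publish the count only after everything else is set up;
         * 0 encodes "one reference" in the file_ref scheme */
        atomic_store_explicit(&f->f_ref, 0, memory_order_release);
}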



# 08ef26ea 07-Oct-2024 Christian Brauner <[email protected]>

fs: add file_ref

As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop it has
O(N^2) behaviour under contention with N concurrent operations and it is
in a hot path in __fget_files_rcu().

The rcuref infrastructure remedies this problem by using an
unconditional increment, relying on safe and dead zones to make this
work and requiring RCU protection for the data structure in question.
This not only scales better, it also introduces overflow protection.

However, in contrast to generic rcuref, files require a memory barrier
and thus cannot rely on *_relaxed() atomic operations; they also need
to be built on atomic_long_t, as massive numbers of references aren't
unheard of, even if only as an attack.

As suggested by Linus, add a file-specific variant instead of making
this a generic library.

Files are SLAB_TYPESAFE_BY_RCU and thus don't have "regular" rcu
protection. In short, freeing of files isn't delayed until a grace
period has elapsed. Instead, they are freed immediately and thus can be
reused (multiple times) within the same grace period.

So when picking a file from the file descriptor table via its file
descriptor number it is possible to see an elevated reference count
on file->f_count even though the file has already been recycled possibly
multiple times by another task.

To guard against this the vfs will pick the file from the file
descriptor table twice. Once before the refcount increment and once
after to compare the pointers (grossly simplified). If they match then
the file is still valid. If not the caller needs to fput() it.

The unconditional increment makes the following race possible as
illustrated by rcuref:

> Deconstruction race
> ===================
>
> The release operation must be protected by prohibiting a grace period in
> order to prevent a possible use after free:
>
> T1                                      T2
> put()                                   get()
> // ref->refcnt = ONEREF
> if (!atomic_add_negative(-1, &ref->refcnt))
>         return false; <- Not taken
>
> // ref->refcnt == NOREF
> --> preemption
>                                         // Elevates ref->refcnt to ONEREF
>                                         if (!atomic_add_negative(1, &ref->refcnt))
>                                                 return true; <- taken
>
>                                         if (put(&p->ref)) { <-- Succeeds
>                                                 remove_pointer(p);
>                                                 kfree_rcu(p, rcu);
>                                         }
>
> RCU grace period ends, object is freed
>
> atomic_cmpxchg(&ref->refcnt, NOREF, DEAD); <- UAF
>
> [...] it prevents the grace period which keeps the object alive until
> all put() operations complete.

Having files be SLAB_TYPESAFE_BY_RCU shouldn't cause any problems for
this deconstruction race. Afaict, the only interesting case would be
someone freeing the file and someone immediately recycling it within the
same grace period and reinitializing file->f_count to ONEREF while a
concurrent fput() is doing atomic_cmpxchg(&ref->refcnt, NOREF, DEAD) as
in the race above.

But this is safe from SLAB_TYPESAFE_BY_RCU's perspective and it should
be safe from rcuref's perspective.

T1                            T2                            T3
fput()                        fget()
// f_count->refcnt = ONEREF
if (!atomic_add_negative(-1, &f_count->refcnt))
        return false; <- Not taken

// f_count->refcnt == NOREF
--> preemption
                              // Elevates f_count->refcnt to ONEREF
                              if (!atomic_add_negative(1, &f_count->refcnt))
                                      return true; <- taken

                              if (put(&f_count)) { <-- Succeeds
                                      remove_pointer(p);
                                      /*
                                       * Cache is SLAB_TYPESAFE_BY_RCU
                                       * so this is freed without a grace period.
                                       */
                                      kmem_cache_free(p);
                              }

                                                            kmem_cache_alloc()
                                                            init_file() {
                                                                    // Sets f_count->refcnt to ONEREF
                                                                    rcuref_long_init(&f->f_count, 1);
                                                            }

Object has been reused within the same grace period
via kmem_cache_alloc()'s SLAB_TYPESAFE_BY_RCU.

/*
 * With SLAB_TYPESAFE_BY_RCU this would be a safe UAF access and
 * it would work correctly because the atomic_cmpxchg()
 * will fail because the refcount has been reset to ONEREF by T3.
 */
atomic_cmpxchg(&ref->refcnt, NOREF, DEAD); <- UAF

However, there are other cases to consider:

(1) Benign race due to multiple atomic_long_read()

CPU1                                        CPU2

file_ref_put()
// last reference
// => count goes negative/FILE_REF_NOREF
atomic_long_add_negative_release(-1, &ref->refcnt)
-> __file_ref_put()
                                            file_ref_get()
                                            // goes back from negative/FILE_REF_NOREF to 0
                                            // and file_ref_get() succeeds
                                            atomic_long_add_negative(1, &ref->refcnt)

                                            // This is immediately followed by file_ref_put()
                                            // managing to set FILE_REF_DEAD
                                            file_ref_put()

   // __file_ref_put() continues and sees
   // cnt > FILE_REF_RELEASED and splats with
   // "imbalanced put on file reference count"
   cnt = atomic_long_read(&ref->refcnt);

The race however is benign and the problem is the separate
atomic_long_read(). Instead of performing that read, this uses
atomic_long_dec_return() and passes the value to __file_ref_put().
Thanks to Linus for pointing out that braino.

(2) SLAB_TYPESAFE_BY_RCU may cause recycled files to be marked dead

When a file is recycled the following race exists:

CPU1                                        CPU2
// @file is already dead and thus
// cnt >= FILE_REF_RELEASED.
file_ref_get(file)
atomic_long_add_negative(1, &ref->refcnt)
// We thus call into __file_ref_get()
-> __file_ref_get()

   // which sees cnt >= FILE_REF_RELEASED
   cnt = atomic_long_read(&ref->refcnt);
                                            // In the meantime @file gets freed
                                            kmem_cache_free()

                                            // and is immediately recycled
                                            file = kmem_cache_zalloc()
                                            // and the reference count is reinitialized
                                            // and the file alive again in someone
                                            // else's file descriptor table
                                            file_ref_init(&ref->refcnt, 1);

   // the __file_ref_get() slowpath now continues
   // and as it saw earlier that cnt >= FILE_REF_RELEASED
   // it wants to ensure that we're staying in the middle
   // of the deadzone and unconditionally sets
   // FILE_REF_DEAD.
   // This marks @file dead for CPU2...
   atomic_long_set(&ref->refcnt, FILE_REF_DEAD);

                                            // Caller issues a close() system call to close @file
                                            close(fd)
                                            file = file_close_fd_locked()
                                            filp_flush()
                                            // The caller sees that cnt >= FILE_REF_RELEASED
                                            // and warns the first time...
                                            CHECK_DATA_CORRUPTION(file_count(file) == 0)

                                            // and then splats a second time because
                                            // __file_ref_put() sees cnt >= FILE_REF_RELEASED
                                            file_ref_put(&ref->refcnt);
                                            -> __file_ref_put()

My initial inclination was to replace the unconditional
atomic_long_set() with an atomic_long_try_cmpxchg() but Linus
pointed out that:

> I think we should just make file_ref_get() do a simple
>
> return !atomic_long_add_negative(1, &ref->refcnt));
>
> and nothing else. Yes, multiple CPU's can race, and you can increment
> more than once, but the gap - even on 32-bit - between DEAD and
> becoming close to REF_RELEASED is so big that we simply don't care.
> That's the point of having a gap.

I've been testing this with will-it-scale using fstat() on a machine
that Jens gave me access to (thank you very much!):

processor : 511
vendor_id : AuthenticAMD
cpu family : 25
model : 160
model name : AMD EPYC 9754 128-Core Processor

and I consistently get a 3-5% improvement on 256+ threads.

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Closes: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
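
A compact runnable userspace model of the scheme (constants simplified
from the kernel's file_ref; one held reference is stored as 0 and the
dead zone lies far below it): get is an unconditional increment whose
sign reveals whether the object was alive, and put only attempts a
cmpxchg on the final decrement.

#include <stdatomic.h>
#include <stdbool.h>
#include <limits.h>

#define REF_NOREF  -1L              /* decremented past the last ref */
#define REF_DEAD   (LONG_MIN / 2)   /* middle of the dead zone */

static bool ref_get(atomic_long *cnt)
{
        /* unconditional increment: no cmpxchg loop, hence no O(N^2)
         * behavior under contention; a negative result means the
         * counter was already in the dead zone and no undo is needed,
         * the gap to overflow being enormous by design */
        return atomic_fetch_add(cnt, 1) + 1 >= 0;
}

static bool ref_put(atomic_long *cnt)
{
        long new = atomic_fetch_sub(cnt, 1) - 1;

        if (new >= 0)
                return false;       /* other references remain */

        /* last reference: try to park the counter at DEAD */
        long expected = REF_NOREF;
        return atomic_compare_exchange_strong(cnt, &expected, REF_DEAD);
}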



Revision tags: v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4
# 6a8126f0 14-Aug-2024 Al Viro <[email protected]>

expand_files(): simplify calling conventions

All callers treat 0 and 1 returned by expand_files() in the same way
now since the call in alloc_fd() had been made conditional. Just make
it return 0 on success and be done with it...

Signed-off-by: Al Viro <[email protected]>



# b8ea429d 21-Aug-2024 Al Viro <[email protected]>

make __set_open_fd() set cloexec state as well

->close_on_exec[] state is maintained only for opened descriptors;
as a result, anything that marks a descriptor opened has to
set its cloexec state explicitly.

Consequently, all calls of __set_open_fd() are followed by
__set_close_on_exec(); might as well fold it into __set_open_fd()
so that cloexec state is defined as soon as the descriptor is
marked opened.

[braino fix folded]

Signed-off-by: Al Viro <[email protected]>



# e880d33b 14-Aug-2024 Al Viro <[email protected]>

file.c: merge __{set,clear}_close_on_exec()

they always go in pairs; seeing that they are inlined, might
as well make that a single inline function taking a boolean
argument ("do we want close_on_exec set for that descriptor")

Signed-off-by: Al Viro <[email protected]>
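
Together with the preceding entry, the helpers end up with roughly this
shape (a sketch with simplified context; __set_bit()/__clear_bit() are
the real bitmap primitives):

static inline void __set_close_on_exec(unsigned int fd,
                                       struct fdtable *fdt, bool set)
{
        if (set)
                __set_bit(fd, fdt->close_on_exec);
        else
                __clear_bit(fd, fdt->close_on_exec);
}

static inline void __set_open_fd(unsigned int fd,
                                 struct fdtable *fdt, bool cloexec)
{
        __set_bit(fd, fdt->open_fds);
        /* cloexec state is defined the moment the fd is marked open */
        __set_close_on_exec(fd, fdt, cloexec);
}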



Revision tags: v6.11-rc3
# 1d3b4bec 07-Aug-2024 Al Viro <[email protected]>

alloc_fdtable(): change calling conventions.

First of all, tell it how many slots we want, not which slot
is wanted. It makes one caller (dup_fd()) more straightforward
and doesn't harm another (expand_fdtable()).

Furthermore, make it return ERR_PTR() on failure rather than
returning NULL. Simplifies the callers.

Simplify the size calculation, while we are at it - note that we
always have slots_wanted greater than BITS_PER_LONG. What the
rules boil down to is
* use the smallest power of two large enough to give us
that many slots
* on 32bit skip 64 and 128 - the minimal capacity we want
there is 256 slots (i.e. 1Kb fd array).
* on 64bit don't skip anything, the minimal capacity is
128 - and we'll never be asked for 64 or less. 128 slots means
1Kb fd array, again.
* on 128bit, if that ever happens, don't skip anything -
we'll never be asked for 128 or less, so the fd array allocation
will be at least 2Kb.

Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
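
The sizing rule boils down to a few lines; a runnable userspace model
(hypothetical function name; assumes slots_wanted > BITS_PER_LONG, as
the text guarantees):

#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * (int)sizeof(long))

static unsigned long fdtable_slots(unsigned long slots_wanted)
{
        /* floor: 1KiB worth of "struct file *" slots, i.e. 256 slots
         * on 32-bit and 128 slots on 64-bit */
        unsigned long n = 1024 / sizeof(void *);

        /* smallest power of two giving us that many slots */
        while (n < slots_wanted)
                n <<= 1;
        return n;
}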



Revision tags: v6.11-rc2, v6.11-rc1
# 0c40bf47 17-Jul-2024 Yu Ma <[email protected]>

fs/file.c: add fast path in find_next_fd()

Skip the 2-level search via find_next_zero_bit() when there is a free slot in
the word containing next_fd, as:
(1) next_fd indicates the lower bound for the first free fd.
(2) There is a fast path inside find_next_zero_bit() when size <= 64 to speed
up the search.
(3) Once the fdt is expanded (the bitmap size doubles with each expansion), it
is never shrunk. The search size keeps increasing even though few free fds are
typically available in that range.

This fast path was proposed by Mateusz Guzik <[email protected]> and agreed to
by Jan Kara <[email protected]>; it is more generic and scalable than previous
versions. On top of patches 1 and 2, it improves pts/blogbench-1.1.0 read by
8% and write by 4% on an Intel ICX 160-core configuration with v6.10-rc7.

Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
Signed-off-by: Yu Ma <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
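
A runnable userspace model of the fast path (hypothetical names; assumes
the bitmap length is a multiple of BITS_PER_LONG, which holds for fdt
sizes): inspect the single word containing next_fd before falling back
to the full two-level search.

#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * (int)sizeof(long))

/* stand-in for the full two-level search over the bitmaps */
unsigned long find_next_fd_slow(const unsigned long *open_fds,
                                unsigned long next_fd);

static unsigned long find_next_fd_fast(const unsigned long *open_fds,
                                       unsigned long next_fd)
{
        unsigned long word = open_fds[next_fd / BITS_PER_LONG];

        /* treat bits below next_fd as taken: next_fd is a lower bound */
        word |= (1UL << (next_fd % BITS_PER_LONG)) - 1;

        if (~word)      /* a free slot remains in this word */
                return next_fd / BITS_PER_LONG * BITS_PER_LONG
                        + (unsigned long)__builtin_ctzl(~word);

        return find_next_fd_slow(open_fds, next_fd);
}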



# c9a30196 17-Jul-2024 Yu Ma <[email protected]>

fs/file.c: conditionally clear full_fds

64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
likely that the bit in full_fds_bits has already been cleared by an earlier
__clear_open_fds() operation. Check whether the bit in full_fds_bits is
already clear before clearing it, to avoid an unnecessary write and the
resulting cache bouncing. See commit fc90888d07b8 ("vfs: conditionally clear
close-on-exec flag") for a similar optimization. Taking the stock kernel with
patch 1 as baseline, this improves pts/blogbench-1.1.0 read by 13% and write
by 5% on an Intel ICX 160-core configuration with v6.10-rc7.

Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
Signed-off-by: Yu Ma <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
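
The test-before-clear pattern, sketched (illustrative helper name;
test_bit()/__clear_bit() are real kernel primitives): the read-only test
leaves the cacheline in shared state when the bit is already clear, and
only dirties it when a write is actually needed.

static inline void clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
        __clear_bit(fd, fdt->open_fds);
        fd /= BITS_PER_LONG;
        if (test_bit(fd, fdt->full_fds_bits))
                __clear_bit(fd, fdt->full_fds_bits);
}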



# 52732bb9 17-Jul-2024 Yu Ma <[email protected]>

fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()

alloc_fd() has a sanity check inside to make sure the struct file mapped to
the allocated fd is NULL. Remove this sanity check since it is assured by the
existing zero initialization and by the NULL set when recycling an fd.
Meanwhile, add likely/unlikely annotations and avoid the expand_files() call
where possible to reduce the work done under file_lock.

Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
Signed-off-by: Yu Ma <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>



Revision tags: v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2
# cab05152 02-Jun-2024 Al Viro <[email protected]>

move close_range(2) into fs/file.c, fold __close_range() into it

We never had callers for __close_range() except for close_range(2)
itself. Nothing of that sort has appeared in four years and if any users
do show up, we can always separate those suckers again.

Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>



# 1fa4ffd8 01-Aug-2024 Al Viro <[email protected]>

close_files(): don't bother with xchg()

At that point nobody else has references to the victim files_struct;
as a matter of fact, the caller will free it immediately after
close_files() returns, with no RCU delays or anything of that sort.

That's why we are not protecting against fdtable reallocation on
expansion, not cleaning the bitmaps, etc. There's no point in
zeroing the pointers in ->fd[] either, let alone making that an
atomic operation.

Signed-off-by: Al Viro <[email protected]>
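
The effect of the change, sketched (simplified from close_files();
filp_close() is the real API):

/* before: atomically swap the pointer out, although nobody races
 *   struct file *file = xchg(&fdt->fd[i], NULL);
 * after: a plain load suffices, the whole table is about to be freed */
struct file *file = fdt->fd[i];

if (file)
        filp_close(file, files);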



# 8fd3395e 31-Jul-2024 Al Viro <[email protected]>

get rid of ...lookup...fdget_rcu() family

Once upon a time, predecessors of those used to do file lookup
without bumping a refcount, provided that caller held rcu_read_lock()
across the lookup and whatever it wanted to read from the struct
file found. When struct file allocation switched to SLAB_TYPESAFE_BY_RCU,
that stopped being feasible and these primitives started to bump the
file refcount for lookup result, requiring the caller to call fput()
afterwards.

But that turned them pointless - e.g.
        rcu_read_lock();
        file = lookup_fdget_rcu(fd);
        rcu_read_unlock();
is equivalent to
        file = fget_raw(fd);
and all callers of lookup_fdget_rcu() are of that form. Similarly,
task_lookup_fdget_rcu() calls can be replaced with calling fget_task().
task_lookup_next_fdget_rcu() doesn't have direct counterparts, but
its callers would be happier if we replaced it with an analogue that
deals with RCU internally.

Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>



# 85184982 15-Sep-2024 Wedson Almeida Filho <[email protected]>

rust: file: add Rust abstraction for `struct file`

This abstraction makes it possible to manipulate the open files for a
process. The new `File` struct wraps the C `struct file`. When accessing
it using the smart pointer `ARef<File>`, the pointer will own a
reference count to the file. When accessing it as `&File`, then the
reference does not own a refcount, but the borrow checker will ensure
that the reference count does not hit zero while the `&File` is live.

Since this is intended to manipulate the open files of a process, we
introduce an `fget` constructor that corresponds to the C `fget`
method. In future patches, it will become possible to create a new fd in
a process and bind it to a `File`. Rust Binder will use these to send
fds from one process to another.

We also provide a method for accessing the file's flags. Rust Binder
will use this to access the flags of the Binder fd to check whether the
non-blocking flag is set, which affects what the Binder ioctl does.

This introduces a struct for the EBADF error type, rather than just
using the Error type directly. This has two advantages:
* `File::fget` returns a `Result<ARef<File>, BadFdError>`, which the
compiler will represent as a single pointer, with null being an error.
This is possible because the compiler understands that `BadFdError`
has only one possible value, and it also understands that the
`ARef<File>` smart pointer is guaranteed non-null.
* Additionally, we promise to users of the method that the method can
only fail with EBADF, which means that they can rely on this promise
without having to inspect its implementation.
That said, there are also two disadvantages:
* Defining additional error types involves boilerplate.
* The question mark operator will only utilize the `From` trait once,
which prevents you from using the question mark operator on
`BadFdError` in methods that return some third error type that the
kernel `Error` is convertible into. (However, it works fine in methods
that return `Error`.)

Signed-off-by: Wedson Almeida Filho <[email protected]>
Co-developed-by: Daniel Xu <[email protected]>
Signed-off-by: Daniel Xu <[email protected]>
Co-developed-by: Alice Ryhl <[email protected]>
Reviewed-by: Benno Lossin <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Gary Guo <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>



# 678379e1 16-Aug-2024 Al Viro <[email protected]>

close_range(): fix the logics in descriptor table trimming

Cloning a descriptor table picks the size that would cover all currently
opened files. That's fine for clone() and unshare(), but for close_range()
there's an additional twist - we clone before we close, and it would be
a shame to have
close_range(3, ~0U, CLOSE_RANGE_UNSHARE)
leave us with a huge descriptor table when we are not going to keep
anything past stderr, just because some large file descriptor used to
be open before our call has taken it out.

Unfortunately, it had been dealt with in an inherently racy way -
sane_fdtable_size() gets a "don't copy anything past that" argument
(passed via unshare_fd() and dup_fd()), close_range() decides how much
should be trimmed and passes that to unshare_fd().

The problem is, a range that used to extend to the end of descriptor
table back when close_range() had looked at it might very well have stuff
grown after it by the time dup_fd() has allocated a new files_struct
and started to figure out the capacity of fdtable to be attached to that.

That leads to interesting pathological cases; at the very least it's a
QoI issue, since unshare(CLONE_FILES) is atomic in a sense that it takes
a snapshot of descriptor table one might have observed at some point.
Since CLOSE_RANGE_UNSHARE close_range() is supposed to be a combination
of unshare(CLONE_FILES) with plain close_range(), ending up with a
weird state that would never occur with unshare(2) is confusing, to put
it mildly.

It's not hard to get rid of - all it takes is passing both ends of the
range down to sane_fdtable_size(). There we are under ->files_lock,
so the race is trivially avoided.

So we do the following:
* switch close_files() from calling unshare_fd() to calling
dup_fd().
* undo the calling convention change done to unshare_fd() in
60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE"
* introduce struct fd_range, pass a pointer to that to dup_fd()
and sane_fdtable_size() instead of "trim everything past that point"
they are currently getting. NULL means "we are not going to be punching
any holes"; NR_OPEN_MAX is gone.
* make sane_fdtable_size() use find_last_bit() instead of
open-coding it; it's easier to follow that way.
* while we are at it, have dup_fd() report errors by returning
ERR_PTR(), no need to use a separate int *errorp argument.

Fixes: 60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE"
Cc: [email protected]
Signed-off-by: Al Viro <[email protected]>
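
The interface change in outline (a sketch matching the description
above; exact declarations in the tree may differ):

/* NULL means "no holes will be punched"; both bounds are inclusive */
struct fd_range {
        unsigned int from, to;
};

/* reports failure via ERR_PTR() instead of a separate int *errorp */
struct files_struct *dup_fd(struct files_struct *oldf,
                            struct fd_range *punch_hole);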



# 215ab0d8 03-Aug-2024 Joel Savitz <[email protected]>

file: remove outdated comment after close_fd()

Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: [email protected]

The comment on EXPORT_SYMBOL(close_fd) was added in commit 2ca2a09d6215
("fs: add ksys_close() wrapper; remove in-kernel calls to sys_close()"),
before commit 8760c909f54a ("file: Rename __close_fd to close_fd and remove
the files parameter") gave the function its current name, however commit
1572bfdf21d4 ("file: Replace ksys_close with close_fd") removes the
referenced caller entirely, obsoleting this comment.

Signed-off-by: Joel Savitz <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...

