|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14 |
|
| #
0fb48272 |
| 20-Mar-2025 |
Christian Brauner <[email protected]> |
pidfs: improve multi-threaded exec and premature thread-group leader exit polling
This is another attempt trying to make pidfd polling for multi-threaded exec and premature thread-group leader exit
pidfs: improve multi-threaded exec and premature thread-group leader exit polling
This is another attempt trying to make pidfd polling for multi-threaded exec and premature thread-group leader exit consistent.
A quick recap of these two cases:
(1) During a multi-threaded exec by a subthread, i.e., non-thread-group leader thread, all other threads in the thread-group including the thread-group leader are killed and the struct pid of the thread-group leader will be taken over by the subthread that called exec. IOW, two tasks change their TIDs.
(2) A premature thread-group leader exit means that the thread-group leader exited before all of the other subthreads in the thread-group have exited.
Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD. Any caller that holds a PIDFD_THREAD pidfd to the current thread-group leader may or may not see an exit notification on the file descriptor depending on when poll is performed. If the poll is performed before the exec of the subthread has concluded an exit notification is generated for the old thread-group leader. If the poll is performed after the exec of the subthread has concluded no exit notification is generated for the old thread-group leader.
The correct behavior would be to simply not generate an exit notification on the struct pid of a subhthread exec because the struct pid is taken over by the subthread and thus remains alive.
But this is difficult to handle because a thread-group may exit prematurely as mentioned in (2). In that case an exit notification is reliably generated but the subthreads may continue to run for an indeterminate amount of time and thus also may exec at some point.
So far there was no way to distinguish between (1) and (2) internally. This tiny series tries to address this problem by discarding PIDFD_THREAD notification on premature thread-group leader exit.
If that works correctly then no exit notifications are generated for a PIDFD_THREAD pidfd for a thread-group leader until all subthreads have been reaped. If a subthread should exec aftewards no exit notification will be generated until that task exits or it creates subthreads and repeates the cycle.
Co-Developed-by: Oleg Nesterov <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc7 |
|
| #
68db2727 |
| 16-Mar-2025 |
Christian Brauner <[email protected]> |
pidfs: ensure that PIDFS_INFO_EXIT is available
When we currently create a pidfd we check that the task hasn't been reaped right before we create the pidfd. But it is of course possible that by the
pidfs: ensure that PIDFS_INFO_EXIT is available
When we currently create a pidfd we check that the task hasn't been reaped right before we create the pidfd. But it is of course possible that by the time we return the pidfd to userspace the task has already been reaped since we don't check again after having created a dentry for it.
This was fine until now because that race was meaningless. But now that we provide PIDFD_INFO_EXIT it is a problem because it is possible that the kernel returns a reaped pidfd and it depends on the race whether PIDFD_INFO_EXIT information is available. This depends on if the task gets reaped before or after a dentry has been attached to struct pid.
Make this consistent and only returned pidfds for reaped tasks if PIDFD_INFO_EXIT information is available. This is done by performing another check whether the task has been reaped right after we attached a dentry to struct pid.
Since pidfs_exit() is called before struct pid's task linkage is removed the case where the task got reaped but a dentry was already attached to struct pid and exit information was recorded and published can be handled correctly. In that case we do return a pidfd for a reaped task like we would've before.
Link: https://lore.kernel.org/r/20250316-kabel-fehden-66bdb6a83436@brauner Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc6 |
|
| #
7477d7dc |
| 05-Mar-2025 |
Christian Brauner <[email protected]> |
pidfs: allow to retrieve exit information
Some tools like systemd's jounral need to retrieve the exit and cgroup information after a process has already been reaped. This can e.g., happen when retri
pidfs: allow to retrieve exit information
Some tools like systemd's jounral need to retrieve the exit and cgroup information after a process has already been reaped. This can e.g., happen when retrieving a pidfd via SCM_PIDFD or SCM_PEERPIDFD.
Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-6-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
45135229 |
| 05-Mar-2025 |
Christian Brauner <[email protected]> |
pidfs: record exit code and cgroupid at exit
Record the exit code and cgroupid in release_task() and stash in struct pidfs_exit_info so it can be retrieved even after the task has been reaped.
Link
pidfs: record exit code and cgroupid at exit
Record the exit code and cgroupid in release_task() and stash in struct pidfs_exit_info so it can be retrieved even after the task has been reaped.
Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-5-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
0b420038 |
| 05-Mar-2025 |
Christian Brauner <[email protected]> |
pidfs: use private inode slab cache
Introduce a private inode slab cache for pidfs. In follow-up patches pidfs will gain the ability to provide exit information to userspace after the task has been
pidfs: use private inode slab cache
Introduce a private inode slab cache for pidfs. In follow-up patches pidfs will gain the ability to provide exit information to userspace after the task has been reaped. This means storing exit information even after the task has already been released and struct pid's task linkage is gone. Store that information alongside the inode.
Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-4-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
3155a194 |
| 05-Mar-2025 |
Christian Brauner <[email protected]> |
pidfs: move setting flags into pidfs_alloc_file()
Instead od adding it into __pidfd_prepare() place it where the actual file allocation happens and update the outdated comment.
Link: https://lore.k
pidfs: move setting flags into pidfs_alloc_file()
Instead od adding it into __pidfd_prepare() place it where the actual file allocation happens and update the outdated comment.
Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-3-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
816b2e60 |
| 05-Mar-2025 |
Christian Brauner <[email protected]> |
pidfs: switch to copy_struct_to_user()
We have a helper that deals with all the required logic.
Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-1-c8c3d8361705@kernel.org R
pidfs: switch to copy_struct_to_user()
We have a helper that deals with all the required logic.
Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-1-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc5 |
|
| #
02cfe2b6 |
| 24-Feb-2025 |
Christian Brauner <[email protected]> |
pidfs: remove d_op->d_delete
Pidfs only deals with unhashed dentries and there's currently no way for them to become hashed. So remove d_op->d_delete.
Signed-off-by: Christian Brauner <brauner@kern
pidfs: remove d_op->d_delete
Pidfs only deals with unhashed dentries and there's currently no way for them to become hashed. So remove d_op->d_delete.
Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2 |
|
| #
091ee63e |
| 04-Feb-2025 |
Christian Brauner <[email protected]> |
pidfs: improve ioctl handling
Pidfs supports extensible and non-extensible ioctls. The extensible ioctls need to check for the ioctl number itself not just the ioctl command otherwise both backward-
pidfs: improve ioctl handling
Pidfs supports extensible and non-extensible ioctls. The extensible ioctls need to check for the ioctl number itself not just the ioctl command otherwise both backward- and forward compatibility are broken.
The pidfs ioctl handler also needs to look at the type of the ioctl command to guard against cases where "[...] a daemon receives some random file descriptor from a (potentially less privileged) client and expects the FD to be of some specific type, it might call ioctl() on this FD with some type-specific command and expect the call to fail if the FD is of the wrong type; but due to the missing type check, the kernel instead performs some action that userspace didn't expect." (cf. [1]]
Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/CAG48ez2K9A5GwtgqO31u9ZL292we8ZwAA=TJwwEv7wRuJ3j4Lw@mail.gmail.com [1] Fixes: 8ce352818820 ("pidfs: check for valid ioctl commands") Acked-by: Luca Boccassi <[email protected]> Reported-by: Jann Horn <[email protected]> Cc: [email protected] # v6.13; please backport with 8ce352818820 ("pidfs: check for valid ioctl commands") Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4 |
|
| #
ef4144ac |
| 19-Dec-2024 |
Christian Brauner <[email protected]> |
pidfs: allow bind-mounts
Allow bind-mounting pidfds. Similar to nsfs let's allow bind-mounts for pidfds. This allows pidfds to be safely recovered and checked for process recycling.
Link: https://l
pidfs: allow bind-mounts
Allow bind-mounting pidfds. Similar to nsfs let's allow bind-mounts for pidfds. This allows pidfds to be safely recovered and checked for process recycling.
Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc3 |
|
| #
16ecd47c |
| 14-Dec-2024 |
Christian Brauner <[email protected]> |
pidfs: lookup pid through rbtree
The new pid inode number allocation scheme is neat but I overlooked a possible, even though unlikely, attack that can be used to trigger an overflow on both 32bit an
pidfs: lookup pid through rbtree
The new pid inode number allocation scheme is neat but I overlooked a possible, even though unlikely, attack that can be used to trigger an overflow on both 32bit and 64bit.
An unique 64 bit identifier was constructed for each struct pid by two combining a 32 bit idr with a 32 bit generation number. A 32bit number was allocated using the idr_alloc_cyclic() infrastructure. When the idr wrapped around a 32 bit wraparound counter was incremented. The 32 bit wraparound counter served as the upper 32 bits and the allocated idr number as the lower 32 bits.
Since the idr can only allocate up to INT_MAX entries everytime a wraparound happens INT_MAX - 1 entries are lost (Ignoring that numbering always starts at 2 to avoid theoretical collisions with the root inode number.).
If userspace fully populates the idr such that and puts itself into control of two entries such that one entry is somewhere in the middle and the other entry is the INT_MAX entry then it is possible to overflow the wraparound counter. That is probably difficult to pull off but the mere possibility is annoying.
The problem could be contained to 32 bit by switching to a data structure such as the maple tree that allows allocating 64 bit numbers on 64 bit machines. That would leave 32 bit in a lurch but that probably doesn't matter that much. The other problem is that removing entries form the maple tree is somewhat non-trivial because the removal code can be called under the irq write lock of tasklist_lock and irq{save,restore} code.
Instead, allocate unique identifiers for struct pid by simply incrementing a 64 bit counter and insert each struct pid into the rbtree so it can be looked up to decode file handles avoiding to leak actual pids across pid namespaces in file handles.
On both 64 bit and 32 bit the same 64 bit identifier is used to lookup struct pid in the rbtree. On 64 bit the unique identifier for struct pid simply becomes the inode number. Comparing two pidfds continues to be as simple as comparing inode numbers.
On 32 bit the 64 bit number assigned to struct pid is split into two 32 bit numbers. The lower 32 bits are used as the inode number and the upper 32 bits are used as the inode generation number. Whenever a wraparound happens on 32 bit the 64 bit number will be incremented by 2 so inode numbering starts at 2 again.
When a wraparound happens on 32 bit multiple pidfds with the same inode number are likely to exist. This isn't a problem since before pidfs pidfds used the anonymous inode meaning all pidfds had the same inode number. On 32 bit sserspace can thus reconstruct the 64 bit identifier by retrieving both the inode number and the inode generation number to compare, or use file handles. This gives the same guarantees on both 32 bit and 64 bit.
Link: https://lore.kernel.org/r/20241214-gekoppelt-erdarbeiten-a1f9a982a5a6@brauner Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc2, v6.13-rc1 |
|
| #
8ce35281 |
| 29-Nov-2024 |
Christian Brauner <[email protected]> |
pidfs: check for valid ioctl commands
Prior to doing any work, check whether the provided ioctl command is supported by pidfs.
Signed-off-by: Christian Brauner <[email protected]>
|
| #
b3caba8f |
| 29-Nov-2024 |
Christian Brauner <[email protected]> |
pidfs: implement file handle support
On 64-bit platforms, userspace can read the pidfd's inode in order to get a never-repeated PID identifier. On 32-bit platforms this identifier is not exposed, as
pidfs: implement file handle support
On 64-bit platforms, userspace can read the pidfd's inode in order to get a never-repeated PID identifier. On 32-bit platforms this identifier is not exposed, as inodes are limited to 32 bits. Instead expose the identifier via export_fh, which makes it available to userspace via name_to_handle_at.
In addition we implement fh_to_dentry, which allows userspace to recover a pidfd from a pidfs file handle.
Signed-off-by: Erin Shepherd <[email protected]> [brauner: patch heavily rewritten] Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Amir Goldstein <[email protected]> Co-Developed-by: Christian Brauner <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
230536ff |
| 29-Nov-2024 |
Christian Brauner <[email protected]> |
pidfs: support FS_IOC_GETVERSION
This will allow 32 bit userspace to detect when a given inode number has been recycled and also to construct a unique 64 bit identifier.
Link: https://lore.kernel.o
pidfs: support FS_IOC_GETVERSION
This will allow 32 bit userspace to detect when a given inode number has been recycled and also to construct a unique 64 bit identifier.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Amir Goldstein <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
03c212bf |
| 29-Nov-2024 |
Christian Brauner <[email protected]> |
pidfs: remove 32bit inode number handling
Now that we have a unified inode number handling model remove the custom ida-based allocation for 32bit.
Link: https://lore.kernel.org/r/20241129-work-pidf
pidfs: remove 32bit inode number handling
Now that we have a unified inode number handling model remove the custom ida-based allocation for 32bit.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Amir Goldstein <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
9698d5a4 |
| 29-Nov-2024 |
Christian Brauner <[email protected]> |
pidfs: rework inode number allocation
Recently we received a patchset that aims to enable file handle encoding and decoding via name_to_handle_at(2) and open_by_handle_at(2).
A crucical step in the
pidfs: rework inode number allocation
Recently we received a patchset that aims to enable file handle encoding and decoding via name_to_handle_at(2) and open_by_handle_at(2).
A crucical step in the patch series is how to go from inode number to struct pid without leaking information into unprivileged contexts. The issue is that in order to find a struct pid the pid number in the initial pid namespace must be encoded into the file handle via name_to_handle_at(2). This can be used by containers using a separate pid namespace to learn what the pid number of a given process in the initial pid namespace is. While this is a weak information leak it could be used in various exploits and in general is an ugly wart in the design.
To solve this problem a new way is needed to lookup a struct pid based on the inode number allocated for that struct pid. The other part is to remove the custom inode number allocation on 32bit systems that is also an ugly wart that should go away.
So, a new scheme is used that I was discusssing with Tejun some time back. A cyclic ida is used for the lower 32 bits and a the high 32 bits are used for the generation number. This gives a 64 bit inode number that is unique on both 32 bit and 64 bit. The lower 32 bit number is recycled slowly and can be used to lookup struct pids.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Amir Goldstein <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3 |
|
| #
cdda1f26 |
| 10-Oct-2024 |
Luca Boccassi <[email protected]> |
pidfd: add ioctl to retrieve pid info
A common pattern when using pid fds is having to get information about the process, which currently requires /proc being mounted, resolving the fd to a pid, and
pidfd: add ioctl to retrieve pid info
A common pattern when using pid fds is having to get information about the process, which currently requires /proc being mounted, resolving the fd to a pid, and then do manual string parsing of /proc/N/status and friends. This needs to be reimplemented over and over in all userspace projects (e.g.: I have reimplemented resolving in systemd, dbus, dbus-daemon, polkit so far), and requires additional care in checking that the fd is still valid after having parsed the data, to avoid races.
Having a programmatic API that can be used directly removes all these requirements, including having /proc mounted.
As discussed at LPC24, add an ioctl with an extensible struct so that more parameters can be added later if needed. Start with returning pid/tgid/ppid and creds unconditionally, and cgroupid optionally.
Signed-off-by: Luca Boccassi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc2, v6.12-rc1 |
|
| #
8a460677 |
| 26-Sep-2024 |
Christian Brauner <[email protected]> |
pidfs: check for valid pid namespace
When we access a no-current task's pid namespace we need check that the task hasn't been reaped in the meantime and it's pid namespace isn't accessible anymore.
pidfs: check for valid pid namespace
When we access a no-current task's pid namespace we need check that the task hasn't been reaped in the meantime and it's pid namespace isn't accessible anymore.
The user namespace is fine because it is only released when the last reference to struct task_struct is put and exit_creds() is called.
Link: https://lore.kernel.org/r/20240926-klebt-altgedienten-0415ad4d273c@brauner Fixes: 5b08bd408534 ("pidfs: allow retrieval of namespace file descriptors") CC: [email protected] # v6.11 Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1 |
|
| #
9b3e1504 |
| 22-Jul-2024 |
Christian Brauner <[email protected]> |
pidfs: handle kernels without namespaces cleanly
The nsproxy structure contains nearly all of the namespaces associated with a task. When a given namespace type is not supported by this kernel the r
pidfs: handle kernels without namespaces cleanly
The nsproxy structure contains nearly all of the namespaces associated with a task. When a given namespace type is not supported by this kernel the rules whether the corresponding pointer in struct nsproxy is NULL or always init_<ns_type>_ns differ per namespace. Ideally, that wouldn't be the case and for all namespace types we'd always set it to init_<ns_type>_ns when the corresponding namespace type isn't supported.
Make sure we handle all namespaces where the pointer in struct nsproxy can be NULL when the namespace type isn't supported.
Link: https://lore.kernel.org/r/20240722-work-pidfs-e6a83030f63e@brauner Fixes: 5b08bd408534 ("pidfs: allow retrieval of namespace file descriptors") # mainline only Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
f60d38cb |
| 21-Jul-2024 |
Edward Adam Davis <[email protected]> |
pidfs: when time ns disabled add check for ioctl
syzbot call pidfd_ioctl() with cmd "PIDFD_GET_TIME_NAMESPACE" and disabled CONFIG_TIME_NS, since time_ns is NULL, it will make NULL ponter deref in o
pidfs: when time ns disabled add check for ioctl
syzbot call pidfd_ioctl() with cmd "PIDFD_GET_TIME_NAMESPACE" and disabled CONFIG_TIME_NS, since time_ns is NULL, it will make NULL ponter deref in open_namespace.
Fixes: 5b08bd408534 ("pidfs: allow retrieval of namespace file descriptors") # mainline only Reported-and-tested-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=34a0ee986f61f15da35d Signed-off-by: Edward Adam Davis <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.10, v6.10-rc7, v6.10-rc6 |
|
| #
5b08bd40 |
| 27-Jun-2024 |
Christian Brauner <[email protected]> |
pidfs: allow retrieval of namespace file descriptors
For users that hold a reference to a pidfd procfs might not even be available nor is it desirable to parse through procfs just for the sake of ge
pidfs: allow retrieval of namespace file descriptors
For users that hold a reference to a pidfd procfs might not even be available nor is it desirable to parse through procfs just for the sake of getting namespace file descriptors for a process.
Make it possible to directly retrieve namespace file descriptors from a pidfd. Pidfds already can be used with setns() to change a set of namespaces atomically.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Alexander Mikhalitsyn <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1 |
|
| #
db3d841a |
| 21-May-2024 |
Linus Torvalds <[email protected]> |
fs/pidfs: make 'lsof' happy with our inode changes
pidfs started using much saner inodes in commit b28ddcc32d8f ("pidfs: convert to path_from_stashed() helper"), but that exposed the fact that lsof
fs/pidfs: make 'lsof' happy with our inode changes
pidfs started using much saner inodes in commit b28ddcc32d8f ("pidfs: convert to path_from_stashed() helper"), but that exposed the fact that lsof had some knowledge of just how odd our old anon_inode usage was.
For example, legacy anon_inodes hadn't even initialized the inode type in the inode mode, so everything had a type of zero.
So sane tools like 'stat' would report these files as "weird file", but 'lsof' instead used that (together with the name of the link in proc) to notice that it's an anonymous inode, and used it to detect pidfd files.
Let's keep our internal new sane inode model, but mask the file type bits at 'stat()' time in the getattr() function we already have, and by making the dentry name match what lsof expects too.
This keeps our internal models sane, but should make user space see the same old odd behavior.
Reported-by: Jiri Slaby <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Link: https://github.com/lsof-org/lsof/issues/317 Cc: Alexander Viro <[email protected]> Cc: Seth Forshee <[email protected]> Cc: Tycho Andersen <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1 |
|
| #
9d9539db |
| 12-Mar-2024 |
Christian Brauner <[email protected]> |
pidfs: remove config option
As Linus suggested this enables pidfs unconditionally. A key property to retain is the ability to compare pidfds by inode number (cf. [1]). That's extremely helpful just
pidfs: remove config option
As Linus suggested this enables pidfs unconditionally. A key property to retain is the ability to compare pidfds by inode number (cf. [1]). That's extremely helpful just as comparing namespace file descriptors by inode number is. They are used in a variety of scenarios where they need to be compared, e.g., when receiving a pidfd via SO_PEERPIDFD from a socket to trivially authenticate a the sender and various other use-cases.
For 64bit systems this is pretty trivial to do. For 32bit it's slightly more annoying as we discussed but we simply add a dumb ida based allocator that gets used on 32bit. This gives the same guarantees about inode numbers on 64bit without any overflow risk. Practically, we'll never run into overflow issues because we're constrained by the number of processes that can exist on 32bit and by the number of open files that can exist on a 32bit system. On 64bit none of this matters and things are very simple.
If 32bit also needs the uniqueness guarantee they can simply parse the contents of /proc/<pid>/fd/<nr>. The uniqueness guarantees have a variety of use-cases. One of the most obvious ones is that they will make pidfiles (or "pidfdfiles", I guess) reliable as the unique identifier can be placed into there that won't be reycled. Also a frequent request.
Note, I took the chance and simplified path_from_stashed() even further. Instead of passing the inode number explicitly to path_from_stashed() we let the filesystem handle that internally. So path_from_stashed() ends up even simpler than it is now. This is also a good solution allowing the cleanup code to be clean and consistent between 32bit and 64bit. The cleanup path in prepare_anon_dentry() is also switched around so we put the inode before the dentry allocation. This means we only have to call the cleanup handler for the filesystem's inode data once and can rely ->evict_inode() otherwise.
Aside from having to have a bit of extra code for 32bit it actually ends up a nice cleanup for path_from_stashed() imho.
Tested on both 32 and 64bit including error injection.
Link: https://github.com/systemd/systemd/pull/31713 [1] Link: https://lore.kernel.org/r/20240312-dingo-sehnlich-b3ecc35c6de7@brauner Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v6.8, v6.8-rc7 |
|
| #
e9c5263c |
| 01-Mar-2024 |
Christian Brauner <[email protected]> |
libfs: improve path_from_stashed()
Right now we pass a bunch of info that is fs specific which doesn't make a lot of sense and it bleeds fs sepcific details into the generic helper. nsfs and pidfs h
libfs: improve path_from_stashed()
Right now we pass a bunch of info that is fs specific which doesn't make a lot of sense and it bleeds fs sepcific details into the generic helper. nsfs and pidfs have slightly different needs when initializing inodes. Add simple operations that are stashed in sb->s_fs_info that both can implement. This also allows us to get rid of cleaning up references in the caller. All in all path_from_stashed() becomes way simpler.
Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc6 |
|
| #
2558e3b2 |
| 21-Feb-2024 |
Christian Brauner <[email protected]> |
libfs: add stashed_dentry_prune()
Both pidfs and nsfs use a memory location to stash a dentry for reuse by concurrent openers. Right now two custom dentry->d_prune::{ns,pidfs}_prune_dentry() methods
libfs: add stashed_dentry_prune()
Both pidfs and nsfs use a memory location to stash a dentry for reuse by concurrent openers. Right now two custom dentry->d_prune::{ns,pidfs}_prune_dentry() methods are needed that do the same thing. The only thing that differs is that they need to get to the memory location to store or retrieve the dentry from differently. Fix that by remember the stashing location for the dentry in dentry->d_fsdata which allows us to retrieve it in dentry->d_prune. That in turn makes it possible to add a common helper that pidfs and nsfs can both use.
Link: https://lore.kernel.org/r/CAHk-=wg8cHY=i3m6RnXQ2Y2W8psicKWQEZq1=94ivUiviM-0OA@mail.gmail.com Signed-off-by: Christian Brauner <[email protected]>
show more ...
|