|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2 |
|
| #
7903f907 |
| 06-Feb-2025 |
Mateusz Guzik <[email protected]> |
pid: perform free_pid() calls outside of tasklist_lock
As the clone side already executes pid allocation with only pidmap_lock held, issuing free_pid() while still holding tasklist_lock exacerbates
pid: perform free_pid() calls outside of tasklist_lock
As the clone side already executes pid allocation with only pidmap_lock held, issuing free_pid() while still holding tasklist_lock exacerbates total hold time of the latter.
More things may show up later which require initial clean up with the lock held and allow finishing without it. For that reason a struct to collect such work is added instead of merely passing the pid array.
Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Acked-by: "Liam R. Howlett" <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3 |
|
| #
16ecd47c |
| 14-Dec-2024 |
Christian Brauner <[email protected]> |
pidfs: lookup pid through rbtree
The new pid inode number allocation scheme is neat but I overlooked a possible, even though unlikely, attack that can be used to trigger an overflow on both 32bit an
pidfs: lookup pid through rbtree
The new pid inode number allocation scheme is neat but I overlooked a possible, even though unlikely, attack that can be used to trigger an overflow on both 32bit and 64bit.
An unique 64 bit identifier was constructed for each struct pid by two combining a 32 bit idr with a 32 bit generation number. A 32bit number was allocated using the idr_alloc_cyclic() infrastructure. When the idr wrapped around a 32 bit wraparound counter was incremented. The 32 bit wraparound counter served as the upper 32 bits and the allocated idr number as the lower 32 bits.
Since the idr can only allocate up to INT_MAX entries everytime a wraparound happens INT_MAX - 1 entries are lost (Ignoring that numbering always starts at 2 to avoid theoretical collisions with the root inode number.).
If userspace fully populates the idr such that and puts itself into control of two entries such that one entry is somewhere in the middle and the other entry is the INT_MAX entry then it is possible to overflow the wraparound counter. That is probably difficult to pull off but the mere possibility is annoying.
The problem could be contained to 32 bit by switching to a data structure such as the maple tree that allows allocating 64 bit numbers on 64 bit machines. That would leave 32 bit in a lurch but that probably doesn't matter that much. The other problem is that removing entries form the maple tree is somewhat non-trivial because the removal code can be called under the irq write lock of tasklist_lock and irq{save,restore} code.
Instead, allocate unique identifiers for struct pid by simply incrementing a 64 bit counter and insert each struct pid into the rbtree so it can be looked up to decode file handles avoiding to leak actual pids across pid namespaces in file handles.
On both 64 bit and 32 bit the same 64 bit identifier is used to lookup struct pid in the rbtree. On 64 bit the unique identifier for struct pid simply becomes the inode number. Comparing two pidfds continues to be as simple as comparing inode numbers.
On 32 bit the 64 bit number assigned to struct pid is split into two 32 bit numbers. The lower 32 bits are used as the inode number and the upper 32 bits are used as the inode generation number. Whenever a wraparound happens on 32 bit the 64 bit number will be incremented by 2 so inode numbering starts at 2 again.
When a wraparound happens on 32 bit multiple pidfds with the same inode number are likely to exist. This isn't a problem since before pidfs pidfds used the anonymous inode meaning all pidfds had the same inode number. On 32 bit sserspace can thus reconstruct the 64 bit identifier by retrieving both the inode number and the inode generation number to compare, or use file handles. This gives the same guarantees on both 32 bit and 64 bit.
Link: https://lore.kernel.org/r/20241214-gekoppelt-erdarbeiten-a1f9a982a5a6@brauner Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc2, v6.13-rc1 |
|
| #
7863dcc7 |
| 22-Nov-2024 |
Christian Brauner <[email protected]> |
pid: allow pid_max to be set per pid namespace
The pid_max sysctl is a global value. For a long time the default value has been 65535 and during the pidfd dicussions Linus proposed to bump pid_max b
pid: allow pid_max to be set per pid namespace
The pid_max sysctl is a global value. For a long time the default value has been 65535 and during the pidfd dicussions Linus proposed to bump pid_max by default (cf. [1]). Based on this discussion systemd started bumping pid_max to 2^22. So all new systems now run with a very high pid_max limit with some distros having also backported that change. The decision to bump pid_max is obviously correct. It just doesn't make a lot of sense nowadays to enforce such a low pid number. There's sufficient tooling to make selecting specific processes without typing really large pid numbers available.
In any case, there are workloads that have expections about how large pid numbers they accept. Either for historical reasons or architectural reasons. One concreate example is the 32-bit version of Android's bionic libc which requires pid numbers less than 65536. There are workloads where it is run in a 32-bit container on a 64-bit kernel. If the host has a pid_max value greater than 65535 the libc will abort thread creation because of size assumptions of pthread_mutex_t.
That's a fairly specific use-case however, in general specific workloads that are moved into containers running on a host with a new kernel and a new systemd can run into issues with large pid_max values. Obviously making assumptions about the size of the allocated pid is suboptimal but we have userspace that does it.
Of course, giving containers the ability to restrict the number of processes in their respective pid namespace indepent of the global limit through pid_max is something desirable in itself and comes in handy in general.
Independent of motivating use-cases the existence of pid namespaces makes this also a good semantical extension and there have been prior proposals pushing in a similar direction. The trick here is to minimize the risk of regressions which I think is doable. The fact that pid namespaces are hierarchical will help us here.
What we mostly care about is that when the host sets a low pid_max limit, say (crazy number) 100 that no descendant pid namespace can allocate a higher pid number in its namespace. Since pid allocation is hierarchial this can be ensured by checking each pid allocation against the pid namespace's pid_max limit. This means if the allocation in the descendant pid namespace succeeds, the ancestor pid namespace can reject it. If the ancestor pid namespace has a higher limit than the descendant pid namespace the descendant pid namespace will reject the pid allocation. The ancestor pid namespace will obviously not care about this. All in all this means pid_max continues to enforce a system wide limit on the number of processes but allows pid namespaces sufficient leeway in handling workloads with assumptions about pid values and allows containers to restrict the number of processes in a pid namespace through the pid_max interface.
[1]: https://lore.kernel.org/linux-api/CAHk-=wiZ40LVjnXSi9iHLE_-ZBsWFGCgdmNiYZUXn1-V5YBg2g@mail.gmail.com - rebased from 5.14-rc1 - a few fixes (missing ns_free_inum on error path, missing initialization, etc) - permission check changes in pid_table_root_permissions - unsigned int pid_max -> int pid_max (keep pid_max type as it was) - add READ_ONCE in alloc_pid() as suggested by Christian - rebased from 6.7 and take into account: * sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table) * sysctl: treewide: constify ctl_table_header::ctl_table_arg * pidfd: add pidfs * tracing: Move saved_cmdline code into trace_sched_switch.c
Signed-off-by: Alexander Mikhalitsyn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1 |
|
| #
9d9539db |
| 12-Mar-2024 |
Christian Brauner <[email protected]> |
pidfs: remove config option
As Linus suggested this enables pidfs unconditionally. A key property to retain is the ability to compare pidfds by inode number (cf. [1]). That's extremely helpful just
pidfs: remove config option
As Linus suggested this enables pidfs unconditionally. A key property to retain is the ability to compare pidfds by inode number (cf. [1]). That's extremely helpful just as comparing namespace file descriptors by inode number is. They are used in a variety of scenarios where they need to be compared, e.g., when receiving a pidfd via SO_PEERPIDFD from a socket to trivially authenticate a the sender and various other use-cases.
For 64bit systems this is pretty trivial to do. For 32bit it's slightly more annoying as we discussed but we simply add a dumb ida based allocator that gets used on 32bit. This gives the same guarantees about inode numbers on 64bit without any overflow risk. Practically, we'll never run into overflow issues because we're constrained by the number of processes that can exist on 32bit and by the number of open files that can exist on a 32bit system. On 64bit none of this matters and things are very simple.
If 32bit also needs the uniqueness guarantee they can simply parse the contents of /proc/<pid>/fd/<nr>. The uniqueness guarantees have a variety of use-cases. One of the most obvious ones is that they will make pidfiles (or "pidfdfiles", I guess) reliable as the unique identifier can be placed into there that won't be reycled. Also a frequent request.
Note, I took the chance and simplified path_from_stashed() even further. Instead of passing the inode number explicitly to path_from_stashed() we let the filesystem handle that internally. So path_from_stashed() ends up even simpler than it is now. This is also a good solution allowing the cleanup code to be clean and consistent between 32bit and 64bit. The cleanup path in prepare_anon_dentry() is also switched around so we put the inode before the dentry allocation. This means we only have to call the cleanup handler for the filesystem's inode data once and can rely ->evict_inode() otherwise.
Aside from having to have a bit of extra code for 32bit it actually ends up a nice cleanup for path_from_stashed() imho.
Tested on both 32 and 64bit including error injection.
Link: https://github.com/systemd/systemd/pull/31713 [1] Link: https://lore.kernel.org/r/20240312-dingo-sehnlich-b3ecc35c6de7@brauner Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v6.8, v6.8-rc7, v6.8-rc6 |
|
| #
b28ddcc3 |
| 19-Feb-2024 |
Christian Brauner <[email protected]> |
pidfs: convert to path_from_stashed() helper
Moving pidfds from the anonymous inode infrastructure to a separate tiny in-kernel filesystem similar to sockfs, pipefs, and anon_inodefs causes selinux
pidfs: convert to path_from_stashed() helper
Moving pidfds from the anonymous inode infrastructure to a separate tiny in-kernel filesystem similar to sockfs, pipefs, and anon_inodefs causes selinux denials and thus various userspace components that make heavy use of pidfds to fail as pidfds used anon_inode_getfile() which aren't subject to any LSM hooks. But dentry_open() is and that would cause regressions.
The failures that are seen are selinux denials. But the core failure is dbus-broker. That cascades into other services failing that depend on dbus-broker. For example, when dbus-broker fails to start polkit and all the others won't be able to work because they depend on dbus-broker.
The reason for dbus-broker failing is because it doesn't handle failures for SO_PEERPIDFD correctly. Last kernel release we introduced SO_PEERPIDFD (and SCM_PIDFD). SO_PEERPIDFD allows dbus-broker and polkit and others to receive a pidfd for the peer of an AF_UNIX socket. This is the first time in the history of Linux that we can safely authenticate clients in a race-free manner.
dbus-broker immediately made use of this but messed up the error checking. It only allowed EINVAL as a valid failure for SO_PEERPIDFD. That's obviously problematic not just because of LSM denials but because of seccomp denials that would prevent SO_PEERPIDFD from working; or any other new error code from there.
So this is catching a flawed implementation in dbus-broker as well. It has to fallback to the old pid-based authentication when SO_PEERPIDFD doesn't work no matter the reasons otherwise it'll always risk such failures. So overall that LSM denial should not have caused dbus-broker to fail. It can never assume that a feature released one kernel ago like SO_PEERPIDFD can be assumed to be available.
So, the next fix separate from the selinux policy update is to try and fix dbus-broker at [3]. That should make it into Fedora as well. In addition the selinux reference policy should also be updated. See [4] for that. If Selinux is in enforcing mode in userspace and it encounters anything that it doesn't know about it will deny it by default. And the policy is entirely in userspace including declaring new types for stuff like nsfs or pidfs to allow it.
For now we continue to raise S_PRIVATE on the inode if it's a pidfs inode which means things behave exactly like before.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2265630 Link: https://github.com/fedora-selinux/selinux-policy/pull/2050 Link: https://github.com/bus1/dbus-broker/pull/343 [3] Link: https://github.com/SELinuxProject/refpolicy/pull/762 [4] Reported-by: Nathan Chancellor <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/20240218-neufahrzeuge-brauhaus-fb0eb6459771@brauner Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc5 |
|
| #
cb12fd8e |
| 12-Feb-2024 |
Christian Brauner <[email protected]> |
pidfd: add pidfs
This moves pidfds from the anonymous inode infrastructure to a tiny pseudo filesystem. This has been on my todo for quite a while as it will unblock further work that we weren't abl
pidfd: add pidfs
This moves pidfds from the anonymous inode infrastructure to a tiny pseudo filesystem. This has been on my todo for quite a while as it will unblock further work that we weren't able to do simply because of the very justified limitations of anonymous inodes. Moving pidfds to a tiny pseudo filesystem allows:
* statx() on pidfds becomes useful for the first time. * pidfds can be compared simply via statx() and then comparing inode numbers. * pidfds have unique inode numbers for the system lifetime. * struct pid is now stashed in inode->i_private instead of file->private_data. This means it is now possible to introduce concepts that operate on a process once all file descriptors have been closed. A concrete example is kill-on-last-close. * file->private_data is freed up for per-file options for pidfds. * Each struct pid will refer to a different inode but the same struct pid will refer to the same inode if it's opened multiple times. In contrast to now where each struct pid refers to the same inode. Even if we were to move to anon_inode_create_getfile() which creates new inodes we'd still be associating the same struct pid with multiple different inodes.
The tiny pseudo filesystem is not visible anywhere in userspace exactly like e.g., pipefs and sockfs. There's no lookup, there's no complex inode operations, nothing. Dentries and inodes are always deleted when the last pidfd is closed.
We allocate a new inode for each struct pid and we reuse that inode for all pidfds. We use iget_locked() to find that inode again based on the inode number which isn't recycled. We allocate a new dentry for each pidfd that uses the same inode. That is similar to anonymous inodes which reuse the same inode for thousands of dentries. For pidfds we're talking way less than that. There usually won't be a lot of concurrent openers of the same struct pid. They can probably often be counted on two hands. I know that systemd does use separate pidfd for the same struct pid for various complex process tracking issues. So I think with that things actually become way simpler. Especially because we don't have to care about lookup. Dentries and inodes continue to be always deleted.
The code is entirely optional and fairly small. If it's not selected we fallback to anonymous inodes. Heavily inspired by nsfs which uses a similar stashing mechanism just for namespaces.
Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc4, v6.8-rc3 |
|
| #
64bef697 |
| 31-Jan-2024 |
Oleg Nesterov <[email protected]> |
pidfd: implement PIDFD_THREAD flag for pidfd_open()
With this flag:
- pidfd_open() doesn't require that the target task must be a thread-group leader
- pidfd_poll() succeeds when the task exi
pidfd: implement PIDFD_THREAD flag for pidfd_open()
With this flag:
- pidfd_open() doesn't require that the target task must be a thread-group leader
- pidfd_poll() succeeds when the task exits and becomes a zombie (iow, passes exit_notify()), even if it is a leader and thread-group is not empty.
This means that the behaviour of pidfd_poll(PIDFD_THREAD, pid-of-group-leader) is not well defined if it races with exec() from its sub-thread; pidfd_poll() can succeed or not depending on whether pidfd_task_exited() is called before or after exchange_tids().
Perhaps we can improve this behaviour later, pidfd_poll() can probably take sig->group_exec_task into account. But this doesn't really differ from the case when the leader exits before other threads (so pidfd_poll() succeeds) and then another thread execs and pidfd_poll() will block again.
thread_group_exited() is no longer used, perhaps it can die.
Co-developed-by: Tycho Andersen <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Tested-by: Tycho Andersen <[email protected]> Reviewed-by: Tycho Andersen <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc2 |
|
| #
cdefbf23 |
| 25-Jan-2024 |
Oleg Nesterov <[email protected]> |
pidfd: cleanup the usage of __pidfd_prepare's flags
- make pidfd_create() static.
- Don't pass O_RDWR | O_CLOEXEC to __pidfd_prepare() in copy_process(), __pidfd_prepare() adds these flags uncond
pidfd: cleanup the usage of __pidfd_prepare's flags
- make pidfd_create() static.
- Don't pass O_RDWR | O_CLOEXEC to __pidfd_prepare() in copy_process(), __pidfd_prepare() adds these flags unconditionally.
- Kill the flags check in __pidfd_prepare(). sys_pidfd_open() checks the flags itself, all other users of pidfd_prepare() pass flags = 0.
If we need a sanity check for those other in kernel users then WARN_ON_ONCE(flags & ~PIDFD_NONBLOCK) makes more sense.
- Don't pass O_RDWR to get_unused_fd_flags(), it ignores everything except O_CLOEXEC.
- Don't pass O_CLOEXEC to anon_inode_getfile(), it ignores everything except O_ACCMODE | O_NONBLOCK.
Signed-off-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6 |
|
| #
f551103c |
| 15-Dec-2023 |
Kent Overstreet <[email protected]> |
sched.h: move pid helpers to pid.h
This is needed for killing the sched.h dependency on rcupdate.h, and pid.h is a better place for this code anyways.
Signed-off-by: Kent Overstreet <kent.overstree
sched.h: move pid helpers to pid.h
This is needed for killing the sched.h dependency on rcupdate.h, and pid.h is a better place for this code anyways.
Signed-off-by: Kent Overstreet <[email protected]>
show more ...
|
| #
6d5e9d63 |
| 11-Dec-2023 |
Kent Overstreet <[email protected]> |
pid: Split out pid_types.h
Trimming down sched.h dependencies: we dont't want to include more than the base types.
Cc: Kees Cook <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc
pid: Split out pid_types.h
Trimming down sched.h dependencies: we dont't want to include more than the base types.
Cc: Kees Cook <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Will Drewry <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
show more ...
|
|
Revision tags: v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1 |
|
| #
b69f0aeb |
| 30-Jun-2023 |
Kees Cook <[email protected]> |
pid: Replace struct pid 1-element array with flex-array
For pid namespaces, struct pid uses a dynamically sized array member, "numbers". This was implemented using the ancient 1-element fake flexib
pid: Replace struct pid 1-element array with flex-array
For pid namespaces, struct pid uses a dynamically sized array member, "numbers". This was implemented using the ancient 1-element fake flexible array, which has been deprecated for decades.
Replace it with a C99 flexible array, refactor the array size calculations to use struct_size(), and address elements via indexes. Note that the static initializer (which defines a single element) works as-is, and requires no special handling.
Without this, CONFIG_UBSAN_BOUNDS (and potentially CONFIG_FORTIFY_SOURCE) will trigger bounds checks:
https://lore.kernel.org/lkml/20230517-bushaltestelle-super-e223978c1ba6@brauner
Cc: Christian Brauner <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Daniel Verkamp <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Frederic Weisbecker <[email protected]> Reported-by: [email protected] [brauner: dropped unrelated changes and remove 0 with NULL cast] Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5 |
|
| #
6ae930d9 |
| 27-Mar-2023 |
Christian Brauner <[email protected]> |
pid: add pidfd_prepare()
Add a new helper that allows to reserve a pidfd and allocates a new pidfd file that stashes the provided struct pid. This will allow us to remove places that either open cod
pid: add pidfd_prepare()
Add a new helper that allows to reserve a pidfd and allocates a new pidfd file that stashes the provided struct pid. This will allow us to remove places that either open code this function or that call pidfd_create() but then have to call close_fd() because there are still failure points after pidfd_create() has been called.
Reviewed-by: Jan Kara <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6 |
|
| #
e9bdcdbf |
| 11-Oct-2021 |
Christian Brauner <[email protected]> |
pid: add pidfd_get_task() helper
The number of system calls making use of pidfds is constantly increasing. Some of those new system calls duplicate the code to turn a pidfd into task_struct it refer
pid: add pidfd_get_task() helper
The number of system calls making use of pidfds is constantly increasing. Some of those new system calls duplicate the code to turn a pidfd into task_struct it refers to. Give them a simple helper for this.
Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Cc: Vlastimil Babka <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Matthew Bobrowski <[email protected]> Cc: Alexander Duyck <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jan Kara <[email protected]> Cc: Minchan Kim <[email protected]> Reviewed-by: Matthew Bobrowski <[email protected]> Acked-by: David Hildenbrand <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5 |
|
| #
c576e0fc |
| 08-Aug-2021 |
Matthew Bobrowski <[email protected]> |
kernel/pid.c: remove static qualifier from pidfd_create()
With the idea of returning pidfds from the fanotify API, we need to expose a mechanism for creating pidfds. We drop the static qualifier fro
kernel/pid.c: remove static qualifier from pidfd_create()
With the idea of returning pidfds from the fanotify API, we need to expose a mechanism for creating pidfds. We drop the static qualifier from pidfd_create() and add its declaration to linux/pid.h so that the pidfd_create() helper can be called from other kernel subsystems i.e. fanotify.
Link: https://lore.kernel.org/r/0c68653ec32f1b7143301f0231f7ed14062fd82b.1628398044.git.repnop@google.com Signed-off-by: Matthew Bobrowski <[email protected]> Acked-by: Christian Brauner <[email protected]> Signed-off-by: Jan Kara <[email protected]>
show more ...
|
|
Revision tags: v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1 |
|
| #
1aa92cd3 |
| 17-Oct-2020 |
Minchan Kim <[email protected]> |
pid: move pidfd_get_pid() to pid.c
process_madvise syscall needs pidfd_get_pid function to translate pidfd to pid so this patch move the function to kernel/pid.c.
Suggested-by: Alexander Duyck <ale
pid: move pidfd_get_pid() to pid.c
process_madvise syscall needs pidfd_get_pid function to translate pidfd to pid so this patch move the function to kernel/pid.c.
Suggested-by: Alexander Duyck <[email protected]> Signed-off-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Alexander Duyck <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Acked-by: Christian Brauner <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jann Horn <[email protected]> Cc: Brian Geffon <[email protected]> Cc: Daniel Colascione <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Dias <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oleksandr Natalenko <[email protected]> Cc: Sandeep Patil <[email protected]> Cc: SeongJae Park <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Sonny Rao <[email protected]> Cc: Tim Murray <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Florian Weimer <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1, v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5, v5.8-rc4, v5.8-rc3, v5.8-rc2, v5.8-rc1, v5.7, v5.7-rc7, v5.7-rc6, v5.7-rc5, v5.7-rc4, v5.7-rc3, v5.7-rc2 |
|
| #
6b03d130 |
| 19-Apr-2020 |
Eric W. Biederman <[email protected]> |
proc: Ensure we see the exit of each process tid exactly once
When the thread group leader changes during exec and the old leaders thread is reaped proc_flush_pid will flush the dentries for the ent
proc: Ensure we see the exit of each process tid exactly once
When the thread group leader changes during exec and the old leaders thread is reaped proc_flush_pid will flush the dentries for the entire process because the leader still has it's original pid.
Fix this by exchanging the pids in an rcu safe manner, and wrapping the code to do that up in a helper exchange_tids.
When I removed switch_exec_pids and introduced this behavior in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there really was nothing that cared as flushing happened with the cached dentry and de_thread flushed both of them on exec.
This lack of fully exchanging pids became a problem a few months later when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization"). Which overlooked the de_thread case was no longer swapping pids, and I was looking up proc dentries by task->pid.
The current behavior isn't properly a bug as everything in proc will continue to work correctly just a little bit less efficiently. Fix this just so there are no little surprise corner cases waiting to bite people.
-- Oleg points out this could be an issue in next_tgid in proc where has_group_leader_pid is called, and reording some of the assignments should fix that.
-- Oleg points out this will break the 10 year old hack in __exit_signal.c > /* > * This can only happen if the caller is de_thread(). > * FIXME: this is the temporary hack, we should teach > * posix-cpu-timers to handle this case correctly. > */ > if (unlikely(has_group_leader_pid(tsk))) > posix_cpu_timers_exit_group(tsk);
The code in next_tgid has been changed to use PIDTYPE_TGID, and the posix cpu timers code has been fixed so it does not need the 10 year old hack, so this should be safe to merge now.
Link: https://lore.kernel.org/lkml/[email protected]/ Acked-by: Linus Torvalds <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization"). Signed-off-by: Eric W. Biederman <[email protected]>
show more ...
|
| #
2374c09b |
| 24-Apr-2020 |
Christoph Hellwig <[email protected]> |
sysctl: remove all extern declaration from sysctl.c
Extern declarations in .c files are a bad style and can lead to mismatches. Use existing definitions in headers where they exist, and otherwise m
sysctl: remove all extern declaration from sysctl.c
Extern declarations in .c files are a bad style and can lead to mismatches. Use existing definitions in headers where they exist, and otherwise move the external declarations to suitable header files.
Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
|
Revision tags: v5.7-rc1 |
|
| #
63f818f4 |
| 07-Apr-2020 |
Eric W. Biederman <[email protected]> |
proc: Use a dedicated lock in struct pid
syzbot wrote: > ======================================================== > WARNING: possible irq lock inversion dependency detected > 5.6.0-syzkaller #0 Not
proc: Use a dedicated lock in struct pid
syzbot wrote: > ======================================================== > WARNING: possible irq lock inversion dependency detected > 5.6.0-syzkaller #0 Not tainted > -------------------------------------------------------- > swapper/1/0 just changed the state of lock: > ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840 > but this lock took another, SOFTIRQ-unsafe lock in the past: > (&pid->wait_pidfd){+.+.}-{2:2} > > > and interrupts could create inverse lock ordering between them. > > > other info that might help us debug this: > Possible interrupt unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(&pid->wait_pidfd); > local_irq_disable(); > lock(tasklist_lock); > lock(&pid->wait_pidfd); > <Interrupt> > lock(tasklist_lock); > > *** DEADLOCK *** > > 4 locks held by swapper/1/0:
The problem is that because wait_pidfd.lock is taken under the tasklist lock. It must always be taken with irqs disabled as tasklist_lock can be taken from interrupt context and if wait_pidfd.lock was already taken this would create a lock order inversion.
Oleg suggested just disabling irqs where I have added extra calls to wait_pidfd.lock. That should be safe and I think the code will eventually do that. It was rightly pointed out by Christian that sharing the wait_pidfd.lock was a premature optimization.
It is also true that my pre-merge window testing was insufficient. So remove the premature optimization and give struct pid a dedicated lock of it's own for struct pid things. I have verified that lockdep sees all 3 paths where we take the new pid->lock and lockdep does not complain.
It is my current day dream that one day pid->lock can be used to guard the task lists as well and then the tasklist_lock won't need to be held to deliver signals. That will require taking pid->lock with irqs disabled.
Acked-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/lkml/[email protected]/ Cc: Oleg Nesterov <[email protected]> Cc: Christian Brauner <[email protected]> Reported-by: [email protected] Reported-by: [email protected] Reported-by: [email protected] Reported-by: [email protected] Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc") Signed-off-by: "Eric W. Biederman" <[email protected]>
show more ...
|
|
Revision tags: v5.6, v5.6-rc7, v5.6-rc6, v5.6-rc5, v5.6-rc4, v5.6-rc3 |
|
| #
7bc3e6e5 |
| 20-Feb-2020 |
Eric W. Biederman <[email protected]> |
proc: Use a list of inodes to flush from proc
Rework the flushing of proc to use a list of directory inodes that need to be flushed.
The list is kept on struct pid not on struct task_struct, as the
proc: Use a list of inodes to flush from proc
Rework the flushing of proc to use a list of directory inodes that need to be flushed.
The list is kept on struct pid not on struct task_struct, as there is a fixed connection between proc inodes and pids but at least for the case of de_thread the pid of a task_struct changes.
This removes the dependency on proc_mnt which allows for different mounts of proc having different mount options even in the same pid namespace and this allows for the removal of proc_mnt which will trivially the first mount of proc to honor it's mount options.
This flushing remains an optimization. The functions pid_delete_dentry and pid_revalidate ensure that ordinary dcache management will not attempt to use dentries past the point their respective task has died. When unused the shrinker will eventually be able to remove these dentries.
There is a case in de_thread where proc_flush_pid can be called early for a given pid. Which winds up being safe (if suboptimal) as this is just an optiimization.
Only pid directories are put on the list as the other per pid files are children of those directories and d_invalidate on the directory will get them as well.
So that the pid can be used during flushing it's reference count is taken in release_task and dropped in proc_flush_pid. Further the call of proc_flush_pid is moved after the tasklist_lock is released in release_task so that it is certain that the pid has already been unhashed when flushing it taking place. This removes a small race where a dentry could recreated.
As struct pid is supposed to be small and I need a per pid lock I reuse the only lock that currently exists in struct pid the the wait_pidfd.lock.
The net result is that this adds all of this functionality with just a little extra list management overhead and a single extra pointer in struct pid.
v2: Initialize pid->inodes. I somehow failed to get that initialization into the initial version of the patch. A boot failure was reported by "kernel test robot <[email protected]>", and failure to initialize that pid->inodes matches all of the reported symptoms.
Signed-off-by: Eric W. Biederman <[email protected]>
show more ...
|
|
Revision tags: v5.6-rc2, v5.6-rc1, v5.5, v5.5-rc7, v5.5-rc6, v5.5-rc5, v5.5-rc4, v5.5-rc3, v5.5-rc2, v5.5-rc1, v5.4, v5.4-rc8 |
|
| #
49cb2fc4 |
| 15-Nov-2019 |
Adrian Reber <[email protected]> |
fork: extend clone3() to support setting a PID
The main motivation to add set_tid to clone3() is CRIU.
To restore a process with the same PID/TID CRIU currently uses /proc/sys/kernel/ns_last_pid. I
fork: extend clone3() to support setting a PID
The main motivation to add set_tid to clone3() is CRIU.
To restore a process with the same PID/TID CRIU currently uses /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to ns_last_pid and then (quickly) does a clone(). This works most of the time, but it is racy. It is also slow as it requires multiple syscalls.
Extending clone3() to support *set_tid makes it possible restore a process using CRIU without accessing /proc/sys/kernel/ns_last_pid and race free (as long as the desired PID/TID is available).
This clone3() extension places the same restrictions (CAP_SYS_ADMIN) on clone3() with *set_tid as they are currently in place for ns_last_pid.
The original version of this change was using a single value for set_tid. At the 2019 LPC, after presenting set_tid, it was, however, decided to change set_tid to an array to enable setting the PID of a process in multiple PID namespaces at the same time. If a process is created in a PID namespace it is possible to influence the PID inside and outside of the PID namespace. Details also in the corresponding selftest.
To create a process with the following PIDs:
PID NS level Requested PID 0 (host) 31496 1 42 2 1
For that example the two newly introduced parameters to struct clone_args (set_tid and set_tid_size) would need to be:
set_tid[0] = 1; set_tid[1] = 42; set_tid[2] = 31496; set_tid_size = 3;
If only the PIDs of the two innermost nested PID namespaces should be defined it would look like this:
set_tid[0] = 1; set_tid[1] = 42; set_tid_size = 2;
The PID of the newly created process would then be the next available free PID in the PID namespace level 0 (host) and 42 in the PID namespace at level 1 and the PID of the process in the innermost PID namespace would be 1.
The set_tid array is used to specify the PID of a process starting from the innermost nested PID namespaces up to set_tid_size PID namespaces.
set_tid_size cannot be larger then the current PID namespace level.
Signed-off-by: Adrian Reber <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Reviewed-by: Dmitry Safonov <[email protected]> Acked-by: Andrei Vagin <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v5.4-rc7, v5.4-rc6, v5.4-rc5, v5.4-rc4 |
|
| #
3d6d8da4 |
| 17-Oct-2019 |
Christian Brauner <[email protected]> |
pidfd: check pid has attached task in fdinfo
Currently, when a task is dead we still print the pid it used to use in the fdinfo files of its pidfds. This doesn't make much sense since the pid may ha
pidfd: check pid has attached task in fdinfo
Currently, when a task is dead we still print the pid it used to use in the fdinfo files of its pidfds. This doesn't make much sense since the pid may have already been reused. So verify that the task is still alive by introducing the pid_has_task() helper which will be used by other callers in follow-up patches. If the task is not alive anymore, we will print -1. This allows us to differentiate between a task not being present in a given pid namespace - in which case we already print 0 - and a task having been reaped.
Note that this uses PIDTYPE_PID for the check. Technically, we could've checked PIDTYPE_TGID since pidfds currently only refer to thread-group leaders but if they won't anymore in the future then this check becomes problematic without it being immediately obvious to non-experts imho. If a thread is created via clone(CLONE_THREAD) than struct pid has a single non-empty list pid->tasks[PIDTYPE_PID] and this pid can't be used as a PIDTYPE_TGID meaning pid->tasks[PIDTYPE_TGID] will return NULL even though the thread-group leader might still be very much alive. So checking PIDTYPE_PID is fine and is easier to maintain should we ever allow pidfds to refer to threads.
Cc: Jann Horn <[email protected]> Cc: Christian Kellner <[email protected]> Cc: [email protected] Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v5.4-rc3, v5.4-rc2, v5.4-rc1, v5.3, v5.3-rc8, v5.3-rc7, v5.3-rc6, v5.3-rc5, v5.3-rc4, v5.3-rc3, v5.3-rc2 |
|
| #
3695eae5 |
| 27-Jul-2019 |
Christian Brauner <[email protected]> |
pidfd: add P_PIDFD to waitid()
This adds the P_PIDFD type to waitid(). One of the last remaining bits for the pidfd api is to make it possible to wait on pidfds. With P_PIDFD added to waitid() the p
pidfd: add P_PIDFD to waitid()
This adds the P_PIDFD type to waitid(). One of the last remaining bits for the pidfd api is to make it possible to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace that want to use the pidfd api to exclusively manage processes can do so now.
One of the things this will unblock in the future is the ability to make it possible to retrieve the exit status via waitid(P_PIDFD) for non-parent processes if handed a _suitable_ pidfd that has this feature set. This is similar to what you can do on FreeBSD with kqueue(). It might even end up being possible to wait on a process as a non-parent if an appropriate property is enabled on the pidfd.
With P_PIDFD no scoping of the process identified by the pidfd is possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0), waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics equivalent to wait4(pid), waitid(P_PID). Users that need scoping should rely on pid-based wait*() syscalls for now.
Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: David Howells <[email protected]> Cc: Jann Horn <[email protected]> Cc: Andy Lutomirsky <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Aleksa Sarai <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Al Viro <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v5.3-rc1 |
|
| #
f57e515a |
| 16-Jul-2019 |
Joel Fernandes (Google) <[email protected]> |
kernel/pid.c: convert struct pid count to refcount_t
struct pid's count is an atomic_t field used as a refcount. Use refcount_t for it which is basically atomic_t but does additional checking to pr
kernel/pid.c: convert struct pid count to refcount_t
struct pid's count is an atomic_t field used as a refcount. Use refcount_t for it which is basically atomic_t but does additional checking to prevent use-after-free bugs.
For memory ordering, the only change is with the following:
- if ((atomic_read(&pid->count) == 1) || - atomic_dec_and_test(&pid->count)) { + if (refcount_dec_and_test(&pid->count)) { kmem_cache_free(ns->pid_cachep, pid);
Here the change is from: Fully ordered --> RELEASE + ACQUIRE (as per refcount-vs-atomic.rst) This ACQUIRE should take care of making sure the free happens after the refcount_dec_and_test().
The above hunk also removes atomic_read() since it is not needed for the code to work and it is unclear how beneficial it is. The removal lets refcount_dec_and_test() check for cases where get_pid() happened before the object was freed.
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Joel Fernandes (Google) <[email protected]> Reviewed-by: Andrea Parri <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Elena Reshetova <[email protected]> Cc: Jann Horn <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: KJ Tsanaktsidis <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.2, v5.2-rc7, v5.2-rc6, v5.2-rc5, v5.2-rc4, v5.2-rc3, v5.2-rc2, v5.2-rc1, v5.1 |
|
| #
b53b0b9d |
| 30-Apr-2019 |
Joel Fernandes (Google) <[email protected]> |
pidfd: add polling support
This patch adds polling support to pidfd.
Android low memory killer (LMK) needs to know when a process dies once it is sent the kill signal. It does so by checking for th
pidfd: add polling support
This patch adds polling support to pidfd.
Android low memory killer (LMK) needs to know when a process dies once it is sent the kill signal. It does so by checking for the existence of /proc/pid which is both racy and slow. For example, if a PID is reused between when LMK sends a kill signal and checks for existence of the PID, since the wrong PID is now possibly checked for existence. Using the polling support, LMK will be able to get notified when a process exists in race-free and fast way, and allows the LMK to do other things (such as by polling on other fds) while awaiting the process being killed to die.
For notification to polling processes, we follow the same existing mechanism in the kernel used when the parent of the task group is to be notified of a child's death (do_notify_parent). This is precisely when the tasks waiting on a poll of pidfd are also awakened in this patch.
We have decided to include the waitqueue in struct pid for the following reasons: 1. The wait queue has to survive for the lifetime of the poll. Including it in task_struct would not be option in this case because the task can be reaped and destroyed before the poll returns.
2. By including the struct pid for the waitqueue means that during de_thread(), the new thread group leader automatically gets the new waitqueue/pid even though its task_struct is different.
Appropriate test cases are added in the second patch to provide coverage of all the cases the patch is handling.
Cc: Andy Lutomirski <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Daniel Colascione <[email protected]> Cc: Jann Horn <[email protected]> Cc: Tim Murray <[email protected]> Cc: Jonathan Kowalski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Al Viro <[email protected]> Cc: Kees Cook <[email protected]> Cc: David Howells <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: [email protected] Reviewed-by: Oleg Nesterov <[email protected]> Co-developed-by: Daniel Colascione <[email protected]> Signed-off-by: Daniel Colascione <[email protected]> Signed-off-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v5.1-rc7, v5.1-rc6, v5.1-rc5, v5.1-rc4, v5.1-rc3 |
|
| #
b3e58382 |
| 27-Mar-2019 |
Christian Brauner <[email protected]> |
clone: add CLONE_PIDFD
This patchset makes it possible to retrieve pid file descriptors at process creation time by introducing the new flag CLONE_PIDFD to the clone() system call. Linus originally
clone: add CLONE_PIDFD
This patchset makes it possible to retrieve pid file descriptors at process creation time by introducing the new flag CLONE_PIDFD to the clone() system call. Linus originally suggested to implement this as a new flag to clone() instead of making it a separate system call. As spotted by Linus, there is exactly one bit for clone() left.
CLONE_PIDFD creates file descriptors based on the anonymous inode implementation in the kernel that will also be used to implement the new mount api. They serve as a simple opaque handle on pids. Logically, this makes it possible to interpret a pidfd differently, narrowing or widening the scope of various operations (e.g. signal sending). Thus, a pidfd cannot just refer to a tgid, but also a tid, or in theory - given appropriate flag arguments in relevant syscalls - a process group or session. A pidfd does not represent a privilege. This does not imply it cannot ever be that way but for now this is not the case.
A pidfd comes with additional information in fdinfo if the kernel supports procfs. The fdinfo file contains the pid of the process in the callers pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".
As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the parent_tidptr argument of clone. This has the advantage that we can give back the associated pid and the pidfd at the same time.
To remove worries about missing metadata access this patchset comes with a sample program that illustrates how a combination of CLONE_PIDFD, and pidfd_send_signal() can be used to gain race-free access to process metadata through /proc/<pid>. The sample program can easily be translated into a helper that would be suitable for inclusion in libc so that users don't have to worry about writing it themselves.
Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Co-developed-by: Jann Horn <[email protected]> Signed-off-by: Jann Horn <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Kees Cook <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: David Howells <[email protected]> Cc: "Michael Kerrisk (man-pages)" <[email protected]> Cc: Andy Lutomirsky <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Aleksa Sarai <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Al Viro <[email protected]>
show more ...
|