History log of /linux-6.15/include/linux/pid.h (Results 1 – 25 of 74)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2
# 7903f907 06-Feb-2025 Mateusz Guzik <[email protected]>

pid: perform free_pid() calls outside of tasklist_lock

As the clone side already executes pid allocation with only pidmap_lock
held, issuing free_pid() while still holding tasklist_lock exacerbates

pid: perform free_pid() calls outside of tasklist_lock

As the clone side already executes pid allocation with only pidmap_lock
held, issuing free_pid() while still holding tasklist_lock exacerbates
total hold time of the latter.

More things may show up later which require initial clean up with the
lock held and allow finishing without it. For that reason a struct to
collect such work is added instead of merely passing the pid array.

Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: "Liam R. Howlett" <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3
# 16ecd47c 14-Dec-2024 Christian Brauner <[email protected]>

pidfs: lookup pid through rbtree

The new pid inode number allocation scheme is neat but I overlooked a
possible, even though unlikely, attack that can be used to trigger an
overflow on both 32bit an

pidfs: lookup pid through rbtree

The new pid inode number allocation scheme is neat but I overlooked a
possible, even though unlikely, attack that can be used to trigger an
overflow on both 32bit and 64bit.

An unique 64 bit identifier was constructed for each struct pid by two
combining a 32 bit idr with a 32 bit generation number. A 32bit number
was allocated using the idr_alloc_cyclic() infrastructure. When the idr
wrapped around a 32 bit wraparound counter was incremented. The 32 bit
wraparound counter served as the upper 32 bits and the allocated idr
number as the lower 32 bits.

Since the idr can only allocate up to INT_MAX entries everytime a
wraparound happens INT_MAX - 1 entries are lost (Ignoring that numbering
always starts at 2 to avoid theoretical collisions with the root inode
number.).

If userspace fully populates the idr such that and puts itself into
control of two entries such that one entry is somewhere in the middle
and the other entry is the INT_MAX entry then it is possible to overflow
the wraparound counter. That is probably difficult to pull off but the
mere possibility is annoying.

The problem could be contained to 32 bit by switching to a data
structure such as the maple tree that allows allocating 64 bit numbers
on 64 bit machines. That would leave 32 bit in a lurch but that probably
doesn't matter that much. The other problem is that removing entries
form the maple tree is somewhat non-trivial because the removal code can
be called under the irq write lock of tasklist_lock and
irq{save,restore} code.

Instead, allocate unique identifiers for struct pid by simply
incrementing a 64 bit counter and insert each struct pid into the rbtree
so it can be looked up to decode file handles avoiding to leak actual
pids across pid namespaces in file handles.

On both 64 bit and 32 bit the same 64 bit identifier is used to lookup
struct pid in the rbtree. On 64 bit the unique identifier for struct pid
simply becomes the inode number. Comparing two pidfds continues to be as
simple as comparing inode numbers.

On 32 bit the 64 bit number assigned to struct pid is split into two 32
bit numbers. The lower 32 bits are used as the inode number and the
upper 32 bits are used as the inode generation number. Whenever a
wraparound happens on 32 bit the 64 bit number will be incremented by 2
so inode numbering starts at 2 again.

When a wraparound happens on 32 bit multiple pidfds with the same inode
number are likely to exist. This isn't a problem since before pidfs
pidfds used the anonymous inode meaning all pidfds had the same inode
number. On 32 bit sserspace can thus reconstruct the 64 bit identifier
by retrieving both the inode number and the inode generation number to
compare, or use file handles. This gives the same guarantees on both 32
bit and 64 bit.

Link: https://lore.kernel.org/r/20241214-gekoppelt-erdarbeiten-a1f9a982a5a6@brauner
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.13-rc2, v6.13-rc1
# 7863dcc7 22-Nov-2024 Christian Brauner <[email protected]>

pid: allow pid_max to be set per pid namespace

The pid_max sysctl is a global value. For a long time the default value
has been 65535 and during the pidfd dicussions Linus proposed to bump
pid_max b

pid: allow pid_max to be set per pid namespace

The pid_max sysctl is a global value. For a long time the default value
has been 65535 and during the pidfd dicussions Linus proposed to bump
pid_max by default (cf. [1]). Based on this discussion systemd started
bumping pid_max to 2^22. So all new systems now run with a very high
pid_max limit with some distros having also backported that change.
The decision to bump pid_max is obviously correct. It just doesn't make
a lot of sense nowadays to enforce such a low pid number. There's
sufficient tooling to make selecting specific processes without typing
really large pid numbers available.

In any case, there are workloads that have expections about how large
pid numbers they accept. Either for historical reasons or architectural
reasons. One concreate example is the 32-bit version of Android's bionic
libc which requires pid numbers less than 65536. There are workloads
where it is run in a 32-bit container on a 64-bit kernel. If the host
has a pid_max value greater than 65535 the libc will abort thread
creation because of size assumptions of pthread_mutex_t.

That's a fairly specific use-case however, in general specific workloads
that are moved into containers running on a host with a new kernel and a
new systemd can run into issues with large pid_max values. Obviously
making assumptions about the size of the allocated pid is suboptimal but
we have userspace that does it.

Of course, giving containers the ability to restrict the number of
processes in their respective pid namespace indepent of the global limit
through pid_max is something desirable in itself and comes in handy in
general.

Independent of motivating use-cases the existence of pid namespaces
makes this also a good semantical extension and there have been prior
proposals pushing in a similar direction.
The trick here is to minimize the risk of regressions which I think is
doable. The fact that pid namespaces are hierarchical will help us here.

What we mostly care about is that when the host sets a low pid_max
limit, say (crazy number) 100 that no descendant pid namespace can
allocate a higher pid number in its namespace. Since pid allocation is
hierarchial this can be ensured by checking each pid allocation against
the pid namespace's pid_max limit. This means if the allocation in the
descendant pid namespace succeeds, the ancestor pid namespace can reject
it. If the ancestor pid namespace has a higher limit than the descendant
pid namespace the descendant pid namespace will reject the pid
allocation. The ancestor pid namespace will obviously not care about
this.
All in all this means pid_max continues to enforce a system wide limit
on the number of processes but allows pid namespaces sufficient leeway
in handling workloads with assumptions about pid values and allows
containers to restrict the number of processes in a pid namespace
through the pid_max interface.

[1]: https://lore.kernel.org/linux-api/CAHk-=wiZ40LVjnXSi9iHLE_-ZBsWFGCgdmNiYZUXn1-V5YBg2g@mail.gmail.com
- rebased from 5.14-rc1
- a few fixes (missing ns_free_inum on error path, missing initialization, etc)
- permission check changes in pid_table_root_permissions
- unsigned int pid_max -> int pid_max (keep pid_max type as it was)
- add READ_ONCE in alloc_pid() as suggested by Christian
- rebased from 6.7 and take into account:
* sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table)
* sysctl: treewide: constify ctl_table_header::ctl_table_arg
* pidfd: add pidfs
* tracing: Move saved_cmdline code into trace_sched_switch.c

Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1
# 9d9539db 12-Mar-2024 Christian Brauner <[email protected]>

pidfs: remove config option

As Linus suggested this enables pidfs unconditionally. A key property to
retain is the ability to compare pidfds by inode number (cf. [1]).
That's extremely helpful just

pidfs: remove config option

As Linus suggested this enables pidfs unconditionally. A key property to
retain is the ability to compare pidfds by inode number (cf. [1]).
That's extremely helpful just as comparing namespace file descriptors by
inode number is. They are used in a variety of scenarios where they need
to be compared, e.g., when receiving a pidfd via SO_PEERPIDFD from a
socket to trivially authenticate a the sender and various other
use-cases.

For 64bit systems this is pretty trivial to do. For 32bit it's slightly
more annoying as we discussed but we simply add a dumb ida based
allocator that gets used on 32bit. This gives the same guarantees about
inode numbers on 64bit without any overflow risk. Practically, we'll
never run into overflow issues because we're constrained by the number
of processes that can exist on 32bit and by the number of open files
that can exist on a 32bit system. On 64bit none of this matters and
things are very simple.

If 32bit also needs the uniqueness guarantee they can simply parse the
contents of /proc/<pid>/fd/<nr>. The uniqueness guarantees have a
variety of use-cases. One of the most obvious ones is that they will
make pidfiles (or "pidfdfiles", I guess) reliable as the unique
identifier can be placed into there that won't be reycled. Also a
frequent request.

Note, I took the chance and simplified path_from_stashed() even further.
Instead of passing the inode number explicitly to path_from_stashed() we
let the filesystem handle that internally. So path_from_stashed() ends
up even simpler than it is now. This is also a good solution allowing
the cleanup code to be clean and consistent between 32bit and 64bit. The
cleanup path in prepare_anon_dentry() is also switched around so we put
the inode before the dentry allocation. This means we only have to call
the cleanup handler for the filesystem's inode data once and can rely
->evict_inode() otherwise.

Aside from having to have a bit of extra code for 32bit it actually ends
up a nice cleanup for path_from_stashed() imho.

Tested on both 32 and 64bit including error injection.

Link: https://github.com/systemd/systemd/pull/31713 [1]
Link: https://lore.kernel.org/r/20240312-dingo-sehnlich-b3ecc35c6de7@brauner
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


Revision tags: v6.8, v6.8-rc7, v6.8-rc6
# b28ddcc3 19-Feb-2024 Christian Brauner <[email protected]>

pidfs: convert to path_from_stashed() helper

Moving pidfds from the anonymous inode infrastructure to a separate tiny
in-kernel filesystem similar to sockfs, pipefs, and anon_inodefs causes
selinux

pidfs: convert to path_from_stashed() helper

Moving pidfds from the anonymous inode infrastructure to a separate tiny
in-kernel filesystem similar to sockfs, pipefs, and anon_inodefs causes
selinux denials and thus various userspace components that make heavy
use of pidfds to fail as pidfds used anon_inode_getfile() which aren't
subject to any LSM hooks. But dentry_open() is and that would cause
regressions.

The failures that are seen are selinux denials. But the core failure is
dbus-broker. That cascades into other services failing that depend on
dbus-broker. For example, when dbus-broker fails to start polkit and all
the others won't be able to work because they depend on dbus-broker.

The reason for dbus-broker failing is because it doesn't handle failures
for SO_PEERPIDFD correctly. Last kernel release we introduced
SO_PEERPIDFD (and SCM_PIDFD). SO_PEERPIDFD allows dbus-broker and polkit
and others to receive a pidfd for the peer of an AF_UNIX socket. This is
the first time in the history of Linux that we can safely authenticate
clients in a race-free manner.

dbus-broker immediately made use of this but messed up the error
checking. It only allowed EINVAL as a valid failure for SO_PEERPIDFD.
That's obviously problematic not just because of LSM denials but because
of seccomp denials that would prevent SO_PEERPIDFD from working; or any
other new error code from there.

So this is catching a flawed implementation in dbus-broker as well. It
has to fallback to the old pid-based authentication when SO_PEERPIDFD
doesn't work no matter the reasons otherwise it'll always risk such
failures. So overall that LSM denial should not have caused dbus-broker
to fail. It can never assume that a feature released one kernel ago like
SO_PEERPIDFD can be assumed to be available.

So, the next fix separate from the selinux policy update is to try and
fix dbus-broker at [3]. That should make it into Fedora as well. In
addition the selinux reference policy should also be updated. See [4]
for that. If Selinux is in enforcing mode in userspace and it encounters
anything that it doesn't know about it will deny it by default. And the
policy is entirely in userspace including declaring new types for stuff
like nsfs or pidfs to allow it.

For now we continue to raise S_PRIVATE on the inode if it's a pidfs
inode which means things behave exactly like before.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=2265630
Link: https://github.com/fedora-selinux/selinux-policy/pull/2050
Link: https://github.com/bus1/dbus-broker/pull/343 [3]
Link: https://github.com/SELinuxProject/refpolicy/pull/762 [4]
Reported-by: Nathan Chancellor <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/20240218-neufahrzeuge-brauhaus-fb0eb6459771@brauner
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.8-rc5
# cb12fd8e 12-Feb-2024 Christian Brauner <[email protected]>

pidfd: add pidfs

This moves pidfds from the anonymous inode infrastructure to a tiny
pseudo filesystem. This has been on my todo for quite a while as it will
unblock further work that we weren't abl

pidfd: add pidfs

This moves pidfds from the anonymous inode infrastructure to a tiny
pseudo filesystem. This has been on my todo for quite a while as it will
unblock further work that we weren't able to do simply because of the
very justified limitations of anonymous inodes. Moving pidfds to a tiny
pseudo filesystem allows:

* statx() on pidfds becomes useful for the first time.
* pidfds can be compared simply via statx() and then comparing inode
numbers.
* pidfds have unique inode numbers for the system lifetime.
* struct pid is now stashed in inode->i_private instead of
file->private_data. This means it is now possible to introduce
concepts that operate on a process once all file descriptors have been
closed. A concrete example is kill-on-last-close.
* file->private_data is freed up for per-file options for pidfds.
* Each struct pid will refer to a different inode but the same struct
pid will refer to the same inode if it's opened multiple times. In
contrast to now where each struct pid refers to the same inode. Even
if we were to move to anon_inode_create_getfile() which creates new
inodes we'd still be associating the same struct pid with multiple
different inodes.

The tiny pseudo filesystem is not visible anywhere in userspace exactly
like e.g., pipefs and sockfs. There's no lookup, there's no complex
inode operations, nothing. Dentries and inodes are always deleted when
the last pidfd is closed.

We allocate a new inode for each struct pid and we reuse that inode for
all pidfds. We use iget_locked() to find that inode again based on the
inode number which isn't recycled. We allocate a new dentry for each
pidfd that uses the same inode. That is similar to anonymous inodes
which reuse the same inode for thousands of dentries. For pidfds we're
talking way less than that. There usually won't be a lot of concurrent
openers of the same struct pid. They can probably often be counted on
two hands. I know that systemd does use separate pidfd for the same
struct pid for various complex process tracking issues. So I think with
that things actually become way simpler. Especially because we don't
have to care about lookup. Dentries and inodes continue to be always
deleted.

The code is entirely optional and fairly small. If it's not selected we
fallback to anonymous inodes. Heavily inspired by nsfs which uses a
similar stashing mechanism just for namespaces.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.8-rc4, v6.8-rc3
# 64bef697 31-Jan-2024 Oleg Nesterov <[email protected]>

pidfd: implement PIDFD_THREAD flag for pidfd_open()

With this flag:

- pidfd_open() doesn't require that the target task must be
a thread-group leader

- pidfd_poll() succeeds when the task exi

pidfd: implement PIDFD_THREAD flag for pidfd_open()

With this flag:

- pidfd_open() doesn't require that the target task must be
a thread-group leader

- pidfd_poll() succeeds when the task exits and becomes a
zombie (iow, passes exit_notify()), even if it is a leader
and thread-group is not empty.

This means that the behaviour of pidfd_poll(PIDFD_THREAD,
pid-of-group-leader) is not well defined if it races with
exec() from its sub-thread; pidfd_poll() can succeed or not
depending on whether pidfd_task_exited() is called before
or after exchange_tids().

Perhaps we can improve this behaviour later, pidfd_poll()
can probably take sig->group_exec_task into account. But
this doesn't really differ from the case when the leader
exits before other threads (so pidfd_poll() succeeds) and
then another thread execs and pidfd_poll() will block again.

thread_group_exited() is no longer used, perhaps it can die.

Co-developed-by: Tycho Andersen <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Tested-by: Tycho Andersen <[email protected]>
Reviewed-by: Tycho Andersen <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.8-rc2
# cdefbf23 25-Jan-2024 Oleg Nesterov <[email protected]>

pidfd: cleanup the usage of __pidfd_prepare's flags

- make pidfd_create() static.

- Don't pass O_RDWR | O_CLOEXEC to __pidfd_prepare() in copy_process(),
__pidfd_prepare() adds these flags uncond

pidfd: cleanup the usage of __pidfd_prepare's flags

- make pidfd_create() static.

- Don't pass O_RDWR | O_CLOEXEC to __pidfd_prepare() in copy_process(),
__pidfd_prepare() adds these flags unconditionally.

- Kill the flags check in __pidfd_prepare(). sys_pidfd_open() checks the
flags itself, all other users of pidfd_prepare() pass flags = 0.

If we need a sanity check for those other in kernel users then
WARN_ON_ONCE(flags & ~PIDFD_NONBLOCK) makes more sense.

- Don't pass O_RDWR to get_unused_fd_flags(), it ignores everything except
O_CLOEXEC.

- Don't pass O_CLOEXEC to anon_inode_getfile(), it ignores everything except
O_ACCMODE | O_NONBLOCK.

Signed-off-by: Oleg Nesterov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6
# f551103c 15-Dec-2023 Kent Overstreet <[email protected]>

sched.h: move pid helpers to pid.h

This is needed for killing the sched.h dependency on rcupdate.h, and
pid.h is a better place for this code anyways.

Signed-off-by: Kent Overstreet <kent.overstree

sched.h: move pid helpers to pid.h

This is needed for killing the sched.h dependency on rcupdate.h, and
pid.h is a better place for this code anyways.

Signed-off-by: Kent Overstreet <[email protected]>

show more ...


# 6d5e9d63 11-Dec-2023 Kent Overstreet <[email protected]>

pid: Split out pid_types.h

Trimming down sched.h dependencies: we dont't want to include more than
the base types.

Cc: Kees Cook <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc

pid: Split out pid_types.h

Trimming down sched.h dependencies: we dont't want to include more than
the base types.

Cc: Kees Cook <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Will Drewry <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>

show more ...


Revision tags: v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1
# b69f0aeb 30-Jun-2023 Kees Cook <[email protected]>

pid: Replace struct pid 1-element array with flex-array

For pid namespaces, struct pid uses a dynamically sized array member,
"numbers". This was implemented using the ancient 1-element fake
flexib

pid: Replace struct pid 1-element array with flex-array

For pid namespaces, struct pid uses a dynamically sized array member,
"numbers". This was implemented using the ancient 1-element fake
flexible array, which has been deprecated for decades.

Replace it with a C99 flexible array, refactor the array size
calculations to use struct_size(), and address elements via indexes.
Note that the static initializer (which defines a single element) works
as-is, and requires no special handling.

Without this, CONFIG_UBSAN_BOUNDS (and potentially
CONFIG_FORTIFY_SOURCE) will trigger bounds checks:

https://lore.kernel.org/lkml/20230517-bushaltestelle-super-e223978c1ba6@brauner

Cc: Christian Brauner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Andreas Gruenbacher <[email protected]>
Cc: Daniel Verkamp <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Reported-by: [email protected]
[brauner: dropped unrelated changes and remove 0 with NULL cast]
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


Revision tags: v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5
# 6ae930d9 27-Mar-2023 Christian Brauner <[email protected]>

pid: add pidfd_prepare()

Add a new helper that allows to reserve a pidfd and allocates a new
pidfd file that stashes the provided struct pid. This will allow us to
remove places that either open cod

pid: add pidfd_prepare()

Add a new helper that allows to reserve a pidfd and allocates a new
pidfd file that stashes the provided struct pid. This will allow us to
remove places that either open code this function or that call
pidfd_create() but then have to call close_fd() because there are still
failure points after pidfd_create() has been called.

Reviewed-by: Jan Kara <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6
# e9bdcdbf 11-Oct-2021 Christian Brauner <[email protected]>

pid: add pidfd_get_task() helper

The number of system calls making use of pidfds is constantly
increasing. Some of those new system calls duplicate the code to turn a
pidfd into task_struct it refer

pid: add pidfd_get_task() helper

The number of system calls making use of pidfds is constantly
increasing. Some of those new system calls duplicate the code to turn a
pidfd into task_struct it refers to. Give them a simple helper for this.

Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Cc: Vlastimil Babka <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Matthew Bobrowski <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Minchan Kim <[email protected]>
Reviewed-by: Matthew Bobrowski <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5
# c576e0fc 08-Aug-2021 Matthew Bobrowski <[email protected]>

kernel/pid.c: remove static qualifier from pidfd_create()

With the idea of returning pidfds from the fanotify API, we need to
expose a mechanism for creating pidfds. We drop the static qualifier
fro

kernel/pid.c: remove static qualifier from pidfd_create()

With the idea of returning pidfds from the fanotify API, we need to
expose a mechanism for creating pidfds. We drop the static qualifier
from pidfd_create() and add its declaration to linux/pid.h so that the
pidfd_create() helper can be called from other kernel subsystems
i.e. fanotify.

Link: https://lore.kernel.org/r/0c68653ec32f1b7143301f0231f7ed14062fd82b.1628398044.git.repnop@google.com
Signed-off-by: Matthew Bobrowski <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Signed-off-by: Jan Kara <[email protected]>

show more ...


Revision tags: v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1
# 1aa92cd3 17-Oct-2020 Minchan Kim <[email protected]>

pid: move pidfd_get_pid() to pid.c

process_madvise syscall needs pidfd_get_pid function to translate pidfd to
pid so this patch move the function to kernel/pid.c.

Suggested-by: Alexander Duyck <ale

pid: move pidfd_get_pid() to pid.c

process_madvise syscall needs pidfd_get_pid function to translate pidfd to
pid so this patch move the function to kernel/pid.c.

Suggested-by: Alexander Duyck <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Daniel Colascione <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Dias <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oleksandr Natalenko <[email protected]>
Cc: Sandeep Patil <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Sonny Rao <[email protected]>
Cc: Tim Murray <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


Revision tags: v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1, v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5, v5.8-rc4, v5.8-rc3, v5.8-rc2, v5.8-rc1, v5.7, v5.7-rc7, v5.7-rc6, v5.7-rc5, v5.7-rc4, v5.7-rc3, v5.7-rc2
# 6b03d130 19-Apr-2020 Eric W. Biederman <[email protected]>

proc: Ensure we see the exit of each process tid exactly once

When the thread group leader changes during exec and the old leaders
thread is reaped proc_flush_pid will flush the dentries for the ent

proc: Ensure we see the exit of each process tid exactly once

When the thread group leader changes during exec and the old leaders
thread is reaped proc_flush_pid will flush the dentries for the entire
process because the leader still has it's original pid.

Fix this by exchanging the pids in an rcu safe manner,
and wrapping the code to do that up in a helper exchange_tids.

When I removed switch_exec_pids and introduced this behavior
in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there
really was nothing that cared as flushing happened with
the cached dentry and de_thread flushed both of them on exec.

This lack of fully exchanging pids became a problem a few months later
when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry
flush on exit optimization"). Which overlooked the de_thread case
was no longer swapping pids, and I was looking up proc dentries
by task->pid.

The current behavior isn't properly a bug as everything in proc will
continue to work correctly just a little bit less efficiently. Fix
this just so there are no little surprise corner cases waiting to bite
people.

-- Oleg points out this could be an issue in next_tgid in proc where
has_group_leader_pid is called, and reording some of the assignments
should fix that.

-- Oleg points out this will break the 10 year old hack in __exit_signal.c
> /*
> * This can only happen if the caller is de_thread().
> * FIXME: this is the temporary hack, we should teach
> * posix-cpu-timers to handle this case correctly.
> */
> if (unlikely(has_group_leader_pid(tsk)))
> posix_cpu_timers_exit_group(tsk);

The code in next_tgid has been changed to use PIDTYPE_TGID,
and the posix cpu timers code has been fixed so it does not
need the 10 year old hack, so this should be safe to merge
now.

Link: https://lore.kernel.org/lkml/[email protected]/
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
Signed-off-by: Eric W. Biederman <[email protected]>

show more ...


# 2374c09b 24-Apr-2020 Christoph Hellwig <[email protected]>

sysctl: remove all extern declaration from sysctl.c

Extern declarations in .c files are a bad style and can lead to
mismatches. Use existing definitions in headers where they exist,
and otherwise m

sysctl: remove all extern declaration from sysctl.c

Extern declarations in .c files are a bad style and can lead to
mismatches. Use existing definitions in headers where they exist,
and otherwise move the external declarations to suitable header
files.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Al Viro <[email protected]>

show more ...


Revision tags: v5.7-rc1
# 63f818f4 07-Apr-2020 Eric W. Biederman <[email protected]>

proc: Use a dedicated lock in struct pid

syzbot wrote:
> ========================================================
> WARNING: possible irq lock inversion dependency detected
> 5.6.0-syzkaller #0 Not

proc: Use a dedicated lock in struct pid

syzbot wrote:
> ========================================================
> WARNING: possible irq lock inversion dependency detected
> 5.6.0-syzkaller #0 Not tainted
> --------------------------------------------------------
> swapper/1/0 just changed the state of lock:
> ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
> but this lock took another, SOFTIRQ-unsafe lock in the past:
> (&pid->wait_pidfd){+.+.}-{2:2}
>
>
> and interrupts could create inverse lock ordering between them.
>
>
> other info that might help us debug this:
> Possible interrupt unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&pid->wait_pidfd);
> local_irq_disable();
> lock(tasklist_lock);
> lock(&pid->wait_pidfd);
> <Interrupt>
> lock(tasklist_lock);
>
> *** DEADLOCK ***
>
> 4 locks held by swapper/1/0:

The problem is that because wait_pidfd.lock is taken under the tasklist
lock. It must always be taken with irqs disabled as tasklist_lock can be
taken from interrupt context and if wait_pidfd.lock was already taken this
would create a lock order inversion.

Oleg suggested just disabling irqs where I have added extra calls to
wait_pidfd.lock. That should be safe and I think the code will eventually
do that. It was rightly pointed out by Christian that sharing the
wait_pidfd.lock was a premature optimization.

It is also true that my pre-merge window testing was insufficient. So
remove the premature optimization and give struct pid a dedicated lock of
it's own for struct pid things. I have verified that lockdep sees all 3
paths where we take the new pid->lock and lockdep does not complain.

It is my current day dream that one day pid->lock can be used to guard the
task lists as well and then the tasklist_lock won't need to be held to
deliver signals. That will require taking pid->lock with irqs disabled.

Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Cc: Oleg Nesterov <[email protected]>
Cc: Christian Brauner <[email protected]>
Reported-by: [email protected]
Reported-by: [email protected]
Reported-by: [email protected]
Reported-by: [email protected]
Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
Signed-off-by: "Eric W. Biederman" <[email protected]>

show more ...


Revision tags: v5.6, v5.6-rc7, v5.6-rc6, v5.6-rc5, v5.6-rc4, v5.6-rc3
# 7bc3e6e5 20-Feb-2020 Eric W. Biederman <[email protected]>

proc: Use a list of inodes to flush from proc

Rework the flushing of proc to use a list of directory inodes that
need to be flushed.

The list is kept on struct pid not on struct task_struct, as the

proc: Use a list of inodes to flush from proc

Rework the flushing of proc to use a list of directory inodes that
need to be flushed.

The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.

This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially the first mount of proc to honor it's mount options.

This flushing remains an optimization. The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died. When unused the shrinker will
eventually be able to remove these dentries.

There is a case in de_thread where proc_flush_pid can be
called early for a given pid. Which winds up being
safe (if suboptimal) as this is just an optiimization.

Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.

So that the pid can be used during flushing it's reference count is
taken in release_task and dropped in proc_flush_pid. Further the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing it taking place. This removes a small race
where a dentry could recreated.

As struct pid is supposed to be small and I need a per pid lock
I reuse the only lock that currently exists in struct pid the
the wait_pidfd.lock.

The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.

v2: Initialize pid->inodes. I somehow failed to get that
initialization into the initial version of the patch. A boot
failure was reported by "kernel test robot <[email protected]>", and
failure to initialize that pid->inodes matches all of the reported
symptoms.

Signed-off-by: Eric W. Biederman <[email protected]>

show more ...


Revision tags: v5.6-rc2, v5.6-rc1, v5.5, v5.5-rc7, v5.5-rc6, v5.5-rc5, v5.5-rc4, v5.5-rc3, v5.5-rc2, v5.5-rc1, v5.4, v5.4-rc8
# 49cb2fc4 15-Nov-2019 Adrian Reber <[email protected]>

fork: extend clone3() to support setting a PID

The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. I

fork: extend clone3() to support setting a PID

The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.

The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.

To create a process with the following PIDs:

PID NS level Requested PID
0 (host) 31496
1 42
2 1

For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:

set_tid[0] = 1;
set_tid[1] = 42;
set_tid[2] = 31496;
set_tid_size = 3;

If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:

set_tid[0] = 1;
set_tid[1] = 42;
set_tid_size = 2;

The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.

The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.

set_tid_size cannot be larger then the current PID namespace level.

Signed-off-by: Adrian Reber <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Reviewed-by: Dmitry Safonov <[email protected]>
Acked-by: Andrei Vagin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v5.4-rc7, v5.4-rc6, v5.4-rc5, v5.4-rc4
# 3d6d8da4 17-Oct-2019 Christian Brauner <[email protected]>

pidfd: check pid has attached task in fdinfo

Currently, when a task is dead we still print the pid it used to use in
the fdinfo files of its pidfds. This doesn't make much sense since the
pid may ha

pidfd: check pid has attached task in fdinfo

Currently, when a task is dead we still print the pid it used to use in
the fdinfo files of its pidfds. This doesn't make much sense since the
pid may have already been reused. So verify that the task is still alive
by introducing the pid_has_task() helper which will be used by other
callers in follow-up patches.
If the task is not alive anymore, we will print -1. This allows us to
differentiate between a task not being present in a given pid namespace
- in which case we already print 0 - and a task having been reaped.

Note that this uses PIDTYPE_PID for the check. Technically, we could've
checked PIDTYPE_TGID since pidfds currently only refer to thread-group
leaders but if they won't anymore in the future then this check becomes
problematic without it being immediately obvious to non-experts imho. If
a thread is created via clone(CLONE_THREAD) than struct pid has a single
non-empty list pid->tasks[PIDTYPE_PID] and this pid can't be used as a
PIDTYPE_TGID meaning pid->tasks[PIDTYPE_TGID] will return NULL even
though the thread-group leader might still be very much alive. So
checking PIDTYPE_PID is fine and is easier to maintain should we ever
allow pidfds to refer to threads.

Cc: Jann Horn <[email protected]>
Cc: Christian Kellner <[email protected]>
Cc: [email protected]
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


Revision tags: v5.4-rc3, v5.4-rc2, v5.4-rc1, v5.3, v5.3-rc8, v5.3-rc7, v5.3-rc6, v5.3-rc5, v5.3-rc4, v5.3-rc3, v5.3-rc2
# 3695eae5 27-Jul-2019 Christian Brauner <[email protected]>

pidfd: add P_PIDFD to waitid()

This adds the P_PIDFD type to waitid().
One of the last remaining bits for the pidfd api is to make it possible
to wait on pidfds. With P_PIDFD added to waitid() the p

pidfd: add P_PIDFD to waitid()

This adds the P_PIDFD type to waitid().
One of the last remaining bits for the pidfd api is to make it possible
to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace
that want to use the pidfd api to exclusively manage processes can do so
now.

One of the things this will unblock in the future is the ability to make
it possible to retrieve the exit status via waitid(P_PIDFD) for
non-parent processes if handed a _suitable_ pidfd that has this feature
set. This is similar to what you can do on FreeBSD with kqueue(). It
might even end up being possible to wait on a process as a non-parent if
an appropriate property is enabled on the pidfd.

With P_PIDFD no scoping of the process identified by the pidfd is
possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0),
waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics
equivalent to wait4(pid), waitid(P_PID). Users that need scoping should
rely on pid-based wait*() syscalls for now.

Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


Revision tags: v5.3-rc1
# f57e515a 16-Jul-2019 Joel Fernandes (Google) <[email protected]>

kernel/pid.c: convert struct pid count to refcount_t

struct pid's count is an atomic_t field used as a refcount. Use
refcount_t for it which is basically atomic_t but does additional
checking to pr

kernel/pid.c: convert struct pid count to refcount_t

struct pid's count is an atomic_t field used as a refcount. Use
refcount_t for it which is basically atomic_t but does additional
checking to prevent use-after-free bugs.

For memory ordering, the only change is with the following:

- if ((atomic_read(&pid->count) == 1) ||
- atomic_dec_and_test(&pid->count)) {
+ if (refcount_dec_and_test(&pid->count)) {
kmem_cache_free(ns->pid_cachep, pid);

Here the change is from: Fully ordered --> RELEASE + ACQUIRE (as per
refcount-vs-atomic.rst) This ACQUIRE should take care of making sure the
free happens after the refcount_dec_and_test().

The above hunk also removes atomic_read() since it is not needed for the
code to work and it is unclear how beneficial it is. The removal lets
refcount_dec_and_test() check for cases where get_pid() happened before
the object was freed.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Reviewed-by: Andrea Parri <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Elena Reshetova <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: KJ Tsanaktsidis <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


Revision tags: v5.2, v5.2-rc7, v5.2-rc6, v5.2-rc5, v5.2-rc4, v5.2-rc3, v5.2-rc2, v5.2-rc1, v5.1
# b53b0b9d 30-Apr-2019 Joel Fernandes (Google) <[email protected]>

pidfd: add polling support

This patch adds polling support to pidfd.

Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for th

pidfd: add polling support

This patch adds polling support to pidfd.

Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.
Using the polling support, LMK will be able to get notified when a process
exists in race-free and fast way, and allows the LMK to do other things
(such as by polling on other fds) while awaiting the process being killed
to die.

For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.

We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
it in task_struct would not be option in this case because the task can
be reaped and destroyed before the poll returns.

2. By including the struct pid for the waitqueue means that during
de_thread(), the new thread group leader automatically gets the new
waitqueue/pid even though its task_struct is different.

Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.

Cc: Andy Lutomirski <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Daniel Colascione <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Tim Murray <[email protected]>
Cc: Jonathan Kowalski <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: David Howells <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: [email protected]
Reviewed-by: Oleg Nesterov <[email protected]>
Co-developed-by: Daniel Colascione <[email protected]>
Signed-off-by: Daniel Colascione <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v5.1-rc7, v5.1-rc6, v5.1-rc5, v5.1-rc4, v5.1-rc3
# b3e58382 27-Mar-2019 Christian Brauner <[email protected]>

clone: add CLONE_PIDFD

This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call. Linus originally

clone: add CLONE_PIDFD

This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call. Linus originally suggested to implement this as a
new flag to clone() instead of making it a separate system call. As
spotted by Linus, there is exactly one bit for clone() left.

CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api. They serve as a simple opaque handle on pids. Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending). Thus, a
pidfd cannot just refer to a tgid, but also a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege. This does not imply it
cannot ever be that way but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports
procfs. The fdinfo file contains the pid of the process in the callers
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone. This has the advantage that we can
give back the associated pid and the pidfd at the same time.

To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>. The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Co-developed-by: Jann Horn <[email protected]>
Signed-off-by: Jann Horn <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>

show more ...


123