History log of /linux-6.15/include/linux/pid_namespace.h (Results 1 – 25 of 51)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1
# 7863dcc7 22-Nov-2024 Christian Brauner <[email protected]>

pid: allow pid_max to be set per pid namespace

The pid_max sysctl is a global value. For a long time the default value
has been 65535 and during the pidfd dicussions Linus proposed to bump
pid_max b

pid: allow pid_max to be set per pid namespace

The pid_max sysctl is a global value. For a long time the default value
has been 65535 and during the pidfd dicussions Linus proposed to bump
pid_max by default (cf. [1]). Based on this discussion systemd started
bumping pid_max to 2^22. So all new systems now run with a very high
pid_max limit with some distros having also backported that change.
The decision to bump pid_max is obviously correct. It just doesn't make
a lot of sense nowadays to enforce such a low pid number. There's
sufficient tooling to make selecting specific processes without typing
really large pid numbers available.

In any case, there are workloads that have expections about how large
pid numbers they accept. Either for historical reasons or architectural
reasons. One concreate example is the 32-bit version of Android's bionic
libc which requires pid numbers less than 65536. There are workloads
where it is run in a 32-bit container on a 64-bit kernel. If the host
has a pid_max value greater than 65535 the libc will abort thread
creation because of size assumptions of pthread_mutex_t.

That's a fairly specific use-case however, in general specific workloads
that are moved into containers running on a host with a new kernel and a
new systemd can run into issues with large pid_max values. Obviously
making assumptions about the size of the allocated pid is suboptimal but
we have userspace that does it.

Of course, giving containers the ability to restrict the number of
processes in their respective pid namespace indepent of the global limit
through pid_max is something desirable in itself and comes in handy in
general.

Independent of motivating use-cases the existence of pid namespaces
makes this also a good semantical extension and there have been prior
proposals pushing in a similar direction.
The trick here is to minimize the risk of regressions which I think is
doable. The fact that pid namespaces are hierarchical will help us here.

What we mostly care about is that when the host sets a low pid_max
limit, say (crazy number) 100 that no descendant pid namespace can
allocate a higher pid number in its namespace. Since pid allocation is
hierarchial this can be ensured by checking each pid allocation against
the pid namespace's pid_max limit. This means if the allocation in the
descendant pid namespace succeeds, the ancestor pid namespace can reject
it. If the ancestor pid namespace has a higher limit than the descendant
pid namespace the descendant pid namespace will reject the pid
allocation. The ancestor pid namespace will obviously not care about
this.
All in all this means pid_max continues to enforce a system wide limit
on the number of processes but allows pid namespaces sufficient leeway
in handling workloads with assumptions about pid values and allows
containers to restrict the number of processes in a pid namespace
through the pid_max interface.

[1]: https://lore.kernel.org/linux-api/CAHk-=wiZ40LVjnXSi9iHLE_-ZBsWFGCgdmNiYZUXn1-V5YBg2g@mail.gmail.com
- rebased from 5.14-rc1
- a few fixes (missing ns_free_inum on error path, missing initialization, etc)
- permission check changes in pid_table_root_permissions
- unsigned int pid_max -> int pid_max (keep pid_max type as it was)
- add READ_ONCE in alloc_pid() as suggested by Christian
- rebased from 6.7 and take into account:
* sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table)
* sysctl: treewide: constify ctl_table_header::ctl_table_arg
* pidfd: add pidfs
* tracing: Move saved_cmdline code into trace_sched_switch.c

Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7
# 9876cfe8 14-Aug-2023 Aleksa Sarai <[email protected]>

memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy

This sysctl has the very unusual behaviour of not allowing any user (even
CAP_SYS_ADMIN) to reduce the restriction setting, mean

memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy

This sysctl has the very unusual behaviour of not allowing any user (even
CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were
to set this sysctl to a more restrictive option in the host pidns you
would need to reboot your machine in order to reset it.

The justification given in [1] is that this is a security feature and thus
it should not be possible to disable. Aside from the fact that we have
plenty of security-related sysctls that can be disabled after being
enabled (fs.protected_symlinks for instance), the protection provided by
the sysctl is to stop users from being able to create a binary and then
execute it. A user with CAP_SYS_ADMIN can trivially do this without
memfd_create(2):

% cat mount-memfd.c
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/mount.h>

#define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"

int main(void)
{
int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
assert(fsfd >= 0);
assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2));

int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
assert(dfd >= 0);

int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0782);
assert(execfd >= 0);
assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
assert(!close(execfd));

char *execpath = NULL;
char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
assert(execfd >= 0);
assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
assert(!execve(execpath, argv, envp));
}
% ./mount-memfd
this file was executed from this totally private tmpfs: /proc/self/fd/5
%

Given that it is possible for CAP_SYS_ADMIN users to create executable
binaries without memfd_create(2) and without touching the host filesystem
(not to mention the many other things a CAP_SYS_ADMIN process would be
able to do that would be equivalent or worse), it seems strange to cause a
fair amount of headache to admins when there doesn't appear to be an
actual security benefit to blocking this. There appear to be concerns
about confused-deputy-esque attacks[2] but a confused deputy that can
write to arbitrary sysctls is a bigger security issue than executable
memfds.

/* New API */

The primary requirement from the original author appears to be more based
on the need to be able to restrict an entire system in a hierarchical
manner[3], such that child namespaces cannot re-enable executable memfds.

So, implement that behaviour explicitly -- the vm.memfd_noexec scope is
evaluated up the pidns tree to &init_pid_ns and you have the most
restrictive value applied to you. The new lower limit you can set
vm.memfd_noexec is whatever limit applies to your parent.

Note that a pidns will inherit a copy of the parent pidns's effective
vm.memfd_noexec setting at unshare() time. This matches the existing
behaviour, and it also ensures that a pidns will never have its
vm.memfd_noexec setting *lowered* behind its back (but it will be raised
if the parent raises theirs).

/* Backwards Compatibility */

As the previous version of the sysctl didn't allow you to lower the
setting at all, there are no backwards compatibility issues with this
aspect of the change.

However it should be noted that now that the setting is completely
hierarchical. Previously, a cloned pidns would just copy the current
pidns setting, meaning that if the parent's vm.memfd_noexec was changed it
wouldn't propoagate to existing pid namespaces. Now, the restriction
applies recursively. This is a uAPI change, however:

* The sysctl is very new, having been merged in 6.3.
* Several aspects of the sysctl were broken up until this patchset and
the other patchset by Jeff Xu last month.

And thus it seems incredibly unlikely that any real users would run into
this issue. In the worst case, if this causes userspace isues we could
make it so that modifying the setting follows the hierarchical rules but
the restriction checking uses the cached copy.

[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
[2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/
[3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Daniel Verkamp <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


# 202e1422 14-Aug-2023 Aleksa Sarai <[email protected]>

memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2

Given the difficulty of auditing all of userspace to figure out whether
every memfd_create() user has switched to passing MFD_EX

memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2

Given the difficulty of auditing all of userspace to figure out whether
every memfd_create() user has switched to passing MFD_EXEC and
MFD_NOEXEC_SEAL flags, it seems far less distruptive to make it possible
for older programs that don't make use of executable memfds to run under
vm.memfd_noexec=2. Otherwise, a small dependency change can result in
spurious errors. For programs that don't use executable memfds, passing
MFD_NOEXEC_SEAL is functionally a no-op and thus having the same

In addition, every failure under vm.memfd_noexec=2 needs to print to the
kernel log so that userspace can figure out where the error came from.
The concerns about pr_warn_ratelimited() spam that caused the switch to
pr_warn_once()[1,2] do not apply to the vm.memfd_noexec=2 case.

This is a user-visible API change, but as it allows programs to do
something that would be blocked before, and the sysctl itself was broken
and recently released, it seems unlikely this will cause any issues.

[1]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
[2]: https://lore.kernel.org/202212161233.85C9783FB@keescook/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Daniel Verkamp <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1
# 105ff533 15-Dec-2022 Jeff Xu <[email protected]>

mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC

The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set
executable bit at creation time (memfd_create).

When MFD_NOEXEC_SEAL is set, memfd is

mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC

The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set
executable bit at creation time (memfd_create).

When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
(mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to be
executable (mode: 0777) after creation.

when MFD_EXEC flag is set, memfd is created with executable bit
(mode:0777), this is the same as the old behavior of memfd_create.

The new pid namespaced sysctl vm.memfd_noexec has 3 values:
0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
MFD_EXEC was set.
1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
MFD_NOEXEC_SEAL was set.
2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.

The sysctl allows finer control of memfd_create for old-software that
doesn't set the executable bit, for example, a container with
vm.memfd_noexec=1 means the old-software will create non-executable memfd
by default. Also, the value of memfd_noexec is passed to child namespace
at creation time. For example, if the init namespace has
vm.memfd_noexec=2, all its children namespaces will be created with 2.

[[email protected]: add stub functions to fix build]
[[email protected]: remove unneeded register_pid_ns_ctl_table_vm() stub, per Jeff]
[[email protected]: s/pr_warn_ratelimited/pr_warn_once/, per review]
[[email protected]: fix CONFIG_SYSCTL=n warning]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jeff Xu <[email protected]>
Co-developed-by: Daniel Verkamp <[email protected]>
Signed-off-by: Daniel Verkamp <[email protected]>
Reported-by: kernel test robot <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: David Herrmann <[email protected]>
Cc: Dmitry Torokhov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jorge Lucangeli Obes <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2
# d7e4f854 26-Jan-2022 Leo Yan <[email protected]>

pid: Introduce helper task_is_in_init_pid_ns()

Currently the kernel uses open code in multiple places to check if a
task is in the root PID namespace with the kind of format:

if (task_active_pid_

pid: Introduce helper task_is_in_init_pid_ns()

Currently the kernel uses open code in multiple places to check if a
task is in the root PID namespace with the kind of format:

if (task_active_pid_ns(current) == &init_pid_ns)
do_something();

This patch creates a new helper function, task_is_in_init_pid_ns(), it
returns true if a passed task is in the root PID namespace, otherwise
returns false. So it will be used to replace open codes.

Suggested-by: Suzuki K Poulose <[email protected]>
Signed-off-by: Leo Yan <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Acked-by: Suzuki K Poulose <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1, v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1
# 8eb71d95 03-Aug-2020 Kirill Tkhai <[email protected]>

pid: Use generic ns_common::count

Switch over pid namespaces to use the newly introduced common lifetime
counter.

Currently every namespace type has its own lifetime counter which is stored
in the

pid: Use generic ns_common::count

Switch over pid namespaces to use the newly introduced common lifetime
counter.

Currently every namespace type has its own lifetime counter which is stored
in the specific namespace struct. The lifetime counters are used
identically for all namespaces types. Namespaces may of course have
additional unrelated counters and these are not altered.

This introduces a common lifetime counter into struct ns_common. The
ns_common struct encompasses information that all namespaces share. That
should include the lifetime counter since its common for all of them.

It also allows us to unify the type of the counters across all namespaces.
Most of them use refcount_t but one uses atomic_t and at least one uses
kref. Especially the last one doesn't make much sense since it's just a
wrapper around refcount_t since 2016 and actually complicates cleanup
operations by having to use container_of() to cast the correct namespace
struct out of struct ns_common.

Having the lifetime counter for the namespaces in one place reduces
maintenance cost. Not just because after switching all namespaces over we
will have removed more code than we added but also because the logic is
more easily understandable and we indicate to the user that the basic
lifetime requirements for all namespaces are currently identical.

Signed-off-by: Kirill Tkhai <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/159644979226.604812.7512601754841882036.stgit@localhost.localdomain
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5, v5.8-rc4, v5.8-rc3, v5.8-rc2, v5.8-rc1, v5.7, v5.7-rc7, v5.7-rc6, v5.7-rc5, v5.7-rc4, v5.7-rc3, v5.7-rc2
# fa10fed3 19-Apr-2020 Alexey Gladkov <[email protected]>

proc: allow to mount many instances of proc in one pid namespace

This patch allows to have multiple procfs instances inside the
same pid namespace. The aim here is lightweight sandboxes, and to allo

proc: allow to mount many instances of proc in one pid namespace

This patch allows to have multiple procfs instances inside the
same pid namespace. The aim here is lightweight sandboxes, and to allow
that we have to modernize procfs internals.

1) The main aim of this work is to have on embedded systems one
supervisor for apps. Right now we have some lightweight sandbox support,
however if we create pid namespacess we have to manages all the
processes inside too, where our goal is to be able to run a bunch of
apps each one inside its own mount namespace without being able to
notice each other. We only want to use mount namespaces, and we want
procfs to behave more like a real mount point.

2) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, any remount or modification of 'hidepid' will propagate
to all other procfs mounts.

This also does not allow to support Yama LSM easily in desktop and user
sessions. Yama ptrace scope which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() on
/proc/<pids>/ to always run inside that new procfs instances. This will
allow to specifiy on user sessions if we should populate procfs with
pids that the user can ptrace or not.

By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc, they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others are already using Yama to restrict which processes can be
ptracable. With this change this will give the possibility to restrict
/proc/<pids>/ but more importantly this will give desktop users a
generic and usuable way to specifiy which users should see all processes
and which users can not.

Side notes:
* This covers the lack of seccomp where it is not able to parse
arguments, it is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change LSMs should be able to analyze
open/read/write/close...

In the new patch set version I removed the 'newinstance' option
as suggested by Eric W. Biederman.

Selftest has been added to verify new behavior.

Signed-off-by: Alexey Gladkov <[email protected]>
Reviewed-by: Alexey Dobriyan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>

show more ...


Revision tags: v5.7-rc1, v5.6, v5.6-rc7, v5.6-rc6, v5.6-rc5, v5.6-rc4, v5.6-rc3
# 69879c01 20-Feb-2020 Eric W. Biederman <[email protected]>

proc: Remove the now unnecessary internal mount of proc

There remains no more code in the kernel using pids_ns->proc_mnt,
therefore remove it from the kernel.

The big benefit of this change is that

proc: Remove the now unnecessary internal mount of proc

There remains no more code in the kernel using pids_ns->proc_mnt,
therefore remove it from the kernel.

The big benefit of this change is that one of the most error prone and
tricky parts of the pid namespace implementation, maintaining kernel
mounts of proc is removed.

In addition removing the unnecessary complexity of the kernel mount
fixes a regression that caused the proc mount options to be ignored.
Now that the initial mount of proc comes from userspace, those mount
options are again honored. This fixes Android's usage of the proc
hidepid option.

Reported-by: Alistair Strachan <[email protected]>
Fixes: e94591d0d90c ("proc: Convert proc_mount to use mount_ns.")
Signed-off-by: "Eric W. Biederman" <[email protected]>

show more ...


Revision tags: v5.6-rc2, v5.6-rc1, v5.5, v5.5-rc7, v5.5-rc6, v5.5-rc5, v5.5-rc4, v5.5-rc3, v5.5-rc2, v5.5-rc1, v5.4, v5.4-rc8
# 49cb2fc4 15-Nov-2019 Adrian Reber <[email protected]>

fork: extend clone3() to support setting a PID

The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. I

fork: extend clone3() to support setting a PID

The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.

The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.

To create a process with the following PIDs:

PID NS level Requested PID
0 (host) 31496
1 42
2 1

For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:

set_tid[0] = 1;
set_tid[1] = 42;
set_tid[2] = 31496;
set_tid_size = 3;

If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:

set_tid[0] = 1;
set_tid[1] = 42;
set_tid_size = 2;

The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.

The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.

set_tid_size cannot be larger then the current PID namespace level.

Signed-off-by: Adrian Reber <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Reviewed-by: Dmitry Safonov <[email protected]>
Acked-by: Andrei Vagin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v5.4-rc7, v5.4-rc6, v5.4-rc5, v5.4-rc4, v5.4-rc3, v5.4-rc2, v5.4-rc1, v5.3, v5.3-rc8, v5.3-rc7, v5.3-rc6, v5.3-rc5, v5.3-rc4, v5.3-rc3, v5.3-rc2, v5.3-rc1, v5.2, v5.2-rc7, v5.2-rc6, v5.2-rc5, v5.2-rc4, v5.2-rc3, v5.2-rc2, v5.2-rc1, v5.1, v5.1-rc7, v5.1-rc6, v5.1-rc5, v5.1-rc4, v5.1-rc3, v5.1-rc2, v5.1-rc1, v5.0, v5.0-rc8, v5.0-rc7, v5.0-rc6, v5.0-rc5, v5.0-rc4, v5.0-rc3, v5.0-rc2, v5.0-rc1, v4.20, v4.20-rc7, v4.20-rc6, v4.20-rc5, v4.20-rc4, v4.20-rc3, v4.20-rc2, v4.20-rc1, v4.19, v4.19-rc8, v4.19-rc7, v4.19-rc6, v4.19-rc5, v4.19-rc4, v4.19-rc3, v4.19-rc2, v4.19-rc1, v4.18, v4.18-rc8, v4.18-rc7, v4.18-rc6, v4.18-rc5, v4.18-rc4, v4.18-rc3, v4.18-rc2, v4.18-rc1, v4.17, v4.17-rc7, v4.17-rc6, v4.17-rc5, v4.17-rc4, v4.17-rc3, v4.17-rc2, v4.17-rc1, v4.16, v4.16-rc7, v4.16-rc6, v4.16-rc5, v4.16-rc4, v4.16-rc3, v4.16-rc2, v4.16-rc1, v4.15, v4.15-rc9, v4.15-rc8, v4.15-rc7, v4.15-rc6, v4.15-rc5, v4.15-rc4, v4.15-rc3, v4.15-rc2, v4.15-rc1
# e8cfbc24 17-Nov-2017 Gargi Sharma <[email protected]>

pid: remove pidhash

pidhash is no longer required as all the information can be looked up
from idr tree. nr_hashed represented the number of pids that had been
hashed. Since, nr_hashed and PIDNS_H

pid: remove pidhash

pidhash is no longer required as all the information can be looked up
from idr tree. nr_hashed represented the number of pids that had been
hashed. Since, nr_hashed and PIDNS_HASH_ADDING are no longer relevant,
it has been renamed to pid_allocated and PIDNS_ADDING respectively.

[[email protected]: v6]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Gargi Sharma <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Tested-by: Tony Luck <[email protected]> [ia64]
Cc: Julia Lawall <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


# 95846ecf 17-Nov-2017 Gargi Sharma <[email protected]>

pid: replace pid bitmap implementation with IDR API

Patch series "Replacing PID bitmap implementation with IDR API", v4.

This series replaces kernel bitmap implementation of PID allocation with
IDR

pid: replace pid bitmap implementation with IDR API

Patch series "Replacing PID bitmap implementation with IDR API", v4.

This series replaces kernel bitmap implementation of PID allocation with
IDR API. These patches are written to simplify the kernel by replacing
custom code with calls to generic code.

The following are the stats for pid and pid_namespace object files
before and after the replacement. There is a noteworthy change between
the IDR and bitmap implementation.

Before
text data bss dec hex filename
8447 3894 64 12405 3075 kernel/pid.o
After
text data bss dec hex filename
3397 304 0 3701 e75 kernel/pid.o

Before
text data bss dec hex filename
5692 1842 192 7726 1e2e kernel/pid_namespace.o
After
text data bss dec hex filename
2854 216 16 3086 c0e kernel/pid_namespace.o

The following are the stats for ps, pstree and calling readdir on /proc
for 10,000 processes.

ps:
With IDR API With bitmap
real 0m1.479s 0m2.319s
user 0m0.070s 0m0.060s
sys 0m0.289s 0m0.516s

pstree:
With IDR API With bitmap
real 0m1.024s 0m1.794s
user 0m0.348s 0m0.612s
sys 0m0.184s 0m0.264s

proc:
With IDR API With bitmap
real 0m0.059s 0m0.074s
user 0m0.000s 0m0.004s
sys 0m0.016s 0m0.016s

This patch (of 2):

Replace the current bitmap implementation for Process ID allocation.
Functions that are no longer required, for example, free_pidmap(),
alloc_pidmap(), etc. are removed. The rest of the functions are
modified to use the IDR API. The change was made to make the PID
allocation less complex by replacing custom code with calls to generic
API.

[[email protected]: v6]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: restore the old behaviour of the ns_last_pid sysctl]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Gargi Sharma <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Cc: Julia Lawall <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


Revision tags: v4.14, v4.14-rc8
# b2441318 01-Nov-2017 Greg Kroah-Hartman <[email protected]>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <[email protected]>
Reviewed-by: Philippe Ombredanne <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

show more ...


Revision tags: v4.14-rc7, v4.14-rc6, v4.14-rc5, v4.14-rc4, v4.14-rc3, v4.14-rc2, v4.14-rc1, v4.13, v4.13-rc7, v4.13-rc6, v4.13-rc5, v4.13-rc4, v4.13-rc3, v4.13-rc2, v4.13-rc1, v4.12, v4.12-rc7, v4.12-rc6, v4.12-rc5, v4.12-rc4, v4.12-rc3, v4.12-rc2, v4.12-rc1, v4.11, v4.11-rc8, v4.11-rc7, v4.11-rc6, v4.11-rc5, v4.11-rc4, v4.11-rc3, v4.11-rc2, v4.11-rc1, v4.10, v4.10-rc8, v4.10-rc7, v4.10-rc6, v4.10-rc5, v4.10-rc4, v4.10-rc3, v4.10-rc2, v4.10-rc1, v4.9, v4.9-rc8, v4.9-rc7, v4.9-rc6, v4.9-rc5, v4.9-rc4, v4.9-rc3
# 3859a271 28-Oct-2016 Kees Cook <[email protected]>

randstruct: Mark various structs for randomization

This marks many critical kernel structures for randomization. These are
structures that have been targeted in the past in security exploits, or
con

randstruct: Mark various structs for randomization

This marks many critical kernel structures for randomization. These are
structures that have been targeted in the past in security exploits, or
contain functions pointers, pointers to function pointer tables, lists,
workqueues, ref-counters, credentials, permissions, or are otherwise
sensitive. This initial list was extracted from Brad Spengler/PaX Team's
code in the last public patch of grsecurity/PaX based on my understanding
of the code. Changes or omissions from the original code are mine and
don't reflect the original grsecurity/PaX code.

Left out of this list is task_struct, which requires special handling
and will be covered in a subsequent patch.

Signed-off-by: Kees Cook <[email protected]>

show more ...


# 796f571b 24-Feb-2017 Lafcadio Wluiki <[email protected]>

procfs: use an enum for possible hidepid values

Previously, the hidepid parameter was checked by comparing literal
integers 0, 1, 2. Let's add a proper enum for this, to make the
checking more expr

procfs: use an enum for possible hidepid values

Previously, the hidepid parameter was checked by comparing literal
integers 0, 1, 2. Let's add a proper enum for this, to make the
checking more expressive:

0 → HIDEPID_OFF
1 → HIDEPID_NO_ACCESS
2 → HIDEPID_INVISIBLE

This changes the internal labelling only, the userspace-facing interface
remains unmodified, and still works with literal integers 0, 1, 2.

No functional changes.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Lafcadio Wluiki <[email protected]>
Signed-off-by: Djalal Harouni <[email protected]>
Acked-by: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


Revision tags: v4.9-rc2, v4.9-rc1, v4.8, v4.8-rc8, v4.8-rc7, v4.8-rc6, v4.8-rc5, v4.8-rc4, v4.8-rc3, v4.8-rc2
# f333c700 08-Aug-2016 Eric W. Biederman <[email protected]>

pidns: Add a limit on the number of pid namespaces

Acked-by: Kees Cook <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>


Revision tags: v4.8-rc1, v4.7, v4.7-rc7, v4.7-rc6, v4.7-rc5, v4.7-rc4, v4.7-rc3, v4.7-rc2, v4.7-rc1, v4.6, v4.6-rc7, v4.6-rc6, v4.6-rc5, v4.6-rc4, v4.6-rc3, v4.6-rc2, v4.6-rc1, v4.5, v4.5-rc7, v4.5-rc6, v4.5-rc5, v4.5-rc4, v4.5-rc3, v4.5-rc2, v4.5-rc1, v4.4, v4.4-rc8, v4.4-rc7, v4.4-rc6, v4.4-rc5, v4.4-rc4, v4.4-rc3, v4.4-rc2, v4.4-rc1, v4.3, v4.3-rc7, v4.3-rc6, v4.3-rc5, v4.3-rc4, v4.3-rc3, v4.3-rc2, v4.3-rc1, v4.2, v4.2-rc8, v4.2-rc7, v4.2-rc6, v4.2-rc5, v4.2-rc4, v4.2-rc3, v4.2-rc2, v4.2-rc1, v4.1, v4.1-rc8, v4.1-rc7, v4.1-rc6, v4.1-rc5, v4.1-rc4, v4.1-rc3, v4.1-rc2, v4.1-rc1, v4.0, v4.0-rc7, v4.0-rc6, v4.0-rc5, v4.0-rc4, v4.0-rc3, v4.0-rc2, v4.0-rc1, v3.19, v3.19-rc7, v3.19-rc6, v3.19-rc5, v3.19-rc4
# 59eda0e0 10-Jan-2015 Al Viro <[email protected]>

new fs_pin killing logics

Signed-off-by: Al Viro <[email protected]>


Revision tags: v3.19-rc3, v3.19-rc2, v3.19-rc1, v3.18, v3.18-rc7, v3.18-rc6, v3.18-rc5, v3.18-rc4, v3.18-rc3
# 435d5f4b 01-Nov-2014 Al Viro <[email protected]>

common object embedded into various struct ....ns

for now - just move corresponding ->proc_inum instances over there

Acked-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Al Viro <vi

common object embedded into various struct ....ns

for now - just move corresponding ->proc_inum instances over there

Acked-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Al Viro <[email protected]>

show more ...


Revision tags: v3.18-rc2, v3.18-rc1, v3.17, v3.17-rc7, v3.17-rc6, v3.17-rc5, v3.17-rc4, v3.17-rc3, v3.17-rc2, v3.17-rc1, v3.16
# 0097875b 31-Jul-2014 Eric W. Biederman <[email protected]>

proc: Implement /proc/thread-self to point at the directory of the current thread

/proc/thread-self is derived from /proc/self. /proc/thread-self
points to the directory in proc containing informat

proc: Implement /proc/thread-self to point at the directory of the current thread

/proc/thread-self is derived from /proc/self. /proc/thread-self
points to the directory in proc containing information about the
current thread.

This funtionality has been missing for a long time, and is tricky to
implement in userspace as gettid() is not exported by glibc. More
importantly this allows fixing defects in /proc/mounts and /proc/net
where in a threaded application today they wind up being empty files
when only the initial pthread has exited, causing problems for other
threads.

Signed-off-by: "Eric W. Biederman" <[email protected]>

show more ...


Revision tags: v3.16-rc7, v3.16-rc6, v3.16-rc5, v3.16-rc4, v3.16-rc3, v3.16-rc2, v3.16-rc1, v3.15, v3.15-rc8, v3.15-rc7, v3.15-rc6, v3.15-rc5, v3.15-rc4, v3.15-rc3, v3.15-rc2, v3.15-rc1, v3.14, v3.14-rc8, v3.14-rc7, v3.14-rc6, v3.14-rc5, v3.14-rc4, v3.14-rc3, v3.14-rc2, v3.14-rc1, v3.13, v3.13-rc8, v3.13-rc7, v3.13-rc6, v3.13-rc5, v3.13-rc4, v3.13-rc3, v3.13-rc2, v3.13-rc1, v3.12, v3.12-rc7, v3.12-rc6, v3.12-rc5, v3.12-rc4
# 1adfcb03 03-Oct-2013 Al Viro <[email protected]>

pid_namespace: make freeing struct pid_namespace rcu-delayed

makes procfs ->premission() instances safety in RCU mode independent
from vfsmount_lock.

Signed-off-by: Al Viro <[email protected]>


Revision tags: v3.12-rc3, v3.12-rc2, v3.12-rc1, v3.11, v3.11-rc7, v3.11-rc6, v3.11-rc5, v3.11-rc4, v3.11-rc3, v3.11-rc2, v3.11-rc1, v3.10, v3.10-rc7, v3.10-rc6, v3.10-rc5, v3.10-rc4, v3.10-rc3, v3.10-rc2, v3.10-rc1
# a27bb332 07-May-2013 Kent Overstreet <[email protected]>

aio: don't include aio.h in sched.h

Faster kernel compiles by way of fewer unnecessary includes.

[[email protected]: fix fallout]
[[email protected]: fix build]
Signed-off-by: Kent

aio: don't include aio.h in sched.h

Faster kernel compiles by way of fewer unnecessary includes.

[[email protected]: fix fallout]
[[email protected]: fix build]
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


# 5cc54451 30-Apr-2013 Raphael S.Carvalho <[email protected]>

pid_namespace.c/.h: simplify defines

Move BITS_PER_PAGE from pid_namespace.c to pid_namespace.h, since we can
simplify the define PID_MAP_ENTRIES by using the BITS_PER_PAGE.

[akpm@linux-foundation.

pid_namespace.c/.h: simplify defines

Move BITS_PER_PAGE from pid_namespace.c to pid_namespace.h, since we can
simplify the define PID_MAP_ENTRIES by using the BITS_PER_PAGE.

[[email protected]: kernel/pid.c:54:1: warning: "BITS_PER_PAGE" redefined]
Signed-off-by: Raphael S.Carvalho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


Revision tags: v3.9, v3.9-rc8, v3.9-rc7, v3.9-rc6, v3.9-rc5
# 021ada7d 29-Mar-2013 Al Viro <[email protected]>

procfs: switch /proc/self away from proc_dir_entry

Just have it pinned in dcache all along and let procfs ->kill_sb()
drop it before kill_anon_super().

Signed-off-by: Al Viro <[email protected].

procfs: switch /proc/self away from proc_dir_entry

Just have it pinned in dcache all along and let procfs ->kill_sb()
drop it before kill_anon_super().

Signed-off-by: Al Viro <[email protected]>

show more ...


Revision tags: v3.9-rc4, v3.9-rc3, v3.9-rc2, v3.9-rc1, v3.8, v3.8-rc7, v3.8-rc6, v3.8-rc5, v3.8-rc4, v3.8-rc3, v3.8-rc2
# c876ad76 22-Dec-2012 Eric W. Biederman <[email protected]>

pidns: Stop pid allocation when init dies

Oleg pointed out that in a pid namespace the sequence.
- pid 1 becomes a zombie
- setns(thepidns), fork,...
- reaping pid 1.
- The injected processes exitin

pidns: Stop pid allocation when init dies

Oleg pointed out that in a pid namespace the sequence.
- pid 1 becomes a zombie
- setns(thepidns), fork,...
- reaping pid 1.
- The injected processes exiting.

Can lead to processes attempting access their child reaper and
instead following a stale pointer.

That waitpid for init can return before all of the processes in
the pid namespace have exited is also unfortunate.

Avoid these problems by disabling the allocation of new pids in a pid
namespace when init dies, instead of when the last process in a pid
namespace is reaped.

Pointed-out-by: Oleg Nesterov <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>

show more ...


Revision tags: v3.8-rc1, v3.7, v3.7-rc8, v3.7-rc7, v3.7-rc6, v3.7-rc5, v3.7-rc4, v3.7-rc3, v3.7-rc2, v3.7-rc1, v3.6, v3.6-rc7, v3.6-rc6, v3.6-rc5, v3.6-rc4, v3.6-rc3, v3.6-rc2, v3.6-rc1, v3.5, v3.5-rc7, v3.5-rc6, v3.5-rc5, v3.5-rc4, v3.5-rc3, v3.5-rc2, v3.5-rc1, v3.4, v3.4-rc7, v3.4-rc6, v3.4-rc5, v3.4-rc4, v3.4-rc3, v3.4-rc2, v3.4-rc1, v3.3, v3.3-rc7, v3.3-rc6, v3.3-rc5, v3.3-rc4, v3.3-rc3, v3.3-rc2, v3.3-rc1, v3.2, v3.2-rc7, v3.2-rc6, v3.2-rc5, v3.2-rc4, v3.2-rc3, v3.2-rc2, v3.2-rc1, v3.1, v3.1-rc10, v3.1-rc9, v3.1-rc8, v3.1-rc7, v3.1-rc6, v3.1-rc5, v3.1-rc4, v3.1-rc3, v3.1-rc2, v3.1-rc1, v3.0, v3.0-rc7, v3.0-rc6, v3.0-rc5, v3.0-rc4
# 98f842e6 15-Jun-2011 Eric W. Biederman <[email protected]>

proc: Usable inode numbers for the namespace file descriptors.

Assign a unique proc inode to each namespace, and use that
inode number to ensure we only allocate at most one proc
inode for every nam

proc: Usable inode numbers for the namespace file descriptors.

Assign a unique proc inode to each namespace, and use that
inode number to ensure we only allocate at most one proc
inode for every namespace in proc.

A single proc inode per namespace allows userspace to test
to see if two processes are in the same namespace.

This has been a long requested feature and only blocked because
a naive implementation would put the id in a global space and
would ultimately require having a namespace for the names of
namespaces, making migration and certain virtualization tricks
impossible.

We still don't have per superblock inode numbers for proc, which
appears necessary for application unaware checkpoint/restart and
migrations (if the application is using namespace file descriptors)
but that is now allowd by the design if it becomes important.

I have preallocated the ipc and uts initial proc inode numbers so
their structures can be statically initialized.

Signed-off-by: Eric W. Biederman <[email protected]>

show more ...


# 0a01f2cc 01-Aug-2012 Eric W. Biederman <[email protected]>

pidns: Make the pidns proc mount/umount logic obvious.

Track the number of pids in the proc hash table. When the number of
pids goes to 0 schedule work to unmount the kernel mount of proc.

Move th

pidns: Make the pidns proc mount/umount logic obvious.

Track the number of pids in the proc hash table. When the number of
pids goes to 0 schedule work to unmount the kernel mount of proc.

Move the mount of proc into alloc_pid when we allocate the pid for
init.

Remove the surprising calls of pid_ns_release proc in fork and
proc_flush_task. Those code paths really shouldn't know about proc
namespace implementation details and people have demonstrated several
times that finding and understanding those code paths is difficult and
non-obvious.

Because of the call path detach pid is alwasy called with the
rtnl_lock held free_pid is not allowed to sleep, so the work to
unmounting proc is moved to a work queue. This has the side benefit
of not blocking the entire world waiting for the unnecessary
rcu_barrier in deactivate_locked_super.

In the process of making the code clear and obvious this fixes a bug
reported by Gao feng <[email protected]> where we would leak a
mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
succeeded and copy_net_ns failed.

Acked-by: "Serge E. Hallyn" <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>

show more ...


123