|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1 |
|
| #
6348be02 |
| 20-Jul-2024 |
Al Viro <[email protected]> |
fdget(), trivial conversions
fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope.
Reviewed-by: Christian Brauner <[email protected]> Signed
fdget(), trivial conversions
fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope.
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
|
Revision tags: v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2 |
|
| #
1da91ea8 |
| 31-May-2024 |
Al Viro <[email protected]> |
introduce fd_file(), convert all accessors to it.
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are ve
introduce fd_file(), convert all accessors to it.
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict]
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5 |
|
| #
cb12fd8e |
| 12-Feb-2024 |
Christian Brauner <[email protected]> |
pidfd: add pidfs
This moves pidfds from the anonymous inode infrastructure to a tiny pseudo filesystem. This has been on my todo for quite a while as it will unblock further work that we weren't abl
pidfd: add pidfs
This moves pidfds from the anonymous inode infrastructure to a tiny pseudo filesystem. This has been on my todo for quite a while as it will unblock further work that we weren't able to do simply because of the very justified limitations of anonymous inodes. Moving pidfds to a tiny pseudo filesystem allows:
* statx() on pidfds becomes useful for the first time. * pidfds can be compared simply via statx() and then comparing inode numbers. * pidfds have unique inode numbers for the system lifetime. * struct pid is now stashed in inode->i_private instead of file->private_data. This means it is now possible to introduce concepts that operate on a process once all file descriptors have been closed. A concrete example is kill-on-last-close. * file->private_data is freed up for per-file options for pidfds. * Each struct pid will refer to a different inode but the same struct pid will refer to the same inode if it's opened multiple times. In contrast to now where each struct pid refers to the same inode. Even if we were to move to anon_inode_create_getfile() which creates new inodes we'd still be associating the same struct pid with multiple different inodes.
The tiny pseudo filesystem is not visible anywhere in userspace exactly like e.g., pipefs and sockfs. There's no lookup, there's no complex inode operations, nothing. Dentries and inodes are always deleted when the last pidfd is closed.
We allocate a new inode for each struct pid and we reuse that inode for all pidfds. We use iget_locked() to find that inode again based on the inode number which isn't recycled. We allocate a new dentry for each pidfd that uses the same inode. That is similar to anonymous inodes which reuse the same inode for thousands of dentries. For pidfds we're talking way less than that. There usually won't be a lot of concurrent openers of the same struct pid. They can probably often be counted on two hands. I know that systemd does use separate pidfd for the same struct pid for various complex process tracking issues. So I think with that things actually become way simpler. Especially because we don't have to care about lookup. Dentries and inodes continue to be always deleted.
The code is entirely optional and fairly small. If it's not selected we fallback to anonymous inodes. Heavily inspired by nsfs which uses a similar stashing mechanism just for namespaces.
Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7 |
|
| #
2ddd3cac |
| 18-Aug-2023 |
Elena Reshetova <[email protected]> |
nsproxy: Convert nsproxy.count to refcount_t
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() -
nsproxy: Convert nsproxy.count to refcount_t
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable.
The variable nsproxy.count is used as pure reference counter. Convert it to refcount_t and fix up the operations.
**Important note for maintainers:
Some functions from refcount_t API defined in refcount.h have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information.
Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage.
For the nsproxy.count it might make a difference in following places: - put_nsproxy() and switch_task_namespaces(): decrement in refcount_dec_and_test() only provides RELEASE ordering and ACQUIRE ordering on success vs. fully ordered atomic counterpart
Suggested-by: Kees Cook <[email protected]> Signed-off-by: Elena Reshetova <[email protected]> Reviewed-by: David Windsor <[email protected]> Reviewed-by: Hans Liljestrand <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
show more ...
|
|
Revision tags: v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7 |
|
| #
281117cc |
| 14-May-2022 |
Al Viro <[email protected]> |
convert setns(2) to fdget()/fdput()
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
|
| #
2b5f9dad |
| 21-Sep-2022 |
Andrei Vagin <[email protected]> |
fs/exec: switch timens when a task gets a new mm
Changing a time namespace requires remapping a vvar page, so we don't want to allow doing that if any other tasks can use the same mm.
Currently, we
fs/exec: switch timens when a task gets a new mm
Changing a time namespace requires remapping a vvar page, so we don't want to allow doing that if any other tasks can use the same mm.
Currently, we install a time namespace when a task is created with a new vm. exec() is another case when a task gets a new mm and so it can switch a time namespace safely, but it isn't handled now.
One more issue of the current interface is that clone() with CLONE_VM isn't allowed if the current task has unshared a time namespace (timens_for_children doesn't match the current timens).
Both these issues make some inconvenience for users. For example, Alexey and Florian reported that posix_spawn() uses vfork+exec and this pattern doesn't work with time namespaces due to the both described issues. LXC needed to workaround the exec() issue by calling setns.
In the commit 133e2d3e81de5 ("fs/exec: allow to unshare a time namespace on vfork+exec"), we tried to fix these issues with minimal impact on UAPI. But it adds extra complexity and some undesirable side effects. Eric suggested fixing the issues properly because here are all the reasons to suppose that there are no users that depend on the old behavior.
Cc: Alexey Izbyshev <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Florian Weimer <[email protected]> Cc: Kees Cook <[email protected]> Suggested-by: "Eric W. Biederman" <[email protected]> Origin-author: "Eric W. Biederman" <[email protected]> Signed-off-by: Andrei Vagin <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
33a2d6bc |
| 13-Sep-2022 |
Andrei Vagin <[email protected]> |
Revert "fs/exec: allow to unshare a time namespace on vfork+exec"
This reverts commit 133e2d3e81de5d9706cab2dd1d52d231c27382e5.
Alexey pointed out a few undesirable side effects of the reverted cha
Revert "fs/exec: allow to unshare a time namespace on vfork+exec"
This reverts commit 133e2d3e81de5d9706cab2dd1d52d231c27382e5.
Alexey pointed out a few undesirable side effects of the reverted change. First, it doesn't take into account that CLONE_VFORK can be used with CLONE_THREAD. Second, a child process doesn't enter a target time name-space, if its parent dies before the child calls exec. It happens because the parent clears vfork_done.
Eric W. Biederman suggests installing a time namespace as a task gets a new mm. It includes all new processes cloned without CLONE_VM and all tasks that call exec(). This is an user API change, but we think there aren't users that depend on the old behavior.
It is too late to make such changes in this release, so let's roll back this patch and introduce the right one in the next release.
Cc: Alexey Izbyshev <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Florian Weimer <[email protected]> Cc: Kees Cook <[email protected]> Signed-off-by: Andrei Vagin <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
133e2d3e |
| 13-Jun-2022 |
Andrei Vagin <[email protected]> |
fs/exec: allow to unshare a time namespace on vfork+exec
Right now, a new process can't be forked in another time namespace if it shares mm with its parent. It is prohibited, because each time names
fs/exec: allow to unshare a time namespace on vfork+exec
Right now, a new process can't be forked in another time namespace if it shares mm with its parent. It is prohibited, because each time namespace has its own vvar page that is mapped into a process address space.
When a process calls exec, it gets a new mm and so it could be "legal" to switch time namespace in that case. This was not implemented and now if we want to do this, we need to add another clone flag to not break backward compatibility.
We don't have any user requests to switch times on exec except the vfork+exec combination, so there is no reason to add a new clone flag. As for vfork+exec, this should be safe to allow switching timens with the current clone flag. Right now, vfork (CLONE_VFORK | CLONE_VM) fails if a child is forked into another time namespace. With this change, vfork creates a new process in parent's timens, and the following exec does the actual switch to the target time namespace.
Suggested-by: Florian Weimer <[email protected]> Signed-off-by: Andrei Vagin <[email protected]> Acked-by: Christian Brauner (Microsoft) <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1 |
|
| #
30acd0bd |
| 02-Sep-2021 |
Vasily Averin <[email protected]> |
memcg: enable accounting for new namesapces and struct nsproxy
Container admin can create new namespaces and force kernel to allocate up to several pages of memory for the namespaces and its associa
memcg: enable accounting for new namesapces and struct nsproxy
Container admin can create new namespaces and force kernel to allocate up to several pages of memory for the namespaces and its associated structures.
Net and uts namespaces have enabled accounting for such allocations. It makes sense to account for rest ones to restrict the host's memory consumption from inside the memcg-limited container.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vasily Averin <[email protected]> Acked-by: Serge Hallyn <[email protected]> Acked-by: Christian Brauner <[email protected]> Acked-by: Kirill Tkhai <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "J. Bruce Fields" <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Yutian Yang <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4 |
|
| #
aabe19b8 |
| 15-Nov-2020 |
Hui Su <[email protected]> |
nsproxy: use put_nsproxy() in switch_task_namespaces()
We already have a dedicated helper that handles reference count checking so stop open-coding the reference count check in switch_task_namespace
nsproxy: use put_nsproxy() in switch_task_namespaces()
We already have a dedicated helper that handles reference count checking so stop open-coding the reference count check in switch_task_namespaces() and use the dedicated put_nsproxy() helper instead. Take the change to fix a whitespace issue too.
Signed-off-by: Hui Su <[email protected]> [[email protected]: expand commit message] Acked-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/20201115180054.GA371317@rlk Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
5c62634f |
| 17-Nov-2020 |
Hui Su <[email protected]> |
namespace: make timens_on_fork() return nothing
timens_on_fork() always return 0, and maybe not need to judge the return value in copy_namespaces().
So make timens_on_fork() return nothing and do n
namespace: make timens_on_fork() return nothing
timens_on_fork() always return 0, and maybe not need to judge the return value in copy_namespaces().
So make timens_on_fork() return nothing and do not judge its return val in copy_namespaces().
Signed-off-by: Hui Su <[email protected]> Link: https://lore.kernel.org/r/20201117161750.GA45121@rlk Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v5.10-rc3, v5.10-rc2, v5.10-rc1, v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1, v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5 |
|
| #
76c12881 |
| 06-Jul-2020 |
Christian Brauner <[email protected]> |
nsproxy: support CLONE_NEWTIME with setns()
So far setns() was missing time namespace support. This was partially due to it simply not being implemented but also because vdso_join_timens() could sti
nsproxy: support CLONE_NEWTIME with setns()
So far setns() was missing time namespace support. This was partially due to it simply not being implemented but also because vdso_join_timens() could still fail which made switching to multiple namespaces atomically problematic. This is now fixed so support CLONE_NEWTIME with setns()
Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: Andrei Vagin <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Serge Hallyn <[email protected]> Cc: Dmitry Safonov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v5.8-rc4, v5.8-rc3, v5.8-rc2 |
|
| #
e571d4ee |
| 16-Jun-2020 |
Christian Brauner <[email protected]> |
nsproxy: restore EINVAL for non-namespace file descriptor
The LTP testsuite reported a regression where users would now see EBADF returned instead of EINVAL when an fd was passed that referred to an
nsproxy: restore EINVAL for non-namespace file descriptor
The LTP testsuite reported a regression where users would now see EBADF returned instead of EINVAL when an fd was passed that referred to an open file but the file was not a nsfd. Fix this by continuing to report EINVAL.
Reported-by: kernel test robot <[email protected]> Cc: Jan Stancek <[email protected]> Cc: Cyril Hrubis <[email protected]> Link: https://lore.kernel.org/lkml/20200615085836.GR12456@shao2-debian Fixes: 303cc571d107 ("nsproxy: attach to namespaces via pidfds") Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v5.8-rc1, v5.7, v5.7-rc7, v5.7-rc6, v5.7-rc5 |
|
| #
303cc571 |
| 05-May-2020 |
Christian Brauner <[email protected]> |
nsproxy: attach to namespaces via pidfds
For quite a while we have been thinking about using pidfds to attach to namespaces. This patchset has existed for about a year already but we've wanted to wa
nsproxy: attach to namespaces via pidfds
For quite a while we have been thinking about using pidfds to attach to namespaces. This patchset has existed for about a year already but we've wanted to wait to see how the general api would be received and adopted. Now that more and more programs in userspace have started using pidfds for process management it's time to send this one out.
This patch makes it possible to use pidfds to attach to the namespaces of another process, i.e. they can be passed as the first argument to the setns() syscall. When only a single namespace type is specified the semantics are equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However, when a pidfd is passed, multiple namespace flags can be specified in the second setns() argument and setns() will attach the caller to all the specified namespaces all at once or to none of them. Specifying 0 is not valid together with a pidfd.
Here are just two obvious examples: setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET); setns(pidfd, CLONE_NEWUSER); Allowing to also attach subsets of namespaces supports various use-cases where callers setns to a subset of namespaces to retain privilege, perform an action and then re-attach another subset of namespaces.
If the need arises, as Eric suggested, we can extend this patchset to assume even more context than just attaching all namespaces. His suggestion specifically was about assuming the process' root directory when setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just keep it flexible in terms of supporting subsets of namespaces but let's wait until we have users asking for even more context to be assumed. At that point we can add an extension.
The obvious example where this is useful is a standard container manager interacting with a running container: pushing and pulling files or directories, injecting mounts, attaching/execing any kind of process, managing network devices all these operations require attaching to all or at least multiple namespaces at the same time. Given that nowadays most containers are spawned with all namespaces enabled we're currently looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns> nsfds, another 7 to actually perform the namespace switch. With time namespaces we're looking at about 16 syscalls. (We could amortize the first 7 or 8 syscalls for opening the nsfds by stashing them in each container's monitor process but that would mean we need to send around those file descriptors through unix sockets everytime we want to interact with the container or keep on-disk state. Even in scenarios where a caller wants to join a particular namespace in a particular order callers still profit from batching other namespaces. That mostly applies to the user namespace but all container runtimes I found join the user namespace first no matter if it privileges or deprivileges the container similar to how unshare behaves.) With pidfds this becomes a single syscall no matter how many namespaces are supposed to be attached to.
A decently designed, large-scale container manager usually isn't the parent of any of the containers it spawns so the containers don't die when it crashes or needs to update or reinitialize. This means that for the manager to interact with containers through pids is inherently racy especially on systems where the maximum pid number is not significicantly bumped. This is even more problematic since we often spawn and manage thousands or ten-thousands of containers. Interacting with a container through a pid thus can become risky quite quickly. Especially since we allow for an administrator to enable advanced features such as syscall interception where we're performing syscalls in lieu of the container. In all of those cases we use pidfds if they are available and we pass them around as stable references. Using them to setns() to the target process' namespaces is as reliable as using nsfds. Either the target process is already dead and we get ESRCH or we manage to attach to its namespaces but we can't accidently attach to another process' namespaces. So pidfds lend themselves to be used with this api. The other main advantage is that with this change the pidfd becomes the only relevant token for most container interactions and it's the only token we need to create and send around.
Apart from significiantly reducing the number of syscalls from double digit to single digit which is a decent reason post-spectre/meltdown this also allows to switch to a set of namespaces atomically, i.e. either attaching to all the specified namespaces succeeds or we fail. If we fail we haven't changed a single namespace. There are currently three namespaces that can fail (other than for ENOMEM which really is not very interesting since we then have other problems anyway) for non-trivial reasons, user, mount, and pid namespaces. We can fail to attach to a pid namespace if it is not our current active pid namespace or a descendant of it. We can fail to attach to a user namespace because we are multi-threaded or because our current mount namespace shares filesystem state with other tasks, or because we're trying to setns() to the same user namespace, i.e. the target task has the same user namespace as we do. We can fail to attach to a mount namespace because it shares filesystem state with other tasks or because we fail to lookup the new root for the new mount namespace. In most non-pathological scenarios these issues can be somewhat mitigated. But there are cases where we're half-attached to some namespace and failing to attach to another one. I've talked about some of these problem during the hallway track (something only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles in 2018(?). Even if all these issues could be avoided with super careful userspace coding it would be nicer to have this done in-kernel. Pidfds seem to lend themselves nicely for this.
The other neat thing about this is that setns() becomes an actual counterpart to the namespace bits of unshare().
Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: Serge Hallyn <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Serge Hallyn <[email protected]> Cc: Jann Horn <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Aleksa Sarai <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
f2a8d52e |
| 05-May-2020 |
Christian Brauner <[email protected]> |
nsproxy: add struct nsset
Add a simple struct nsset. It holds all necessary pieces to switch to a new set of namespaces without leaving a task in a half-switched state which we will make use of in t
nsproxy: add struct nsset
Add a simple struct nsset. It holds all necessary pieces to switch to a new set of namespaces without leaving a task in a half-switched state which we will make use of in the next patch. This patch switches the existing setns logic over without causing a change in setns() behavior. This brings setns() closer to how unshare() works(). The prepare_ns() function is responsible to prepare all necessary information. This has two reasons. First it minimizes dependencies between individual namespaces, i.e. all install handler can expect that all fields are properly initialized independent in what order they are called in. Second, this makes the code easier to maintain and easier to follow if it needs to be changed.
The prepare_ns() helper will only be switched over to use a flags argument in the next patch. Here it will still use nstype as a simple integer argument which was argued would be clearer. I'm not particularly opinionated about this if it really helps or not. The struct nsset itself already contains the flags field since its name already indicates that it can contain information required by different namespaces. None of this should have functional consequences.
Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: Serge Hallyn <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Serge Hallyn <[email protected]> Cc: Jann Horn <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Aleksa Sarai <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v5.7-rc4, v5.7-rc3, v5.7-rc2, v5.7-rc1, v5.6, v5.6-rc7, v5.6-rc6, v5.6-rc5, v5.6-rc4, v5.6-rc3, v5.6-rc2, v5.6-rc1, v5.5, v5.5-rc7, v5.5-rc6, v5.5-rc5, v5.5-rc4, v5.5-rc3, v5.5-rc2, v5.5-rc1, v5.4, v5.4-rc8 |
|
| #
769071ac |
| 12-Nov-2019 |
Andrei Vagin <[email protected]> |
ns: Introduce Time Namespace
Time Namespace isolates clock values.
The kernel provides access to several clocks CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
CLOCK_REALTIME System-wi
ns: Introduce Time Namespace
Time Namespace isolates clock values.
The kernel provides access to several clocks CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
CLOCK_REALTIME System-wide clock that measures real (i.e., wall-clock) time.
CLOCK_MONOTONIC Clock that cannot be set and represents monotonic time since some unspecified starting point.
CLOCK_BOOTTIME Identical to CLOCK_MONOTONIC, except it also includes any time that the system is suspended.
For many users, the time namespace means the ability to changes date and time in a container (CLOCK_REALTIME). Providing per namespace notions of CLOCK_REALTIME would be complex with a massive overhead, but has a dubious value.
But in the context of checkpoint/restore functionality, monotonic and boottime clocks become interesting. Both clocks are monotonic with unspecified starting points. These clocks are widely used to measure time slices and set timers. After restoring or migrating processes, it has to be guaranteed that they never go backward. In an ideal case, the behavior of these clocks should be the same as for a case when a whole system is suspended. All this means that it is required to set CLOCK_MONOTONIC and CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace offsets for clocks.
A time namespace is similar to a pid namespace in the way how it is created: unshare(CLONE_NEWTIME) system call creates a new time namespace, but doesn't set it to the current process. Then all children of the process will be born in the new time namespace, or a process can use the setns() system call to join a namespace.
This scheme allows setting clock offsets for a namespace, before any processes appear in it.
All available clone flags have been used, so CLONE_NEWTIME uses the highest bit of CSIGNAL. It means that it can be used only with the unshare() and the clone3() system calls.
[ tglx: Adjusted paragraph about clone3() to reality and massaged the changelog a bit. ]
Co-developed-by: Dmitry Safonov <[email protected]> Signed-off-by: Andrei Vagin <[email protected]> Signed-off-by: Dmitry Safonov <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://criu.org/Time_namespace Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v5.4-rc7, v5.4-rc6, v5.4-rc5, v5.4-rc4, v5.4-rc3, v5.4-rc2, v5.4-rc1, v5.3, v5.3-rc8, v5.3-rc7, v5.3-rc6, v5.3-rc5, v5.3-rc4, v5.3-rc3, v5.3-rc2, v5.3-rc1, v5.2, v5.2-rc7, v5.2-rc6, v5.2-rc5, v5.2-rc4, v5.2-rc3 |
|
| #
b886d83c |
| 01-Jun-2019 |
Thomas Gleixner <[email protected]> |
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify it under the terms of th
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation version 2 of the license
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 315 file(s).
Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Allison Randal <[email protected]> Reviewed-by: Armijn Hemel <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
show more ...
|
|
Revision tags: v5.2-rc2, v5.2-rc1, v5.1, v5.1-rc7, v5.1-rc6, v5.1-rc5, v5.1-rc4, v5.1-rc3, v5.1-rc2, v5.1-rc1, v5.0, v5.0-rc8, v5.0-rc7, v5.0-rc6, v5.0-rc5, v5.0-rc4, v5.0-rc3, v5.0-rc2, v5.0-rc1, v4.20, v4.20-rc7, v4.20-rc6, v4.20-rc5, v4.20-rc4, v4.20-rc3, v4.20-rc2, v4.20-rc1, v4.19, v4.19-rc8, v4.19-rc7, v4.19-rc6, v4.19-rc5, v4.19-rc4, v4.19-rc3, v4.19-rc2, v4.19-rc1, v4.18, v4.18-rc8, v4.18-rc7, v4.18-rc6, v4.18-rc5, v4.18-rc4, v4.18-rc3, v4.18-rc2, v4.18-rc1, v4.17, v4.17-rc7, v4.17-rc6, v4.17-rc5, v4.17-rc4, v4.17-rc3, v4.17-rc2, v4.17-rc1, v4.16, v4.16-rc7, v4.16-rc6, v4.16-rc5, v4.16-rc4, v4.16-rc3, v4.16-rc2, v4.16-rc1, v4.15, v4.15-rc9, v4.15-rc8, v4.15-rc7, v4.15-rc6, v4.15-rc5, v4.15-rc4, v4.15-rc3, v4.15-rc2, v4.15-rc1, v4.14, v4.14-rc8, v4.14-rc7, v4.14-rc6, v4.14-rc5, v4.14-rc4, v4.14-rc3, v4.14-rc2, v4.14-rc1, v4.13, v4.13-rc7, v4.13-rc6, v4.13-rc5, v4.13-rc4, v4.13-rc3, v4.13-rc2, v4.13-rc1, v4.12, v4.12-rc7, v4.12-rc6, v4.12-rc5, v4.12-rc4, v4.12-rc3, v4.12-rc2, v4.12-rc1, v4.11, v4.11-rc8, v4.11-rc7, v4.11-rc6, v4.11-rc5, v4.11-rc4, v4.11-rc3, v4.11-rc2 |
|
| #
e4222673 |
| 07-Mar-2017 |
Hari Bathini <[email protected]> |
perf: Add PERF_RECORD_NAMESPACES to include namespaces related info
With the advert of container technologies like docker, that depend on namespaces for isolation, there is a need for tracing suppor
perf: Add PERF_RECORD_NAMESPACES to include namespaces related info
With the advert of container technologies like docker, that depend on namespaces for isolation, there is a need for tracing support for namespaces. This patch introduces new PERF_RECORD_NAMESPACES event for recording namespaces related info. By recording info for every namespace, it is left to userspace to take a call on the definition of a container and trace containers by updating perf tool accordingly.
Each namespace has a combination of device and inode numbers. Though every namespace has the same device number currently, that may change in future to avoid the need for a namespace of namespaces. Considering such possibility, record both device and inode numbers separately for each namespace.
Signed-off-by: Hari Bathini <[email protected]> Acked-by: Jiri Olsa <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Ananth N Mavinakayanahalli <[email protected]> Cc: Aravinda Prasad <[email protected]> Cc: Brendan Gregg <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Eric Biederman <[email protected]> Cc: Sargun Dhillon <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
show more ...
|
|
Revision tags: v4.11-rc1, v4.10, v4.10-rc8, v4.10-rc7, v4.10-rc6, v4.10-rc5, v4.10-rc4, v4.10-rc3, v4.10-rc2, v4.10-rc1, v4.9, v4.9-rc8, v4.9-rc7, v4.9-rc6, v4.9-rc5, v4.9-rc4, v4.9-rc3, v4.9-rc2, v4.9-rc1, v4.8, v4.8-rc8, v4.8-rc7, v4.8-rc6, v4.8-rc5, v4.8-rc4, v4.8-rc3, v4.8-rc2, v4.8-rc1, v4.7, v4.7-rc7, v4.7-rc6, v4.7-rc5, v4.7-rc4, v4.7-rc3, v4.7-rc2, v4.7-rc1, v4.6, v4.6-rc7, v4.6-rc6, v4.6-rc5, v4.6-rc4, v4.6-rc3, v4.6-rc2, v4.6-rc1, v4.5, v4.5-rc7, v4.5-rc6, v4.5-rc5, v4.5-rc4, v4.5-rc3, v4.5-rc2 |
|
| #
a79a908f |
| 29-Jan-2016 |
Aditya Kali <[email protected]> |
cgroup: introduce cgroup namespaces
Introduce the ability to create new cgroup namespace. The newly created cgroup namespace remembers the cgroup of the process at the point of creation of the cgrou
cgroup: introduce cgroup namespaces
Introduce the ability to create new cgroup namespace. The newly created cgroup namespace remembers the cgroup of the process at the point of creation of the cgroup namespace (referred as cgroupns-root). The main purpose of cgroup namespace is to virtualize the contents of /proc/self/cgroup file. Processes inside a cgroup namespace are only able to see paths relative to their namespace root (unless they are moved outside of their cgroupns-root, at which point they will see a relative path from their cgroupns-root). For a correctly setup container this enables container-tools (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized containers without leaking system level cgroup hierarchy to the task. This patch only implements the 'unshare' part of the cgroupns.
Signed-off-by: Aditya Kali <[email protected]> Signed-off-by: Serge Hallyn <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
show more ...
|
|
Revision tags: v4.5-rc1, v4.4, v4.4-rc8, v4.4-rc7, v4.4-rc6, v4.4-rc5, v4.4-rc4, v4.4-rc3, v4.4-rc2, v4.4-rc1, v4.3, v4.3-rc7, v4.3-rc6, v4.3-rc5, v4.3-rc4, v4.3-rc3, v4.3-rc2, v4.3-rc1, v4.2, v4.2-rc8, v4.2-rc7, v4.2-rc6, v4.2-rc5, v4.2-rc4, v4.2-rc3, v4.2-rc2, v4.2-rc1, v4.1, v4.1-rc8, v4.1-rc7, v4.1-rc6, v4.1-rc5, v4.1-rc4, v4.1-rc3, v4.1-rc2, v4.1-rc1, v4.0, v4.0-rc7, v4.0-rc6, v4.0-rc5, v4.0-rc4, v4.0-rc3, v4.0-rc2, v4.0-rc1, v3.19, v3.19-rc7, v3.19-rc6, v3.19-rc5, v3.19-rc4, v3.19-rc3, v3.19-rc2, v3.19-rc1, v3.18, v3.18-rc7, v3.18-rc6, v3.18-rc5, v3.18-rc4, v3.18-rc3 |
|
| #
f77c8014 |
| 01-Nov-2014 |
Al Viro <[email protected]> |
bury struct proc_ns in fs/proc
a) make get_proc_ns() return a pointer to struct ns_common b) mirror ns_ops in dentry->d_fsdata of ns dentries, so that is_mnt_ns_file() could get away with fewer dere
bury struct proc_ns in fs/proc
a) make get_proc_ns() return a pointer to struct ns_common b) mirror ns_ops in dentry->d_fsdata of ns dentries, so that is_mnt_ns_file() could get away with fewer dereferences.
That way struct proc_ns becomes invisible outside of fs/proc/*.c
Signed-off-by: Al Viro <[email protected]>
show more ...
|
| #
33c42940 |
| 01-Nov-2014 |
Al Viro <[email protected]> |
copy address of proc_ns_ops into ns_common
Signed-off-by: Al Viro <[email protected]>
|
|
Revision tags: v3.18-rc2, v3.18-rc1, v3.17, v3.17-rc7, v3.17-rc6, v3.17-rc5, v3.17-rc4, v3.17-rc3, v3.17-rc2, v3.17-rc1, v3.16, v3.16-rc7, v3.16-rc6, v3.16-rc5, v3.16-rc4, v3.16-rc3, v3.16-rc2, v3.16-rc1, v3.15, v3.15-rc8, v3.15-rc7, v3.15-rc6, v3.15-rc5, v3.15-rc4, v3.15-rc3, v3.15-rc2, v3.15-rc1, v3.14, v3.14-rc8, v3.14-rc7, v3.14-rc6, v3.14-rc5, v3.14-rc4, v3.14-rc3, v3.14-rc2 |
|
| #
728dba3a |
| 04-Feb-2014 |
Eric W. Biederman <[email protected]> |
namespaces: Use task_lock and not rcu to protect nsproxy
The synchronous syncrhonize_rcu in switch_task_namespaces makes setns a sufficiently expensive system call that people have complained.
Upon
namespaces: Use task_lock and not rcu to protect nsproxy
The synchronous syncrhonize_rcu in switch_task_namespaces makes setns a sufficiently expensive system call that people have complained.
Upon inspect nsproxy no longer needs rcu protection for remote reads. remote reads are rare. So optimize for same process reads and write by switching using rask_lock instead.
This yields a simpler to understand lock, and a faster setns system call.
In particular this fixes a performance regression observed by Rafael David Tinoco <[email protected]>.
This is effectively a revert of Pavel Emelyanov's commit cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter from 2007. The race this originialy fixed no longer exists as do_notify_parent uses task_active_pid_ns(parent) instead of parent->nsproxy.
Signed-off-by: "Eric W. Biederman" <[email protected]>
show more ...
|
|
Revision tags: v3.14-rc1, v3.13, v3.13-rc8, v3.13-rc7, v3.13-rc6, v3.13-rc5, v3.13-rc4, v3.13-rc3, v3.13-rc2, v3.13-rc1, v3.12, v3.12-rc7, v3.12-rc6, v3.12-rc5, v3.12-rc4, v3.12-rc3, v3.12-rc2, v3.12-rc1, v3.11, v3.11-rc7, v3.11-rc6, v3.11-rc5, v3.11-rc4, v3.11-rc3, v3.11-rc2, v3.11-rc1, v3.10, v3.10-rc7, v3.10-rc6, v3.10-rc5, v3.10-rc4, v3.10-rc3, v3.10-rc2, v3.10-rc1, v3.9, v3.9-rc8, v3.9-rc7, v3.9-rc6, v3.9-rc5, v3.9-rc4, v3.9-rc3, v3.9-rc2 |
|
| #
dbef0c1c |
| 10-Mar-2013 |
Eric W. Biederman <[email protected]> |
namespaces: Simplify copy_namespaces so it is clear what is going on.
Remove the test for the impossible case where tsk->nsproxy == NULL. Fork will never be called with tsk->nsproxy == NULL.
Only
namespaces: Simplify copy_namespaces so it is clear what is going on.
Remove the test for the impossible case where tsk->nsproxy == NULL. Fork will never be called with tsk->nsproxy == NULL.
Only call get_nsproxy when we don't need to generate a new_nsproxy, and mark the case where we don't generate a new nsproxy as likely.
Remove the code to drop an unnecessarily acquired nsproxy value.
Acked-by: Serge Hallyn <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
show more ...
|
| #
c2b1df2e |
| 22-Aug-2013 |
Andy Lutomirski <[email protected]> |
Rename nsproxy.pid_ns to nsproxy.pid_ns_for_children
nsproxy.pid_ns is *not* the task's pid namespace. The name should clarify that.
This makes it more obvious that setns on a pid namespace is wei
Rename nsproxy.pid_ns to nsproxy.pid_ns_for_children
nsproxy.pid_ns is *not* the task's pid namespace. The name should clarify that.
This makes it more obvious that setns on a pid namespace is weird -- it won't change the pid namespace shown in procfs.
Signed-off-by: Andy Lutomirski <[email protected]> Reviewed-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v3.9-rc1 |
|
| #
21e85194 |
| 27-Feb-2013 |
Raphael S.Carvalho <[email protected]> |
kernel/nsproxy.c: Improving a snippet of code.
It seems GCC generates a better code in that way, so I changed that statement. Btw, they have the same semantic, so I'm sending this patch due to perfo
kernel/nsproxy.c: Improving a snippet of code.
It seems GCC generates a better code in that way, so I changed that statement. Btw, they have the same semantic, so I'm sending this patch due to performance issues.
Acked-by: Serge E. Hallyn <[email protected]> Signed-off-by: Raphael S.Carvalho <[email protected]> Signed-off-by: Eric W. Biederman <[email protected]>
show more ...
|