|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6 |
|
| #
d385c8bc |
| 05-Mar-2025 |
Michal Koutný <[email protected]> |
pid: Do not set pid_max in new pid namespaces
It is already difficult for users to troubleshoot which of multiple pid limits restricts their workload. The per-(hierarchical-)NS pid_max would contrib
pid: Do not set pid_max in new pid namespaces
It is already difficult for users to troubleshoot which of multiple pid limits restricts their workload. The per-(hierarchical-)NS pid_max would contribute to the confusion. Also, the implementation copies the limit upon creation from parent, this pattern showed cumbersome with some attributes in legacy cgroup controllers -- it's subject to race condition between parent's limit modification and children creation and once copied it must be changed in the descendant.
Let's do what other places do (ucounts or cgroup limits) -- create new pid namespaces without any limit at all. The global limit (actually any ancestor's limit) is still effectively in place, we avoid the set/unshare race and bumps of global (ancestral) limit have the desired effect on pid namespace that do not care.
Link: https://lore.kernel.org/r/[email protected]/ Link: https://lore.kernel.org/r/[email protected]/ Fixes: 7863dcc72d0f4 ("pid: allow pid_max to be set per pid namespace") Signed-off-by: Michal Koutný <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1 |
|
| #
1751f872 |
| 28-Jan-2025 |
Joel Granados <[email protected]> |
treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysc
treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/inifiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers.
Created this by running an spatch followed by a sed command: Spatch: virtual patch
@ depends on !(file in "net") disable optional_qualifier @
identifier table_name != { watchdog_hardlockup_sysctl, iwcm_ctl_table, ucma_ctl_table, memory_allocation_profiling_sysctls, loadpin_sysctl_table }; @@
+ const struct ctl_table table_name [] = { ... };
sed: sed --in-place \ -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \ kernel/utsname_sysctl.c
Reviewed-by: Song Liu <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]> # for kernel/trace/ Reviewed-by: Martin K. Petersen <[email protected]> # SCSI Reviewed-by: Darrick J. Wong <[email protected]> # xfs Acked-by: Jani Nikula <[email protected]> Acked-by: Corey Minyard <[email protected]> Acked-by: Wei Liu <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Reviewed-by: Bill O'Donnell <[email protected]> Acked-by: Baoquan He <[email protected]> Acked-by: Ashutosh Dixit <[email protected]> Acked-by: Anna Schumaker <[email protected]> Signed-off-by: Joel Granados <[email protected]>
show more ...
|
|
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1 |
|
| #
7863dcc7 |
| 22-Nov-2024 |
Christian Brauner <[email protected]> |
pid: allow pid_max to be set per pid namespace
The pid_max sysctl is a global value. For a long time the default value has been 65535 and during the pidfd dicussions Linus proposed to bump pid_max b
pid: allow pid_max to be set per pid namespace
The pid_max sysctl is a global value. For a long time the default value has been 65535 and during the pidfd dicussions Linus proposed to bump pid_max by default (cf. [1]). Based on this discussion systemd started bumping pid_max to 2^22. So all new systems now run with a very high pid_max limit with some distros having also backported that change. The decision to bump pid_max is obviously correct. It just doesn't make a lot of sense nowadays to enforce such a low pid number. There's sufficient tooling to make selecting specific processes without typing really large pid numbers available.
In any case, there are workloads that have expections about how large pid numbers they accept. Either for historical reasons or architectural reasons. One concreate example is the 32-bit version of Android's bionic libc which requires pid numbers less than 65536. There are workloads where it is run in a 32-bit container on a 64-bit kernel. If the host has a pid_max value greater than 65535 the libc will abort thread creation because of size assumptions of pthread_mutex_t.
That's a fairly specific use-case however, in general specific workloads that are moved into containers running on a host with a new kernel and a new systemd can run into issues with large pid_max values. Obviously making assumptions about the size of the allocated pid is suboptimal but we have userspace that does it.
Of course, giving containers the ability to restrict the number of processes in their respective pid namespace indepent of the global limit through pid_max is something desirable in itself and comes in handy in general.
Independent of motivating use-cases the existence of pid namespaces makes this also a good semantical extension and there have been prior proposals pushing in a similar direction. The trick here is to minimize the risk of regressions which I think is doable. The fact that pid namespaces are hierarchical will help us here.
What we mostly care about is that when the host sets a low pid_max limit, say (crazy number) 100 that no descendant pid namespace can allocate a higher pid number in its namespace. Since pid allocation is hierarchial this can be ensured by checking each pid allocation against the pid namespace's pid_max limit. This means if the allocation in the descendant pid namespace succeeds, the ancestor pid namespace can reject it. If the ancestor pid namespace has a higher limit than the descendant pid namespace the descendant pid namespace will reject the pid allocation. The ancestor pid namespace will obviously not care about this. All in all this means pid_max continues to enforce a system wide limit on the number of processes but allows pid namespaces sufficient leeway in handling workloads with assumptions about pid values and allows containers to restrict the number of processes in a pid namespace through the pid_max interface.
[1]: https://lore.kernel.org/linux-api/CAHk-=wiZ40LVjnXSi9iHLE_-ZBsWFGCgdmNiYZUXn1-V5YBg2g@mail.gmail.com - rebased from 5.14-rc1 - a few fixes (missing ns_free_inum on error path, missing initialization, etc) - permission check changes in pid_table_root_permissions - unsigned int pid_max -> int pid_max (keep pid_max type as it was) - add READ_ONCE in alloc_pid() as suggested by Christian - rebased from 6.7 and take into account: * sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table) * sysctl: treewide: constify ctl_table_header::ctl_table_arg * pidfd: add pidfs * tracing: Move saved_cmdline code into trace_sched_switch.c
Signed-off-by: Alexander Mikhalitsyn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1 |
|
| #
78eb4ea2 |
| 24-Jul-2024 |
Joel Granados <[email protected]> |
sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ct
sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified.
This patch has been generated by the following coccinelle script:
``` virtual patch
@r1@ identifier ctl, write, buffer, lenp, ppos; identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)"; @@
int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos);
@r2@ identifier func, ctl, write, buffer, lenp, ppos; @@
int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos) { ... }
@r3@ identifier func; @@
int func( - struct ctl_table * + const struct ctl_table * ,int , void *, size_t *, loff_t *);
@r4@ identifier func, ctl; @@
int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int , void *, size_t *, loff_t *);
@r5@ identifier func, write, buffer, lenp, ppos; @@
int func( - struct ctl_table * + const struct ctl_table * ,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration.
Co-developed-by: Thomas Weißschuh <[email protected]> Signed-off-by: Thomas Weißschuh <[email protected]> Co-developed-by: Joel Granados <[email protected]> Signed-off-by: Joel Granados <[email protected]>
show more ...
|
|
Revision tags: v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3 |
|
| #
7fea700e |
| 08-Jun-2024 |
Oleg Nesterov <[email protected]> |
zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING
kernel_wait4() doesn't sleep and returns -EINTR if there is no eligible child and signal_pending() is true.
That is why zap_p
zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING
kernel_wait4() doesn't sleep and returns -EINTR if there is no eligible child and signal_pending() is true.
That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending() return false and avoid a busy-wait loop.
Link: https://lkml.kernel.org/r/[email protected] Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") Signed-off-by: Oleg Nesterov <[email protected]> Reported-by: Rachel Menge <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Reviewed-by: Boqun Feng <[email protected]> Tested-by: Wei Fu <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Cc: Allen Pais <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Joel Granados <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Mike Christie <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Cc: Zqiang <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6 |
|
| #
9855c37e |
| 25-Apr-2024 |
Frederic Weisbecker <[email protected]> |
Revert "rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()"
This reverts commit 28319d6dc5e2ffefa452c2377dd0f71621b5bff0. The race it fixed was subject to conditions that don't exist a
Revert "rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()"
This reverts commit 28319d6dc5e2ffefa452c2377dd0f71621b5bff0. The race it fixed was subject to conditions that don't exist anymore since:
1612160b9127 ("rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks")
This latter commit removes the use of SRCU that used to cover the RCU-tasks blind spot on exit between the tasklist's removal and the final preemption disabling. The task is now placed instead into a temporary list inside which voluntary sleeps are accounted as RCU-tasks quiescent states. This would disarm the deadlock initially reported against PID namespace exit.
Signed-off-by: Frederic Weisbecker <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1 |
|
| #
11a92190 |
| 27-Jun-2023 |
Joel Granados <[email protected]> |
kernel misc: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (
kernel misc: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%[email protected]/)
Remove the sentinel from ctl_table arrays. Reduce by one the values used to compare the size of the adjusted arrays.
Signed-off-by: Joel Granados <[email protected]>
show more ...
|
| #
6dfeff09 |
| 11-Dec-2023 |
Matthew Wilcox (Oracle) <[email protected]> |
wait: Remove uapi header file from main header file
There's really no overlap between uapi/linux/wait.h and linux/wait.h. There are two files which rely on the uapi file being implcitly included, so
wait: Remove uapi header file from main header file
There's really no overlap between uapi/linux/wait.h and linux/wait.h. There are two files which rely on the uapi file being implcitly included, so explicitly include it there and remove it from the main header file.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Kent Overstreet <[email protected]> Reviewed-by: Christian Brauner <[email protected]>
show more ...
|
| #
2d57792a |
| 11-Sep-2023 |
Rong Tao <[email protected]> |
pid: pid_ns_ctl_handler: remove useless comment
commit 95846ecf9dac("pid: replace pid bitmap implementation with IDR API") removes 'last_pid' element, and use the idr_get_cursor-idr_set_cursor pair
pid: pid_ns_ctl_handler: remove useless comment
commit 95846ecf9dac("pid: replace pid bitmap implementation with IDR API") removes 'last_pid' element, and use the idr_get_cursor-idr_set_cursor pair to set the value of idr, so useless comments should be removed.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rong Tao <[email protected]> Cc: Aleksa Sarai <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Kees Cook <[email protected]> Cc: Luis Chamberlain <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
9876cfe8 |
| 14-Aug-2023 |
Aleksa Sarai <[email protected]> |
memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
This sysctl has the very unusual behaviour of not allowing any user (even CAP_SYS_ADMIN) to reduce the restriction setting, mean
memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
This sysctl has the very unusual behaviour of not allowing any user (even CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were to set this sysctl to a more restrictive option in the host pidns you would need to reboot your machine in order to reset it.
The justification given in [1] is that this is a security feature and thus it should not be possible to disable. Aside from the fact that we have plenty of security-related sysctls that can be disabled after being enabled (fs.protected_symlinks for instance), the protection provided by the sysctl is to stop users from being able to create a binary and then execute it. A user with CAP_SYS_ADMIN can trivially do this without memfd_create(2):
% cat mount-memfd.c #include <fcntl.h> #include <string.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <linux/mount.h>
#define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"
int main(void) { int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC); assert(fsfd >= 0); assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2));
int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0); assert(dfd >= 0);
int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0782); assert(execfd >= 0); assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE)); assert(!close(execfd));
char *execpath = NULL; char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL }; execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC); assert(execfd >= 0); assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0); assert(!execve(execpath, argv, envp)); } % ./mount-memfd this file was executed from this totally private tmpfs: /proc/self/fd/5 %
Given that it is possible for CAP_SYS_ADMIN users to create executable binaries without memfd_create(2) and without touching the host filesystem (not to mention the many other things a CAP_SYS_ADMIN process would be able to do that would be equivalent or worse), it seems strange to cause a fair amount of headache to admins when there doesn't appear to be an actual security benefit to blocking this. There appear to be concerns about confused-deputy-esque attacks[2] but a confused deputy that can write to arbitrary sysctls is a bigger security issue than executable memfds.
/* New API */
The primary requirement from the original author appears to be more based on the need to be able to restrict an entire system in a hierarchical manner[3], such that child namespaces cannot re-enable executable memfds.
So, implement that behaviour explicitly -- the vm.memfd_noexec scope is evaluated up the pidns tree to &init_pid_ns and you have the most restrictive value applied to you. The new lower limit you can set vm.memfd_noexec is whatever limit applies to your parent.
Note that a pidns will inherit a copy of the parent pidns's effective vm.memfd_noexec setting at unshare() time. This matches the existing behaviour, and it also ensures that a pidns will never have its vm.memfd_noexec setting *lowered* behind its back (but it will be raised if the parent raises theirs).
/* Backwards Compatibility */
As the previous version of the sysctl didn't allow you to lower the setting at all, there are no backwards compatibility issues with this aspect of the change.
However it should be noted that now that the setting is completely hierarchical. Previously, a cloned pidns would just copy the current pidns setting, meaning that if the parent's vm.memfd_noexec was changed it wouldn't propoagate to existing pid namespaces. Now, the restriction applies recursively. This is a uAPI change, however:
* The sysctl is very new, having been merged in 6.3. * Several aspects of the sysctl were broken up until this patchset and the other patchset by Jeff Xu last month.
And thus it seems incredibly unlikely that any real users would run into this issue. In the worst case, if this causes userspace isues we could make it so that modifying the setting follows the hierarchical rules but the restriction checking uses the cached copy.
[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/ [2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/ [3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/
Link: https://lkml.kernel.org/r/[email protected] Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") Signed-off-by: Aleksa Sarai <[email protected]> Cc: Dominique Martinet <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Daniel Verkamp <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Kees Cook <[email protected]> Cc: Shuah Khan <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
| #
dd546618 |
| 01-Jul-2023 |
Christian Brauner <[email protected]> |
pid: use struct_size_t() helper
Before commit d67790ddf021 ("overflow: Add struct_size_t() helper") only struct_size() existed, which expects a valid pointer instance containing the flexible array.
pid: use struct_size_t() helper
Before commit d67790ddf021 ("overflow: Add struct_size_t() helper") only struct_size() existed, which expects a valid pointer instance containing the flexible array.
However, when we determine the default struct pid allocation size for the associated kmem cache of a pid namespace we need to take the nesting depth of the pid namespace into account without an variable instance necessarily being available.
In commit b69f0aeb0689 ("pid: Replace struct pid 1-element array with flex-array") we used to handle this the old fashioned way and cast NULL to a struct pid pointer type. However, we do apparently have a dedicated struct_size_t() helper for exactly this case. So switch to that.
Suggested-by: Kees Cook <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
| #
b69f0aeb |
| 30-Jun-2023 |
Kees Cook <[email protected]> |
pid: Replace struct pid 1-element array with flex-array
For pid namespaces, struct pid uses a dynamically sized array member, "numbers". This was implemented using the ancient 1-element fake flexib
pid: Replace struct pid 1-element array with flex-array
For pid namespaces, struct pid uses a dynamically sized array member, "numbers". This was implemented using the ancient 1-element fake flexible array, which has been deprecated for decades.
Replace it with a C99 flexible array, refactor the array size calculations to use struct_size(), and address elements via indexes. Note that the static initializer (which defines a single element) works as-is, and requires no special handling.
Without this, CONFIG_UBSAN_BOUNDS (and potentially CONFIG_FORTIFY_SOURCE) will trigger bounds checks:
https://lore.kernel.org/lkml/20230517-bushaltestelle-super-e223978c1ba6@brauner
Cc: Christian Brauner <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Daniel Verkamp <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Frederic Weisbecker <[email protected]> Reported-by: [email protected] [brauner: dropped unrelated changes and remove 0 with NULL cast] Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1 |
|
| #
9e7c73c0 |
| 02-Mar-2023 |
Luis Chamberlain <[email protected]> |
kernel: pid_namespace: simplify sysctls with register_sysctl()
register_sysctl_paths() is only required if your child (directories) have entries and pid_namespace does not. So use register_sysctl_in
kernel: pid_namespace: simplify sysctls with register_sysctl()
register_sysctl_paths() is only required if your child (directories) have entries and pid_namespace does not. So use register_sysctl_init() instead where we don't care about the return value and use register_sysctl() where we do.
Signed-off-by: Luis Chamberlain <[email protected]> Acked-by: Jeff Xu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1 |
|
| #
105ff533 |
| 15-Dec-2022 |
Jeff Xu <[email protected]> |
mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set executable bit at creation time (memfd_create).
When MFD_NOEXEC_SEAL is set, memfd is
mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set executable bit at creation time (memfd_create).
When MFD_NOEXEC_SEAL is set, memfd is created without executable bit (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to be executable (mode: 0777) after creation.
when MFD_EXEC flag is set, memfd is created with executable bit (mode:0777), this is the same as the old behavior of memfd_create.
The new pid namespaced sysctl vm.memfd_noexec has 3 values: 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_EXEC was set. 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_NOEXEC_SEAL was set. 2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.
The sysctl allows finer control of memfd_create for old-software that doesn't set the executable bit, for example, a container with vm.memfd_noexec=1 means the old-software will create non-executable memfd by default. Also, the value of memfd_noexec is passed to child namespace at creation time. For example, if the init namespace has vm.memfd_noexec=2, all its children namespaces will be created with 2.
[[email protected]: add stub functions to fix build] [[email protected]: remove unneeded register_pid_ns_ctl_table_vm() stub, per Jeff] [[email protected]: s/pr_warn_ratelimited/pr_warn_once/, per review] [[email protected]: fix CONFIG_SYSCTL=n warning] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jeff Xu <[email protected]> Co-developed-by: Daniel Verkamp <[email protected]> Signed-off-by: Daniel Verkamp <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: David Herrmann <[email protected]> Cc: Dmitry Torokhov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jorge Lucangeli Obes <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.1, v6.1-rc8, v6.1-rc7 |
|
| #
28319d6d |
| 25-Nov-2022 |
Frederic Weisbecker <[email protected]> |
rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()
RCU Tasks and PID-namespace unshare can interact in do_exit() in a complicated circular dependency:
1) TASK A calls unshare(CLONE_NE
rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()
RCU Tasks and PID-namespace unshare can interact in do_exit() in a complicated circular dependency:
1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace that every subsequent child of TASK A will belong to. But TASK A doesn't itself belong to that new PID namespace.
2) TASK A forks() and creates TASK B. TASK A stays attached to its PID namespace (let's say PID_NS1) and TASK B is the first task belonging to the new PID namespace created by unshare() (let's call it PID_NS2).
3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2 child reaper.
4) TASK A forks() again and creates TASK C which get attached to PID_NS2. Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.
5) TASK B exits and since it is the child reaper for PID_NS2, it has to kill all other tasks attached to PID_NS2, and wait for all of them to die before getting reaped itself (zap_pid_ns_process()).
6) TASK A calls synchronize_rcu_tasks() which leads to synchronize_srcu(&tasks_rcu_exit_srcu).
7) TASK B is waiting for TASK C to get reaped. But TASK B is under a tasks_rcu_exit_srcu SRCU critical section (exit_notify() is between exit_tasks_rcu_start() and exit_tasks_rcu_finish()), blocking TASK A.
8) TASK C exits and since TASK A is its parent, it waits for it to reap TASK C, but it can't because TASK A waits for TASK B that waits for TASK C.
Pid_namespace semantics can hardly be changed at this point. But the coverage of tasks_rcu_exit_srcu can be reduced instead.
The current task is assumed not to be concurrently reapable at this stage of exit_notify() and therefore tasks_rcu_exit_srcu can be temporarily relaxed without breaking its constraints, providing a way out of the deadlock scenario.
[ paulmck: Fix build failure by adding additional declaration. ]
Fixes: 3f95aa81d265 ("rcu: Make TASKS_RCU handle tasks that are almost done exiting") Reported-by: Pengfei Xu <[email protected]> Suggested-by: Boqun Feng <[email protected]> Suggested-by: Neeraj Upadhyay <[email protected]> Suggested-by: Paul E. McKenney <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Eric W . Biederman <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
show more ...
|
|
Revision tags: v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5 |
|
| #
c06d7aaf |
| 29-Apr-2022 |
Haowen Bai <[email protected]> |
kernel: pid_namespace: use NULL instead of using plain integer as pointer
This fixes the following sparse warnings: kernel/pid_namespace.c:55:77: warning: Using plain integer as NULL pointer
Link:
kernel: pid_namespace: use NULL instead of using plain integer as pointer
This fixes the following sparse warnings: kernel/pid_namespace.c:55:77: warning: Using plain integer as NULL pointer
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Haowen Bai <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1 |
|
| #
30acd0bd |
| 02-Sep-2021 |
Vasily Averin <[email protected]> |
memcg: enable accounting for new namesapces and struct nsproxy
Container admin can create new namespaces and force kernel to allocate up to several pages of memory for the namespaces and its associa
memcg: enable accounting for new namesapces and struct nsproxy
Container admin can create new namespaces and force kernel to allocate up to several pages of memory for the namespaces and its associated structures.
Net and uts namespaces have enabled accounting for such allocations. It makes sense to account for rest ones to restrict the host's memory consumption from inside the memcg-limited container.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vasily Averin <[email protected]> Acked-by: Serge Hallyn <[email protected]> Acked-by: Christian Brauner <[email protected]> Acked-by: Kirill Tkhai <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "J. Bruce Fields" <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Yutian Yang <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
| #
fab827db |
| 02-Sep-2021 |
Vasily Averin <[email protected]> |
memcg: enable accounting for pids in nested pid namespaces
Commit 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg") enabled memcg accounting for pids allocated from init_pid_ns.pid_
memcg: enable accounting for pids in nested pid namespaces
Commit 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg") enabled memcg accounting for pids allocated from init_pid_ns.pid_cachep, but forgot to adjust the setting for nested pid namespaces. As a result, pid memory is not accounted exactly where it is really needed, inside memcg-limited containers with their own pid namespaces.
Pid was one the first kernel objects enabled for memcg accounting. init_pid_ns.pid_cachep marked by SLAB_ACCOUNT and we can expect that any new pids in the system are memcg-accounted.
Though recently I've noticed that it is wrong. nested pid namespaces creates own slab caches for pid objects, nested pids have increased size because contain id both for all parent and for own pid namespaces. The problem is that these slab caches are _NOT_ marked by SLAB_ACCOUNT, as a result any pids allocated in nested pid namespaces are not memcg-accounted.
Pid struct in nested pid namespace consumes up to 500 bytes memory, 100000 such objects gives us up to ~50Mb unaccounted memory, this allow container to exceed assigned memcg limits.
Link: https://lkml.kernel.org/r/[email protected] Fixes: 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg") Cc: [email protected] Signed-off-by: Vasily Averin <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Christian Brauner <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1 |
|
| #
7b7b8a2c |
| 16-Oct-2020 |
Randy Dunlap <[email protected]> |
kernel/: fix repeated words in comments
Fix multiple occurrences of duplicated words in kernel/.
Fix one typo/spello on the same line as a duplicate word. Change one instance of "the the" to "that
kernel/: fix repeated words in comments
Fix multiple occurrences of duplicated words in kernel/.
Fix one typo/spello on the same line as a duplicate word. Change one instance of "the the" to "that the". Otherwise just drop one of the repeated words.
Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1 |
|
| #
8eb71d95 |
| 03-Aug-2020 |
Kirill Tkhai <[email protected]> |
pid: Use generic ns_common::count
Switch over pid namespaces to use the newly introduced common lifetime counter.
Currently every namespace type has its own lifetime counter which is stored in the
pid: Use generic ns_common::count
Switch over pid namespaces to use the newly introduced common lifetime counter.
Currently every namespace type has its own lifetime counter which is stored in the specific namespace struct. The lifetime counters are used identically for all namespaces types. Namespaces may of course have additional unrelated counters and these are not altered.
This introduces a common lifetime counter into struct ns_common. The ns_common struct encompasses information that all namespaces share. That should include the lifetime counter since its common for all of them.
It also allows us to unify the type of the counters across all namespaces. Most of them use refcount_t but one uses atomic_t and at least one uses kref. Especially the last one doesn't make much sense since it's just a wrapper around refcount_t since 2016 and actually complicates cleanup operations by having to use container_of() to cast the correct namespace struct out of struct ns_common.
Having the lifetime counter for the namespaces in one place reduces maintenance cost. Not just because after switching all namespaces over we will have removed more code than we added but also because the logic is more easily understandable and we indicate to the user that the basic lifetime requirements for all namespaces are currently identical.
Signed-off-by: Kirill Tkhai <[email protected]> Reviewed-by: Kees Cook <[email protected]> Acked-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/159644979226.604812.7512601754841882036.stgit@localhost.localdomain Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v5.8, v5.8-rc7, v5.8-rc6 |
|
| #
b9a3db92 |
| 19-Jul-2020 |
Adrian Reber <[email protected]> |
pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow writing to ns_last_pid.
Signed-off-by: Adrian Reber <areber@re
pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow writing to ns_last_pid.
Signed-off-by: Adrian Reber <[email protected]> Signed-off-by: Nicolas Viennot <[email protected]> Reviewed-by: Serge Hallyn <[email protected]> Acked-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v5.8-rc5, v5.8-rc4, v5.8-rc3, v5.8-rc2, v5.8-rc1, v5.7, v5.7-rc7, v5.7-rc6, v5.7-rc5 |
|
| #
f2a8d52e |
| 05-May-2020 |
Christian Brauner <[email protected]> |
nsproxy: add struct nsset
Add a simple struct nsset. It holds all necessary pieces to switch to a new set of namespaces without leaving a task in a half-switched state which we will make use of in t
nsproxy: add struct nsset
Add a simple struct nsset. It holds all necessary pieces to switch to a new set of namespaces without leaving a task in a half-switched state which we will make use of in the next patch. This patch switches the existing setns logic over without causing a change in setns() behavior. This brings setns() closer to how unshare() works(). The prepare_ns() function is responsible to prepare all necessary information. This has two reasons. First it minimizes dependencies between individual namespaces, i.e. all install handler can expect that all fields are properly initialized independent in what order they are called in. Second, this makes the code easier to maintain and easier to follow if it needs to be changed.
The prepare_ns() helper will only be switched over to use a flags argument in the next patch. Here it will still use nstype as a simple integer argument which was argued would be clearer. I'm not particularly opinionated about this if it really helps or not. The struct nsset itself already contains the flags field since its name already indicates that it can contain information required by different namespaces. None of this should have functional consequences.
Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: Serge Hallyn <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Serge Hallyn <[email protected]> Cc: Jann Horn <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Aleksa Sarai <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v5.7-rc4, v5.7-rc3 |
|
| #
32927393 |
| 24-Apr-2020 |
Christoph Hellwig <[email protected]> |
sysctl: pass kernel pointers to ->proc_handler
Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from u
sysctl: pass kernel pointers to ->proc_handler
Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer.
As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical.
Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Andrey Ignatov <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
|
Revision tags: v5.7-rc2, v5.7-rc1, v5.6, v5.6-rc7, v5.6-rc6, v5.6-rc5, v5.6-rc4 |
|
| #
af9fe6d6 |
| 28-Feb-2020 |
Eric W. Biederman <[email protected]> |
pid: Improve the comment about waiting in zap_pid_ns_processes
Oleg wrote a very informative comment, but with the removal of proc_cleanup_work it is no longer accurate.
Rewrite the comment so that
pid: Improve the comment about waiting in zap_pid_ns_processes
Oleg wrote a very informative comment, but with the removal of proc_cleanup_work it is no longer accurate.
Rewrite the comment so that it only talks about the details that are still relevant, and hopefully is a little clearer.
Signed-off-by: "Eric W. Biederman" <[email protected]>
show more ...
|
|
Revision tags: v5.6-rc3 |
|
| #
69879c01 |
| 20-Feb-2020 |
Eric W. Biederman <[email protected]> |
proc: Remove the now unnecessary internal mount of proc
There remains no more code in the kernel using pids_ns->proc_mnt, therefore remove it from the kernel.
The big benefit of this change is that
proc: Remove the now unnecessary internal mount of proc
There remains no more code in the kernel using pids_ns->proc_mnt, therefore remove it from the kernel.
The big benefit of this change is that one of the most error prone and tricky parts of the pid namespace implementation, maintaining kernel mounts of proc is removed.
In addition removing the unnecessary complexity of the kernel mount fixes a regression that caused the proc mount options to be ignored. Now that the initial mount of proc comes from userspace, those mount options are again honored. This fixes Android's usage of the proc hidepid option.
Reported-by: Alistair Strachan <[email protected]> Fixes: e94591d0d90c ("proc: Convert proc_mount to use mount_ns.") Signed-off-by: "Eric W. Biederman" <[email protected]>
show more ...
|