|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1 |
|
| #
c4a16820 |
| 28-Jan-2025 |
Christian Brauner <[email protected]> |
fs: add open_tree_attr()
Add open_tree_attr() which allow to atomically create a detached mount tree and set mount options on it. If OPEN_TREE_CLONE is used this will allow the creation of a detache
fs: add open_tree_attr()
Add open_tree_attr() which allow to atomically create a detached mount tree and set mount options on it. If OPEN_TREE_CLONE is used this will allow the creation of a detached mount with a new set of mount options without it ever being exposed to userspace without that set of mount options applied.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: "Seth Forshee (DigitalOcean)" <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6 |
|
| #
6140be90 |
| 26-Apr-2024 |
Christian Göttsche <[email protected]> |
fs/xattr: add *at family syscalls
Add the four syscalls setxattrat(), getxattrat(), listxattrat() and removexattrat(). Those can be used to operate on extended attributes, especially security relat
fs/xattr: add *at family syscalls
Add the four syscalls setxattrat(), getxattrat(), listxattrat() and removexattrat(). Those can be used to operate on extended attributes, especially security related ones, either relative to a pinned directory or on a file descriptor without read access, avoiding a /proc/<pid>/fd/<fd> detour, requiring a mounted procfs.
One use case will be setfiles(8) setting SELinux file contexts ("security.selinux") without race conditions and without a file descriptor opened with read access requiring SELinux read permission.
Use the do_{name}at() pattern from fs/open.c.
Pass the value of the extended attribute, its length, and for setxattrat(2) the command (XATTR_CREATE or XATTR_REPLACE) via an added struct xattr_args to not exceed six syscall arguments and not merging the AT_* and XATTR_* flags.
[AV: fixes by Christian Brauner folded in, the entire thing rebased on top of {filename,file}_...xattr() primitives, treatment of empty pathnames regularized. As the result, AT_EMPTY_PATH+NULL handling is cheap, so f...(2) can use it]
Signed-off-by: Christian Göttsche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Arnd Bergmann <[email protected]> Reviewed-by: Christian Brauner <[email protected]> CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] [brauner: slight tweaks] Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc5 |
|
| #
ff388fe5 |
| 15-Apr-2024 |
Jeff Xu <[email protected]> |
mseal: wire up mseal syscall
Patch series "Introduce mseal", v10.
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual mem
mseal: wire up mseal syscall
Patch series "Introduce mseal", v10.
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.
Two system calls are involved in sealing the map: mmap() and mseal().
The new mseal() is an syscall on 64 bit CPU, and with following signature:
int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved.
mseal() blocks following operations for the given memory range.
1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location, via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory.
The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API.
Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process.
Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity.
Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections.
In closing, I would like to formally acknowledge the valuable contributions received during the RFC process, which were instrumental in shaping this patch:
Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Liam R. Howlett: perf optimization. Linus Torvalds: assisting in defining system call signature and scope. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD.
MM perf benchmarks ================== This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to check the VMAs’ sealing flag, so that no partial update can be made, when any segment within the given memory range is sealed.
To measure the performance impact of this loop, two tests are developed. [8]
The first is measuring the time taken for a particular system call, by using clock_gettime(CLOCK_MONOTONIC). The second is using PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have similar results.
The tests have roughly below sequence: for (i = 0; i < 1000, i++) create 1000 mappings (1 page per VMA) start the sampling for (j = 0; j < 1000, j++) mprotect one mapping stop and save the sample delete 1000 mappings calculates all samples.
Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz, 4G memory, Chromebook.
Based on the latest upstream code: The first test (measuring time) syscall__ vmas t t_mseal delta_ns per_vma % munmap__ 1 909 944 35 35 104% munmap__ 2 1398 1502 104 52 107% munmap__ 4 2444 2594 149 37 106% munmap__ 8 4029 4323 293 37 107% munmap__ 16 6647 6935 288 18 104% munmap__ 32 11811 12398 587 18 105% mprotect 1 439 465 26 26 106% mprotect 2 1659 1745 86 43 105% mprotect 4 3747 3889 142 36 104% mprotect 8 6755 6969 215 27 103% mprotect 16 13748 14144 396 25 103% mprotect 32 27827 28969 1142 36 104% madvise_ 1 240 262 22 22 109% madvise_ 2 366 442 76 38 121% madvise_ 4 623 751 128 32 121% madvise_ 8 1110 1324 215 27 119% madvise_ 16 2127 2451 324 20 115% madvise_ 32 4109 4642 534 17 113%
The second test (measuring cpu cycle) syscall__ vmas cpu cmseal delta_cpu per_vma % munmap__ 1 1790 1890 100 100 106% munmap__ 2 2819 3033 214 107 108% munmap__ 4 4959 5271 312 78 106% munmap__ 8 8262 8745 483 60 106% munmap__ 16 13099 14116 1017 64 108% munmap__ 32 23221 24785 1565 49 107% mprotect 1 906 967 62 62 107% mprotect 2 3019 3203 184 92 106% mprotect 4 6149 6569 420 105 107% mprotect 8 9978 10524 545 68 105% mprotect 16 20448 21427 979 61 105% mprotect 32 40972 42935 1963 61 105% madvise_ 1 434 497 63 63 115% madvise_ 2 752 899 147 74 120% madvise_ 4 1313 1513 200 50 115% madvise_ 8 2271 2627 356 44 116% madvise_ 16 4312 4883 571 36 113% madvise_ 32 8376 9319 943 29 111%
Based on the result, for 6.8 kernel, sealing check adds 20-40 nano seconds, or around 50-100 CPU cycles, per VMA.
In addition, I applied the sealing to 5.10 kernel: The first test (measuring time) syscall__ vmas t tmseal delta_ns per_vma % munmap__ 1 357 390 33 33 109% munmap__ 2 442 463 21 11 105% munmap__ 4 614 634 20 5 103% munmap__ 8 1017 1137 120 15 112% munmap__ 16 1889 2153 263 16 114% munmap__ 32 4109 4088 -21 -1 99% mprotect 1 235 227 -7 -7 97% mprotect 2 495 464 -30 -15 94% mprotect 4 741 764 24 6 103% mprotect 8 1434 1437 2 0 100% mprotect 16 2958 2991 33 2 101% mprotect 32 6431 6608 177 6 103% madvise_ 1 191 208 16 16 109% madvise_ 2 300 324 24 12 108% madvise_ 4 450 473 23 6 105% madvise_ 8 753 806 53 7 107% madvise_ 16 1467 1592 125 8 108% madvise_ 32 2795 3405 610 19 122% The second test (measuring cpu cycle) syscall__ nbr_vma cpu cmseal delta_cpu per_vma % munmap__ 1 684 715 31 31 105% munmap__ 2 861 898 38 19 104% munmap__ 4 1183 1235 51 13 104% munmap__ 8 1999 2045 46 6 102% munmap__ 16 3839 3816 -23 -1 99% munmap__ 32 7672 7887 216 7 103% mprotect 1 397 443 46 46 112% mprotect 2 738 788 50 25 107% mprotect 4 1221 1256 35 9 103% mprotect 8 2356 2429 72 9 103% mprotect 16 4961 4935 -26 -2 99% mprotect 32 9882 10172 291 9 103% madvise_ 1 351 380 29 29 108% madvise_ 2 565 615 49 25 109% madvise_ 4 872 933 61 15 107% madvise_ 8 1508 1640 132 16 109% madvise_ 16 3078 3323 245 15 108% madvise_ 32 5893 6704 811 25 114%
For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30 CPU cycles, there is even decrease in some cases.
It might be interesting to compare 5.10 and 6.8 kernel The first test (measuring time) syscall__ vmas t_5_10 t_6_8 delta_ns per_vma % munmap__ 1 357 909 552 552 254% munmap__ 2 442 1398 956 478 316% munmap__ 4 614 2444 1830 458 398% munmap__ 8 1017 4029 3012 377 396% munmap__ 16 1889 6647 4758 297 352% munmap__ 32 4109 11811 7702 241 287% mprotect 1 235 439 204 204 187% mprotect 2 495 1659 1164 582 335% mprotect 4 741 3747 3006 752 506% mprotect 8 1434 6755 5320 665 471% mprotect 16 2958 13748 10790 674 465% mprotect 32 6431 27827 21397 669 433% madvise_ 1 191 240 49 49 125% madvise_ 2 300 366 67 33 122% madvise_ 4 450 623 173 43 138% madvise_ 8 753 1110 357 45 147% madvise_ 16 1467 2127 660 41 145% madvise_ 32 2795 4109 1314 41 147%
The second test (measuring cpu cycle) syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma % munmap__ 1 684 1790 1106 1106 262% munmap__ 2 861 2819 1958 979 327% munmap__ 4 1183 4959 3776 944 419% munmap__ 8 1999 8262 6263 783 413% munmap__ 16 3839 13099 9260 579 341% munmap__ 32 7672 23221 15549 486 303% mprotect 1 397 906 509 509 228% mprotect 2 738 3019 2281 1140 409% mprotect 4 1221 6149 4929 1232 504% mprotect 8 2356 9978 7622 953 423% mprotect 16 4961 20448 15487 968 412% mprotect 32 9882 40972 31091 972 415% madvise_ 1 351 434 82 82 123% madvise_ 2 565 752 186 93 133% madvise_ 4 872 1313 442 110 151% madvise_ 8 1508 2271 763 95 151% madvise_ 16 3078 4312 1234 77 140% madvise_ 32 5893 8376 2483 78 142%
From 5.10 to 6.8 munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma. mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma. madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma.
In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the increase from 5.10 to 6.8 is significantly larger, approximately ten times greater for munmap and mprotect.
When I discuss the mm performance with Brian Makin, an engineer who worked on performance, it was brought to my attention that such performance benchmarks, which measuring millions of mm syscall in a tight loop, may not accurately reflect real-world scenarios, such as that of a database service. Also this is tested using a single HW and ChromeOS, the data from another HW or distribution might be different. It might be best to take this data with a grain of salt.
This patch (of 5):
Wire up mseal syscall for all architectures.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jeff Xu <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Jann Horn <[email protected]> [Bug #2] Cc: Jeff Xu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Jorge Lucangeli Obes <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muhammad Usama Anjum <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Stephen Röttger <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Amer Al Shanawany <[email protected]> Cc: Javier Carrasco <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2 |
|
| #
b973afe9 |
| 24-Jan-2024 |
Al Viro <[email protected]> |
alpha: add clone3() support
Since clone3() needs the full register state saved for copying into the child, it needs the same kind of wrapper as fork(), vfork() and clone(). Exact same wrapper works
alpha: add clone3() support
Since clone3() needs the full register state saved for copying into the child, it needs the same kind of wrapper as fork(), vfork() and clone(). Exact same wrapper works, actually...
Signed-off-by: Al Viro <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: Matt Turner <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6 |
|
| #
d8b0f546 |
| 25-Oct-2023 |
Miklos Szeredi <[email protected]> |
wire up syscalls for statmount/listmount
Wire up all archs.
Signed-off-by: Miklos Szeredi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed
wire up syscalls for statmount/listmount
Wire up all archs.
Signed-off-by: Miklos Szeredi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Ian Kent <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2 |
|
| #
5f423759 |
| 12-Sep-2023 |
Casey Schaufler <[email protected]> |
LSM: wireup Linux Security Module syscalls
Wireup lsm_get_self_attr, lsm_set_self_attr and lsm_list_modules system calls.
Signed-off-by: Casey Schaufler <[email protected]> Reviewed-by: Kees C
LSM: wireup Linux Security Module syscalls
Wireup lsm_get_self_attr, lsm_set_self_attr and lsm_list_modules system calls.
Signed-off-by: Casey Schaufler <[email protected]> Reviewed-by: Kees Cook <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Cc: [email protected] Reviewed-by: Mickaël Salaün <[email protected]> [PM: forward ported beyond v6.6 due merge window changes] Signed-off-by: Paul Moore <[email protected]>
show more ...
|
| #
dcc13451 |
| 16-Oct-2023 |
Peter Zijlstra <[email protected]> |
alpha: Fix up new futex syscall numbers
As per Arnd, Alpha syscalls since time64 are offset by 120, retain this offset.
Fixes: 9f6c532f59b2 ("futex: Add sys_futex_wake()") Reported-by: Arnd Bergman
alpha: Fix up new futex syscall numbers
As per Arnd, Alpha syscalls since time64 are offset by 120, retain this offset.
Fixes: 9f6c532f59b2 ("futex: Add sys_futex_wake()") Reported-by: Arnd Bergmann <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
show more ...
|
| #
2fd0ebad |
| 14-Sep-2023 |
Sohil Mehta <[email protected]> |
arch: Reserve map_shadow_stack() syscall number for all architectures
commit c35559f94ebc ("x86/shstk: Introduce map_shadow_stack syscall") recently added support for map_shadow_stack() but it is li
arch: Reserve map_shadow_stack() syscall number for all architectures
commit c35559f94ebc ("x86/shstk: Introduce map_shadow_stack syscall") recently added support for map_shadow_stack() but it is limited to x86 only for now. There is a possibility that other architectures (namely, arm64 and RISC-V), that are implementing equivalent support for shadow stacks, might need to add support for it.
Independent of that, reserving arch-specific syscall numbers in the syscall tables of all architectures is good practice and would help avoid future conflicts. map_shadow_stack() is marked as a conditional syscall in sys_ni.c. Adding it to the syscall tables of other architectures is harmless and would return ENOSYS when exercised.
Note, map_shadow_stack() was assigned #453 during the merge process since #452 was taken by fchmodat2().
For Powerpc, map it to sys_ni_syscall() as is the norm for Powerpc syscall tables.
For Alpha, map_shadow_stack() takes up #563 as Alpha still diverges from the common syscall numbering system in the other architectures.
Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Sohil Mehta <[email protected]> Reviewed-by: Rick Edgecombe <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Acked-by: Catalin Marinas <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2 |
|
| #
ccab211a |
| 10-Jul-2023 |
Sohil Mehta <[email protected]> |
syscalls: Cleanup references to sys_lookup_dcookie()
commit 'be65de6b03aa ("fs: Remove dcookies support")' removed the syscall definition for lookup_dcookie. However, syscall tables still point to
syscalls: Cleanup references to sys_lookup_dcookie()
commit 'be65de6b03aa ("fs: Remove dcookies support")' removed the syscall definition for lookup_dcookie. However, syscall tables still point to the old sys_lookup_dcookie() definition. Update syscall tables of all architectures to directly point to sys_ni_syscall() instead.
Signed-off-by: Sohil Mehta <[email protected]> Reviewed-by: Randy Dunlap <[email protected]> Acked-by: Namhyung Kim <[email protected]> # for perf Acked-by: Russell King (Oracle) <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
show more ...
|
| #
0f4b5f97 |
| 21-Sep-2023 |
[email protected] <[email protected]> |
futex: Add sys_futex_requeue()
Finish off the 'simple' futex2 syscall group by adding sys_futex_requeue(). Unlike sys_futex_{wait,wake}() its arguments are too numerous to fit into a regular syscall
futex: Add sys_futex_requeue()
Finish off the 'simple' futex2 syscall group by adding sys_futex_requeue(). Unlike sys_futex_{wait,wake}() its arguments are too numerous to fit into a regular syscall. As such, use struct futex_waitv to pass the 'source' and 'destination' futexes to the syscall.
This syscall implements what was previously known as FUTEX_CMP_REQUEUE and uses {val, uaddr, flags} for source and {uaddr, flags} for destination.
This design explicitly allows requeueing between different types of futex by having a different flags word per uaddr.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
cb8c4312 |
| 21-Sep-2023 |
[email protected] <[email protected]> |
futex: Add sys_futex_wait()
To complement sys_futex_waitv()/wake(), add sys_futex_wait(). This syscall implements what was previously known as FUTEX_WAIT_BITSET except it uses 'unsigned long' for th
futex: Add sys_futex_wait()
To complement sys_futex_waitv()/wake(), add sys_futex_wait(). This syscall implements what was previously known as FUTEX_WAIT_BITSET except it uses 'unsigned long' for the value and bitmask arguments, takes timespec and clockid_t arguments for the absolute timeout and uses FUTEX2 flags.
The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
9f6c532f |
| 21-Sep-2023 |
[email protected] <[email protected]> |
futex: Add sys_futex_wake()
To complement sys_futex_waitv() add sys_futex_wake(). This syscall implements what was previously known as FUTEX_WAKE_BITSET except it uses 'unsigned long' for the bitmas
futex: Add sys_futex_wake()
To complement sys_futex_waitv() add sys_futex_wake(). This syscall implements what was previously known as FUTEX_WAKE_BITSET except it uses 'unsigned long' for the bitmask and takes FUTEX2 flags.
The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
78252deb |
| 11-Jul-2023 |
Palmer Dabbelt <[email protected]> |
arch: Register fchmodat2, usually as syscall 452
This registers the new fchmodat2 syscall in most places as nuber 452, with alpha being the exception where it's 562. I found all these sites by grep
arch: Register fchmodat2, usually as syscall 452
This registers the new fchmodat2 syscall in most places as nuber 452, with alpha being the exception where it's 562. I found all these sites by grepping for fspick, which I assume has found me everything.
Signed-off-by: Palmer Dabbelt <[email protected]> Signed-off-by: Alexey Gladkov <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Message-Id: <a677d521f048e4ca439e7080a5328f21eb8e960e.1689092120.git.legion@kernel.org> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2 |
|
| #
946e697c |
| 10-May-2023 |
Nhat Pham <[email protected]> |
cachestat: wire up cachestat for other architectures
cachestat is previously only wired in for x86 (and architectures using the generic unistd.h table):
https://lore.kernel.org/lkml/20230503013608.
cachestat: wire up cachestat for other architectures
cachestat is previously only wired in for x86 (and architectures using the generic unistd.h table):
https://lore.kernel.org/lkml/[email protected]/
This patch wires cachestat in for all the other architectures.
[[email protected]: wire up cachestat for arm64] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nhat Pham <[email protected]> Tested-by: Michael Ellerman <[email protected]> [powerpc] Acked-by: Geert Uytterhoeven <[email protected]> [m68k] Reviewed-by: Arnd Bergmann <[email protected]> Acked-by: Heiko Carstens <[email protected]> [s390] Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Chris Zankel <[email protected]> Cc: David S. Miller <[email protected]> Cc: Helge Deller <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1 |
|
| #
c6cc4f72 |
| 04-Oct-2022 |
Lukas Bulwahn <[email protected]> |
alpha: remove the needless aliases osf_{readv,writev}
Commit 987f20a9dcce ("a.out: Remove the a.out implementation") removes CONFIG_OSF4_COMPAT and its functionality. Hence, sys_osf_{readv,writev} a
alpha: remove the needless aliases osf_{readv,writev}
Commit 987f20a9dcce ("a.out: Remove the a.out implementation") removes CONFIG_OSF4_COMPAT and its functionality. Hence, sys_osf_{readv,writev} are now just aliases of sys_{readv,writev}.
Remove these needless aliases.
[ Identical patch also posted by Jason A. Donenfeld ]
Link: https://lore.kernel.org/lkml/CAHk-=wjwvBc3VQMNtUVUrMBVoMPSPu26OuatZ_+1gZ2m-PmmRA@mail.gmail.com/ Link: https://lore.kernel.org/all/[email protected]/ Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Lukas Bulwahn <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1 |
|
| #
21b084fd |
| 14-Jan-2022 |
Aneesh Kumar K.V <[email protected]> |
mm/mempolicy: wire up syscall set_mempolicy_home_node
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]>
mm/mempolicy: wire up syscall set_mempolicy_home_node
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: Ben Widawsky <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Feng Tang <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Dan Williams <[email protected]> Cc: Huang Ying <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3 |
|
| #
a0eb2da9 |
| 24-Nov-2021 |
André Almeida <[email protected]> |
futex: Wireup futex_waitv syscall
Wireup futex_waitv syscall for all remaining archs.
Signed-off-by: André Almeida <[email protected]> Acked-by: Max Filippov <[email protected]> Acked-by:
futex: Wireup futex_waitv syscall
Wireup futex_waitv syscall for all remaining archs.
Signed-off-by: André Almeida <[email protected]> Acked-by: Max Filippov <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Tested-by: Michael Ellerman <[email protected]> (powerpc) Signed-off-by: Arnd Bergmann <[email protected]>
show more ...
|
|
Revision tags: v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1 |
|
| #
dce49103 |
| 02-Sep-2021 |
Suren Baghdasaryan <[email protected]> |
mm: wire up syscall process_mrelease
Split off from prev patch in the series that implements the syscall.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Su
mm: wire up syscall process_mrelease
Split off from prev patch in the series that implements the syscall.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Florian Weimer <[email protected]> Cc: Jan Engelhardt <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Tim Murray <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1 |
|
| #
b48c7236 |
| 29-Jun-2021 |
Eric W. Biederman <[email protected]> |
exit/bdflush: Remove the deprecated bdflush system call
The bdflush system call has been deprecated for a very long time. Recently Michael Schmitz tested[1] and found that the last known caller of o
exit/bdflush: Remove the deprecated bdflush system call
The bdflush system call has been deprecated for a very long time. Recently Michael Schmitz tested[1] and found that the last known caller of of the bdflush system call is unaffected by it's removal.
Since the code is not needed delete it.
[1] https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/87sg10quue.fsf_-_@disp2133 Tested-by: Michael Schmitz <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Acked-by: Cyril Hrubis <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
show more ...
|
|
Revision tags: v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5 |
|
| #
65ffb3d6 |
| 31-May-2021 |
Jan Kara <[email protected]> |
quota: Wire up quotactl_fd syscall
Wire up the quotactl_fd syscall.
Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jan Kara <[email protected]>
|
|
Revision tags: v5.13-rc4, v5.13-rc3 |
|
| #
5b9fedb3 |
| 17-May-2021 |
Jan Kara <[email protected]> |
quota: Disable quotactl_path syscall
In commit fa8b90070a80 ("quota: wire up quotactl_path") we have wired up new quotactl_path syscall. However some people in LWN discussion have objected that the
quota: Disable quotactl_path syscall
In commit fa8b90070a80 ("quota: wire up quotactl_path") we have wired up new quotactl_path syscall. However some people in LWN discussion have objected that the path based syscall is missing dirfd and flags argument which is mostly standard for contemporary path based syscalls. Indeed they have a point and after a discussion with Christian Brauner and Sascha Hauer I've decided to disable the syscall for now and update its API. Since there is no userspace currently using that syscall and it hasn't been released in any major release, we should be fine.
CC: Christian Brauner <[email protected]> CC: Sascha Hauer <[email protected]> Link: https://lore.kernel.org/lkml/20210512153621.n5u43jsytbik4yze@wittgenstein Signed-off-by: Jan Kara <[email protected]>
show more ...
|
|
Revision tags: v5.13-rc2, v5.13-rc1, v5.12 |
|
| #
a49f4f81 |
| 22-Apr-2021 |
Mickaël Salaün <[email protected]> |
arch: Wire up Landlock syscalls
Wire up the following system calls for all architectures: * landlock_create_ruleset(2) * landlock_add_rule(2) * landlock_restrict_self(2)
Cc: Arnd Bergmann <arnd@arn
arch: Wire up Landlock syscalls
Wire up the following system calls for all architectures: * landlock_create_ruleset(2) * landlock_add_rule(2) * landlock_restrict_self(2)
Cc: Arnd Bergmann <[email protected]> Cc: James Morris <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kees Cook <[email protected]> Cc: Serge E. Hallyn <[email protected]> Signed-off-by: Mickaël Salaün <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: James Morris <[email protected]>
show more ...
|
|
Revision tags: v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2 |
|
| #
fa8b9007 |
| 04-Mar-2021 |
Sascha Hauer <[email protected]> |
quota: wire up quotactl_path
Wire up the quotactl_path syscall added in the previous patch.
Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sascha Hauer
quota: wire up quotactl_path
Wire up the quotactl_path syscall added in the previous patch.
Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sascha Hauer <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jan Kara <[email protected]>
show more ...
|
|
Revision tags: v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5 |
|
| #
2a186721 |
| 21-Jan-2021 |
Christian Brauner <[email protected]> |
fs: add mount_setattr()
This implements the missing mount_setattr() syscall. While the new mount api allows to change the properties of a superblock there is currently no way to change the propertie
fs: add mount_setattr()
This implements the missing mount_setattr() syscall. While the new mount api allows to change the properties of a superblock there is currently no way to change the properties of a mount or a mount tree using file descriptors which the new mount api is based on. In addition the old mount api has the restriction that mount options cannot be applied recursively. This hasn't changed since changing mount options on a per-mount basis was implemented in [1] and has been a frequent request not just for convenience but also for security reasons. The legacy mount syscall is unable to accommodate this behavior without introducing a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND | MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost mount. Changing MS_REC to apply to the whole mount tree would mean introducing a significant uapi change and would likely cause significant regressions.
The new mount_setattr() syscall allows to recursively clear and set mount options in one shot. Multiple calls to change mount options requesting the same changes are idempotent:
int mount_setattr(int dfd, const char *path, unsigned flags, struct mount_attr *uattr, size_t usize);
Flags to modify path resolution behavior are specified in the @flags argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW, and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to restrict path resolution as introduced with openat2() might be supported in the future.
The mount_setattr() syscall can be expected to grow over time and is designed with extensibility in mind. It follows the extensible syscall pattern we have used with other syscalls such as openat2(), clone3(), sched_{set,get}attr(), and others. The set of mount options is passed in the uapi struct mount_attr which currently has the following layout:
struct mount_attr { __u64 attr_set; __u64 attr_clr; __u64 propagation; __u64 userns_fd; };
The @attr_set and @attr_clr members are used to clear and set mount options. This way a user can e.g. request that a set of flags is to be raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in @attr_set while at the same time requesting that another set of flags is to be lowered such as removing noexec from a mount tree by specifying MOUNT_ATTR_NOEXEC in @attr_clr.
Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0, not a bitmap, users wanting to transition to a different atime setting cannot simply specify the atime setting in @attr_set, but must also specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in @attr_clr.
The @propagation field lets callers specify the propagation type of a mount tree. Propagation is a single property that has four different settings and as such is not really a flag argument but an enum. Specifically, it would be unclear what setting and clearing propagation settings in combination would amount to. The legacy mount() syscall thus forbids the combination of multiple propagation settings too. The goal is to keep the semantics of mount propagation somewhat simple as they are overly complex as it is.
The @userns_fd field lets user specify a user namespace whose idmapping becomes the idmapping of the mount. This is implemented and explained in detail in the next patch.
[1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
Link: https://lore.kernel.org/r/[email protected] Cc: David Howells <[email protected]> Cc: Aleksa Sarai <[email protected]> Cc: Al Viro <[email protected]> Cc: [email protected] Cc: [email protected] Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1 |
|
| #
b0a0c261 |
| 18-Dec-2020 |
Willem de Bruijn <[email protected]> |
epoll: wire up syscall epoll_pwait2
Split off from prev patch in the series that implements the syscall.
Link: https://lkml.kernel.org/r/[email protected] Sig
epoll: wire up syscall epoll_pwait2
Split off from prev patch in the series that implements the syscall.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Willem de Bruijn <[email protected]> Cc: Al Viro <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|