|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1 |
|
| #
29d80d50 |
| 21-Jan-2025 |
Yuichiro Tsuji <[email protected]> |
open: Fix return type of several functions from long to int
Fix the return type of several functions from long to int to match its actu al behavior. These functions only return int values. This chan
open: Fix return type of several functions from long to int
Fix the return type of several functions from long to int to match its actu al behavior. These functions only return int values. This change improves type consistency across the filesystem code and aligns the function signatu re with its existing implementation and usage.
Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Yuichiro Tsuji <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
c4a16820 |
| 28-Jan-2025 |
Christian Brauner <[email protected]> |
fs: add open_tree_attr()
Add open_tree_attr() which allow to atomically create a detached mount tree and set mount options on it. If OPEN_TREE_CLONE is used this will allow the creation of a detache
fs: add open_tree_attr()
Add open_tree_attr() which allow to atomically create a detached mount tree and set mount options on it. If OPEN_TREE_CLONE is used this will allow the creation of a detached mount with a new set of mount options without it ever being exposed to userspace without that set of mount options applied.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: "Seth Forshee (DigitalOcean)" <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6 |
|
| #
6140be90 |
| 26-Apr-2024 |
Christian Göttsche <[email protected]> |
fs/xattr: add *at family syscalls
Add the four syscalls setxattrat(), getxattrat(), listxattrat() and removexattrat(). Those can be used to operate on extended attributes, especially security relat
fs/xattr: add *at family syscalls
Add the four syscalls setxattrat(), getxattrat(), listxattrat() and removexattrat(). Those can be used to operate on extended attributes, especially security related ones, either relative to a pinned directory or on a file descriptor without read access, avoiding a /proc/<pid>/fd/<fd> detour, requiring a mounted procfs.
One use case will be setfiles(8) setting SELinux file contexts ("security.selinux") without race conditions and without a file descriptor opened with read access requiring SELinux read permission.
Use the do_{name}at() pattern from fs/open.c.
Pass the value of the extended attribute, its length, and for setxattrat(2) the command (XATTR_CREATE or XATTR_REPLACE) via an added struct xattr_args to not exceed six syscall arguments and not merging the AT_* and XATTR_* flags.
[AV: fixes by Christian Brauner folded in, the entire thing rebased on top of {filename,file}_...xattr() primitives, treatment of empty pathnames regularized. As the result, AT_EMPTY_PATH+NULL handling is cheap, so f...(2) can use it]
Signed-off-by: Christian Göttsche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Arnd Bergmann <[email protected]> Reviewed-by: Christian Brauner <[email protected]> CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] CC: [email protected] [brauner: slight tweaks] Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
| #
4356d575 |
| 28-Aug-2024 |
Aleksa Sarai <[email protected]> |
fhandle: expose u64 mount id to name_to_handle_at(2)
Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file
fhandle: expose u64 mount id to name_to_handle_at(2)
Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing or having to open a file just to do statx(2).
While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call, turning
err = name_to_handle_at(-EBADF, "/foo/bar/baz", &handle, &mntid, AT_HANDLE_MNT_ID_UNIQUE);
into
int fd = openat(-EBADF, "/foo/bar/baz", O_PATH | O_CLOEXEC); err1 = name_to_handle_at(fd, "", &handle, &unused_mntid, AT_EMPTY_PATH); err2 = statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, &statxbuf); mntid = statxbuf.stx_mnt_id; close(fd);
Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Aleksa Sarai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
63e2f40c |
| 29-Jun-2024 |
Arnd Bergmann <[email protected]> |
syscalls: fix sys_fanotify_mark prototype
My earlier fix missed an incorrect function prototype that shows up on native 32-bit builds:
In file included from fs/notify/fanotify/fanotify_user.c:14: i
syscalls: fix sys_fanotify_mark prototype
My earlier fix missed an incorrect function prototype that shows up on native 32-bit builds:
In file included from fs/notify/fanotify/fanotify_user.c:14: include/linux/syscalls.h:248:25: error: conflicting types for 'sys_fanotify_mark'; have 'long int(int, unsigned int, u32, u32, int, const char *)' {aka 'long int(int, unsigned int, unsigned int, unsigned int, int, const char *)'} 1924 | SYSCALL32_DEFINE6(fanotify_mark, | ^~~~~~~~~~~~~~~~~ include/linux/syscalls.h:862:17: note: previous declaration of 'sys_fanotify_mark' with type 'long int(int, unsigned int, u64, int, const char *)' {aka 'long int(int, unsigned int, long long unsigned int, int, const char *)'}
On x86 and powerpc, the prototype is also wrong but hidden in an #ifdef, so it never caused problems.
Add another alternative declaration that matches the conditional function definition.
Fixes: 403f17a33073 ("parisc: use generic sys_fanotify_mark implementation") Cc: [email protected] Reported-by: Guenter Roeck <[email protected]> Reported-by: Geert Uytterhoeven <[email protected]> Reported-by: kernel test robot <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
show more ...
|
| #
0fa8ab5f |
| 20-Jun-2024 |
Arnd Bergmann <[email protected]> |
linux/syscalls.h: add missing __user annotations
A couple of declarations in linux/syscalls.h are missing __user annotations on their pointers, which can lead to warnings from sparse because these d
linux/syscalls.h: add missing __user annotations
A couple of declarations in linux/syscalls.h are missing __user annotations on their pointers, which can lead to warnings from sparse because these don't match the implementation that have the correct address space annotations.
Signed-off-by: Arnd Bergmann <[email protected]>
show more ...
|
| #
4b8e88e5 |
| 19-Jun-2024 |
Arnd Bergmann <[email protected]> |
ftruncate: pass a signed offset
The old ftruncate() syscall, using the 32-bit off_t misses a sign extension when called in compat mode on 64-bit architectures. As a result, passing a negative lengt
ftruncate: pass a signed offset
The old ftruncate() syscall, using the 32-bit off_t misses a sign extension when called in compat mode on 64-bit architectures. As a result, passing a negative length accidentally succeeds in truncating to file size between 2GiB and 4GiB.
Changing the type of the compat syscall to the signed compat_off_t changes the behavior so it instead returns -EINVAL.
The native entry point, the truncate() syscall and the corresponding loff_t based variants are all correct already and do not suffer from this mistake.
Fixes: 3f6d078d4acc ("fix compat truncate/ftruncate") Reviewed-by: Christian Brauner <[email protected]> Cc: [email protected] Signed-off-by: Arnd Bergmann <[email protected]>
show more ...
|
| #
190fec72 |
| 11-Jun-2024 |
Jiri Olsa <[email protected]> |
uprobe: Wire up uretprobe system call
Wiring up uretprobe system call, which comes in following changes. We need to do the wiring before, because the uretprobe implementation needs the syscall numbe
uprobe: Wire up uretprobe system call
Wiring up uretprobe system call, which comes in following changes. We need to do the wiring before, because the uretprobe implementation needs the syscall number.
Note at the moment uretprobe syscall is supported only for native 64-bit process.
Link: https://lore.kernel.org/all/[email protected]/
Reviewed-by: Oleg Nesterov <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc5 |
|
| #
8be7258a |
| 15-Apr-2024 |
Jeff Xu <[email protected]> |
mseal: add mseal syscall
The new mseal() is an syscall on 64 bit CPU, and with following signature:
int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved.
m
mseal: add mseal syscall
The new mseal() is an syscall on 64 bit CPU, and with following signature:
int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved.
mseal() blocks following operations for the given memory range.
1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location, via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory.
Following input during RFC are incooperated into this patch:
Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Linus Torvalds: assisting in defining system call signature and scope. Liam R. Howlett: perf optimization. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD.
Finally, the idea that inspired this patch comes from Stephen Röttger's work in Chrome V8 CFI.
[[email protected]: add branch prediction hint, per Pedro] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jeff Xu <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Jorge Lucangeli Obes <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muhammad Usama Anjum <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Stephen Röttger <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Amer Al Shanawany <[email protected]> Cc: Javier Carrasco <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1 |
|
| #
a5a858f6 |
| 14-Mar-2024 |
Casey Schaufler <[email protected]> |
lsm: use 32-bit compatible data types in LSM syscalls
Change the size parameters in lsm_list_modules(), lsm_set_self_attr() and lsm_get_self_attr() from size_t to u32. This avoids the need to have d
lsm: use 32-bit compatible data types in LSM syscalls
Change the size parameters in lsm_list_modules(), lsm_set_self_attr() and lsm_get_self_attr() from size_t to u32. This avoids the need to have different interfaces for 32 and 64 bit systems.
Cc: [email protected] Fixes: a04a1198088a ("LSM: syscalls for current process attributes") Fixes: ad4aff9ec25f ("LSM: Create lsm_list_modules system call") Signed-off-by: Casey Schaufler <[email protected]> Reported-and-reviewed-by: Dmitry V. Levin <[email protected]> [PM: subject and metadata tweaks, syscall.h fixes] Signed-off-by: Paul Moore <[email protected]>
show more ...
|
|
Revision tags: v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1 |
|
| #
56062d60 |
| 10-Jan-2024 |
Richard Palethorpe <[email protected]> |
x86/entry/ia32: Ensure s32 is sign extended to s64
Presently ia32 registers stored in ptregs are unconditionally cast to unsigned int by the ia32 stub. They are then cast to long when passed to __se
x86/entry/ia32: Ensure s32 is sign extended to s64
Presently ia32 registers stored in ptregs are unconditionally cast to unsigned int by the ia32 stub. They are then cast to long when passed to __se_sys*, but will not be sign extended.
This takes the sign of the syscall argument into account in the ia32 stub. It still casts to unsigned int to avoid implementation specific behavior. However then casts to int or unsigned int as necessary. So that the following cast to long sign extends the value.
This fixes the io_pgetevents02 LTP test when compiled with -m32. Presently the systemcall io_pgetevents_time64() unexpectedly accepts -1 for the maximum number of events.
It doesn't appear other systemcalls with signed arguments are effected because they all have compat variants defined and wired up.
Fixes: ebeb8c82ffaf ("syscalls/x86: Use 'struct pt_regs' based syscall calling for IA32_EMULATION and x32") Suggested-by: Arnd Bergmann <[email protected]> Signed-off-by: Richard Palethorpe <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/ltp/[email protected]/
show more ...
|
| #
ba5afb9a |
| 12-Jan-2024 |
Christian Brauner <[email protected]> |
fs: rework listmount() implementation
Linus pointed out that there's error handling and naming issues in the that we should rewrite:
* Perform the access checks for the buffer before actually doing
fs: rework listmount() implementation
Linus pointed out that there's error handling and naming issues in the that we should rewrite:
* Perform the access checks for the buffer before actually doing any work instead of doing it during the iteration. * Rename the arguments to listmount() and do_listmount() to clarify what the arguments are used for. * Get rid of the pointless ctr variable and overflow checking. * Get rid of the pointless speculation check.
Link: https://lore.kernel.org/r/CAHk-=wjh6Cypo8WC-McXgSzCaou3UXccxB+7PVeSuGR8AjCphg@mail.gmail.com Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6 |
|
| #
b4c2bea8 |
| 25-Oct-2023 |
Miklos Szeredi <[email protected]> |
add listmount(2) syscall
Add way to query the children of a particular mount. This is a more flexible way to iterate the mount tree than having to parse /proc/self/mountinfo.
Lookup the mount by t
add listmount(2) syscall
Add way to query the children of a particular mount. This is a more flexible way to iterate the mount tree than having to parse /proc/self/mountinfo.
Lookup the mount by the new 64bit mount ID. If a mount needs to be queried based on path, then statx(2) can be used to first query the mount ID belonging to the path.
Return an array of new (64bit) mount ID's. Without privileges only mounts are listed which are reachable from the task's root.
Folded into this patch are several later improvements. Keeping them separate would make the history pointlessly confusing:
* Recursive listing of mounts is the default now (cf. [1]). * Remove explicit LISTMOUNT_UNREACHABLE flag (cf. [1]) and fail if mount is unreachable from current root. This also makes permission checking consistent with statmount() (cf. [3]). * Start listing mounts in unique mount ID order (cf. [2]) to allow continuing listmount() from a midpoint. * Allow to continue listmount(). The @request_mask parameter is renamed and to @param to be usable by both statmount() and listmount(). If @param is set to a mount id then listmount() will continue listing mounts from that id on. This allows listing mounts in multiple listmount invocations without having to resize the buffer. If @param is zero then the listing starts from the beginning (cf. [4]). * Don't return EOVERFLOW, instead return the buffer size which allows to detect a full buffer as well (cf. [4]).
Signed-off-by: Miklos Szeredi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Ian Kent <[email protected]> Link: https://lore.kernel.org/r/[email protected] [1] (folded) Link: https://lore.kernel.org/r/[email protected] [2] (folded) Link: https://lore.kernel.org/r/[email protected] [3] (folded) Link: https://lore.kernel.org/r/[email protected] [4] (folded) [Christian Brauner <[email protected]>: various smaller fixes] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
46eae99e |
| 25-Oct-2023 |
Miklos Szeredi <[email protected]> |
add statmount(2) syscall
Add a way to query attributes of a single mount instead of having to parse the complete /proc/$PID/mountinfo, which might be huge.
Lookup the mount the new 64bit mount ID.
add statmount(2) syscall
Add a way to query attributes of a single mount instead of having to parse the complete /proc/$PID/mountinfo, which might be huge.
Lookup the mount the new 64bit mount ID. If a mount needs to be queried based on path, then statx(2) can be used to first query the mount ID belonging to the path.
Design is based on a suggestion by Linus:
"So I'd suggest something that is very much like "statfsat()", which gets a buffer and a length, and returns an extended "struct statfs" *AND* just a string description at the end."
The interface closely mimics that of statx.
Handle ASCII attributes by appending after the end of the structure (as per above suggestion). Pointers to strings are stored in u64 members to make the structure the same regardless of pointer size. Strings are nul terminated.
Link: https://lore.kernel.org/all/CAHk-=wh5YifP7hzKSbwJj94+DZ2czjrZsczy6GBimiogZws=rg@mail.gmail.com/ Signed-off-by: Miklos Szeredi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Ian Kent <[email protected]> [Christian Brauner <[email protected]>: various minor changes] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2 |
|
| #
ad4aff9e |
| 12-Sep-2023 |
Casey Schaufler <[email protected]> |
LSM: Create lsm_list_modules system call
Create a system call to report the list of Linux Security Modules that are active on the system. The list is provided as an array of LSM ID numbers.
The cal
LSM: Create lsm_list_modules system call
Create a system call to report the list of Linux Security Modules that are active on the system. The list is provided as an array of LSM ID numbers.
The calling application can use this list determine what LSM specific actions it might take. That might include choosing an output format, determining required privilege or bypassing security module specific behavior.
Signed-off-by: Casey Schaufler <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Serge Hallyn <[email protected]> Reviewed-by: John Johansen <[email protected]> Reviewed-by: Mickaël Salaün <[email protected]> Signed-off-by: Paul Moore <[email protected]>
show more ...
|
| #
a04a1198 |
| 12-Sep-2023 |
Casey Schaufler <[email protected]> |
LSM: syscalls for current process attributes
Create a system call lsm_get_self_attr() to provide the security module maintained attributes of the current process. Create a system call lsm_set_self_a
LSM: syscalls for current process attributes
Create a system call lsm_get_self_attr() to provide the security module maintained attributes of the current process. Create a system call lsm_set_self_attr() to set a security module maintained attribute of the current process. Historically these attributes have been exposed to user space via entries in procfs under /proc/self/attr.
The attribute value is provided in a lsm_ctx structure. The structure identifies the size of the attribute, and the attribute value. The format of the attribute value is defined by the security module. A flags field is included for LSM specific information. It is currently unused and must be 0. The total size of the data, including the lsm_ctx structure and any padding, is maintained as well.
struct lsm_ctx { __u64 id; __u64 flags; __u64 len; __u64 ctx_len; __u8 ctx[]; };
Two new LSM hooks are used to interface with the LSMs. security_getselfattr() collects the lsm_ctx values from the LSMs that support the hook, accounting for space requirements. security_setselfattr() identifies which LSM the attribute is intended for and passes it along.
Signed-off-by: Casey Schaufler <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Serge Hallyn <[email protected]> Reviewed-by: John Johansen <[email protected]> Signed-off-by: Paul Moore <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2 |
|
| #
ccab211a |
| 10-Jul-2023 |
Sohil Mehta <[email protected]> |
syscalls: Cleanup references to sys_lookup_dcookie()
commit 'be65de6b03aa ("fs: Remove dcookies support")' removed the syscall definition for lookup_dcookie. However, syscall tables still point to
syscalls: Cleanup references to sys_lookup_dcookie()
commit 'be65de6b03aa ("fs: Remove dcookies support")' removed the syscall definition for lookup_dcookie. However, syscall tables still point to the old sys_lookup_dcookie() definition. Update syscall tables of all architectures to directly point to sys_ni_syscall() instead.
Signed-off-by: Sohil Mehta <[email protected]> Reviewed-by: Randy Dunlap <[email protected]> Acked-by: Namhyung Kim <[email protected]> # for perf Acked-by: Russell King (Oracle) <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
show more ...
|
| #
0f4b5f97 |
| 21-Sep-2023 |
[email protected] <[email protected]> |
futex: Add sys_futex_requeue()
Finish off the 'simple' futex2 syscall group by adding sys_futex_requeue(). Unlike sys_futex_{wait,wake}() its arguments are too numerous to fit into a regular syscall
futex: Add sys_futex_requeue()
Finish off the 'simple' futex2 syscall group by adding sys_futex_requeue(). Unlike sys_futex_{wait,wake}() its arguments are too numerous to fit into a regular syscall. As such, use struct futex_waitv to pass the 'source' and 'destination' futexes to the syscall.
This syscall implements what was previously known as FUTEX_CMP_REQUEUE and uses {val, uaddr, flags} for source and {uaddr, flags} for destination.
This design explicitly allows requeueing between different types of futex by having a different flags word per uaddr.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
cb8c4312 |
| 21-Sep-2023 |
[email protected] <[email protected]> |
futex: Add sys_futex_wait()
To complement sys_futex_waitv()/wake(), add sys_futex_wait(). This syscall implements what was previously known as FUTEX_WAIT_BITSET except it uses 'unsigned long' for th
futex: Add sys_futex_wait()
To complement sys_futex_waitv()/wake(), add sys_futex_wait(). This syscall implements what was previously known as FUTEX_WAIT_BITSET except it uses 'unsigned long' for the value and bitmask arguments, takes timespec and clockid_t arguments for the absolute timeout and uses FUTEX2 flags.
The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
9f6c532f |
| 21-Sep-2023 |
[email protected] <[email protected]> |
futex: Add sys_futex_wake()
To complement sys_futex_waitv() add sys_futex_wake(). This syscall implements what was previously known as FUTEX_WAKE_BITSET except it uses 'unsigned long' for the bitmas
futex: Add sys_futex_wake()
To complement sys_futex_waitv() add sys_futex_wake(). This syscall implements what was previously known as FUTEX_WAKE_BITSET except it uses 'unsigned long' for the bitmask and takes FUTEX2 flags.
The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
| #
1dfe3a5a |
| 21-Aug-2023 |
Mark Rutland <[email protected]> |
entry: Remove empty addr_limit_user_check()
Back when set_fs() was a generic API for altering the address limit, addr_limit_user_check() was a safety measure to prevent userspace being able to issue
entry: Remove empty addr_limit_user_check()
Back when set_fs() was a generic API for altering the address limit, addr_limit_user_check() was a safety measure to prevent userspace being able to issue syscalls with an unbound limit.
With the the removal of set_fs() as a generic API, the last user of addr_limit_user_check() was removed in commit:
b5a5a01d8e9a44ec ("arm64: uaccess: remove addr_limit_user_check()")
... as since that commit, no architecture defines TIF_FSCHECK, and hence addr_limit_user_check() always expands to nothing.
Remove addr_limit_user_check(), updating the comment in exit_to_user_mode_prepare() to no longer refer to it. At the same time, the comment is reworded to be a little more generic so as to cover kmap_assert_nomap() in addition to lockdep_sys_exit().
No functional change.
Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
show more ...
|
|
Revision tags: v6.5-rc1, v6.4, v6.4-rc7 |
|
| #
c35559f9 |
| 13-Jun-2023 |
Rick Edgecombe <[email protected]> |
x86/shstk: Introduce map_shadow_stack syscall
When operating with shadow stacks enabled, the kernel will automatically allocate shadow stacks for new threads, however in some cases userspace will ne
x86/shstk: Introduce map_shadow_stack syscall
When operating with shadow stacks enabled, the kernel will automatically allocate shadow stacks for new threads, however in some cases userspace will need additional shadow stacks. The main example of this is the ucontext family of functions, which require userspace allocating and pivoting to userspace managed stacks.
Unlike most other user memory permissions, shadow stacks need to be provisioned with special data in order to be useful. They need to be setup with a restore token so that userspace can pivot to them via the RSTORSSP instruction. But, the security design of shadow stacks is that they should not be written to except in limited circumstances. This presents a problem for userspace, as to how userspace can provision this special data, without allowing for the shadow stack to be generally writable.
Previously, a new PROT_SHADOW_STACK was attempted, which could be mprotect()ed from RW permissions after the data was provisioned. This was found to not be secure enough, as other threads could write to the shadow stack during the writable window.
The kernel can use a special instruction, WRUSS, to write directly to userspace shadow stacks. So the solution can be that memory can be mapped as shadow stack permissions from the beginning (never generally writable in userspace), and the kernel itself can write the restore token.
First, a new madvise() flag was explored, which could operate on the PROT_SHADOW_STACK memory. This had a couple of downsides: 1. Extra checks were needed in mprotect() to prevent writable memory from ever becoming PROT_SHADOW_STACK. 2. Extra checks/vma state were needed in the new madvise() to prevent restore tokens being written into the middle of pre-used shadow stacks. It is ideal to prevent restore tokens being added at arbitrary locations, so the check was to make sure the shadow stack had never been written to. 3. It stood out from the rest of the madvise flags, as more of direct action than a hint at future desired behavior.
So rather than repurpose two existing syscalls (mmap, madvise) that don't quite fit, just implement a new map_shadow_stack syscall to allow userspace to map and setup new shadow stacks in one step. While ucontext is the primary motivator, userspace may have other unforeseen reasons to setup its own shadow stacks using the WRSS instruction. Towards this provide a flag so that stacks can be optionally setup securely for the common case of ucontext without enabling WRSS. Or potentially have the kernel set up the shadow stack in some new way.
The following example demonstrates how to create a new shadow stack with map_shadow_stack: void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Kees Cook <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]> Tested-by: Pengfei Xu <[email protected]> Tested-by: John Allen <[email protected]> Tested-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/all/20230613001108.3040476-35-rick.p.edgecombe%40intel.com
show more ...
|
| #
09da082b |
| 11-Jul-2023 |
Alexey Gladkov <[email protected]> |
fs: Add fchmodat2()
On the userspace side fchmodat(3) is implemented as a wrapper function which implements the POSIX-specified interface. This interface differs from the underlying kernel system ca
fs: Add fchmodat2()
On the userspace side fchmodat(3) is implemented as a wrapper function which implements the POSIX-specified interface. This interface differs from the underlying kernel system call, which does not have a flags argument. Most implementations require procfs [1][2].
There doesn't appear to be a good userspace workaround for this issue but the implementation in the kernel is pretty straight-forward.
The new fchmodat2() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag, unlike existing fchmodat.
[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [2] https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28
Co-developed-by: Palmer Dabbelt <[email protected]> Signed-off-by: Palmer Dabbelt <[email protected]> Signed-off-by: Alexey Gladkov <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Message-Id: <f2a846ef495943c5d101011eebcf01179d0c7b61.1689092120.git.legion@kernel.org> [brauner: pre reviews, do flag conversion in do_fchmodat() directly] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
06a02139 |
| 11-Jul-2023 |
Palmer Dabbelt <[email protected]> |
Non-functional cleanup of a "__user * filename"
The next patch defines a very similar interface, which I copied from this definition. Since I'm touching it anyway I don't see any reason not to just
Non-functional cleanup of a "__user * filename"
The next patch defines a very similar interface, which I copied from this definition. Since I'm touching it anyway I don't see any reason not to just go fix this one up.
Signed-off-by: Palmer Dabbelt <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Message-Id: <ef644540cfd8717f30bcc5e4c32f06c80b6c156e.1689092120.git.legion@kernel.org> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
4dd595c3 |
| 21-Jun-2023 |
Sohil Mehta <[email protected]> |
syscalls: Remove file path comments from headers
Source file locations for syscall definitions can change over a period of time. File paths in comments get stale and are hard to maintain long term.
syscalls: Remove file path comments from headers
Source file locations for syscall definitions can change over a period of time. File paths in comments get stale and are hard to maintain long term. Also, their usefulness is questionable since it would be easier to locate a syscall definition using the SYSCALL_DEFINEx() macro.
Remove all source file path comments from the syscall headers. Also, equalize the uneven line spacing (some of which is introduced due to the deletions).
Signed-off-by: Sohil Mehta <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
show more ...
|