.. _addsyscalls:

Adding a New System Call
========================

This document describes what's involved in adding a new system call to the
Linux kernel, over and above the normal submission advice in
:ref:`Documentation/process/submitting-patches.rst <submittingpatches>`.


System Call Alternatives
------------------------

The first thing to consider when adding a new system call is whether one of
the alternatives might be suitable instead. Although system calls are the
most traditional and most obvious interaction points between userspace and the
kernel, there are other possibilities -- choose what fits best for your
interface.

 - If the operations involved can be made to look like a filesystem-like
   object, it may make more sense to create a new filesystem or device. This
   also makes it easier to encapsulate the new functionality in a kernel module
   rather than requiring it to be built into the main kernel.

     - If the new functionality involves operations where the kernel notifies
       userspace that something has happened, then returning a new file
       descriptor for the relevant object allows userspace to use
       ``poll``/``select``/``epoll`` to receive that notification.
     - However, operations that don't map to
       :manpage:`read(2)`/:manpage:`write(2)`-like operations
       have to be implemented as :manpage:`ioctl(2)` requests, which can lead
       to a somewhat opaque API.

 - If you're just exposing runtime system information, a new node in sysfs
   (see ``Documentation/filesystems/sysfs.rst``) or the ``/proc`` filesystem may
   be more appropriate. However, access to these mechanisms requires that the
   relevant filesystem is mounted, which might not always be the case (e.g.
   in a namespaced/sandboxed/chrooted environment). Avoid adding any API to
   debugfs, as this is not considered a 'production' interface to userspace.
 - If the operation is specific to a particular file or file descriptor, then
   an additional :manpage:`fcntl(2)` command option may be more appropriate.
   However, :manpage:`fcntl(2)` is a multiplexing system call that hides a lot
   of complexity, so this option is best for when the new function is closely
   analogous to existing :manpage:`fcntl(2)` functionality, or the new
   functionality is very simple (for example, getting/setting a simple flag
   related to a file descriptor).
 - If the operation is specific to a particular task or process, then an
   additional :manpage:`prctl(2)` command option may be more appropriate. As
   with :manpage:`fcntl(2)`, this system call is a complicated multiplexor so
   is best reserved for near-analogs of existing ``prctl()`` commands or
   getting/setting a simple flag related to a process.


Designing the API: Planning for Extension
-----------------------------------------

A new system call forms part of the API of the kernel, and has to be supported
indefinitely. As such, it's a very good idea to explicitly discuss the
interface on the kernel mailing list, and it's important to plan for future
extensions of the interface.

(The syscall table is littered with historical examples where this wasn't done,
together with the corresponding follow-up system calls --
``eventfd``/``eventfd2``, ``dup2``/``dup3``, ``inotify_init``/``inotify_init1``,
``pipe``/``pipe2``, ``renameat``/``renameat2`` -- so
learn from the history of the kernel and plan for extensions from the start.)

For simpler system calls that only take a couple of arguments, the preferred
way to allow for future extensibility is to include a flags argument to the
system call. To make sure that userspace programs can safely use flags
between kernel versions, check whether the flags value holds any unknown
flags, and reject the system call (with ``EINVAL``) if it does::

    if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
        return -EINVAL;

(If no flags values are used yet, check that the flags argument is zero.)

For more sophisticated system calls that involve a larger number of arguments,
it's preferred to encapsulate the majority of the arguments into a structure
that is passed in by pointer.
Such a structure can cope with future extension
by including a size argument in the structure::

    struct xyzzy_params {
        u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */
        u32 param_1;
        u64 param_2;
        u64 param_3;
    };

As long as any subsequently added field, say ``param_4``, is designed so that a
zero value gives the previous behaviour, then this allows both directions of
version mismatch:

 - To cope with a later userspace program calling an older kernel, the kernel
   code should check that any memory beyond the size of the structure that it
   expects is zero (effectively checking that ``param_4 == 0``).
 - To cope with an older userspace program calling a newer kernel, the kernel
   code can zero-extend a smaller instance of the structure (effectively
   setting ``param_4 = 0``).

See :manpage:`perf_event_open(2)` and the ``perf_copy_attr()`` function (in
``kernel/events/core.c``) for an example of this approach.
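
The two rules above can be sketched in ordinary userspace C. This is only an
illustration, not kernel code: plain memory stands in for ``__user`` memory,
``xyzzy_copy_params()`` and ``XYZZY_SIZE_VER0`` are hypothetical names, and the
choice of ``EINVAL`` and ``E2BIG`` as error codes is an assumption of this
sketch rather than something this document prescribes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The layout that this (hypothetical) kernel version knows about. */
struct xyzzy_params {
    uint32_t size;      /* caller sets this to its own sizeof() */
    uint32_t param_1;
    uint64_t param_2;
    uint64_t param_3;
};

/* Assumed size of the first published layout: size + param_1 only. */
#define XYZZY_SIZE_VER0 8u

/*
 * Copy a caller-supplied structure into *dst. The caller's idea of the
 * structure size is read from the leading size field; src must point to
 * at least that many readable bytes.
 * Returns 0, -EINVAL (too small) or -E2BIG (unknown, non-zero tail).
 */
static int xyzzy_copy_params(struct xyzzy_params *dst, const void *src)
{
    const unsigned char *bytes = src;
    size_t ksize = sizeof(*dst);
    uint32_t usize;
    size_t i;

    assert(dst != NULL && src != NULL);
    memcpy(&usize, src, sizeof(usize));
    if (usize < XYZZY_SIZE_VER0)
        return -EINVAL;

    /* Newer caller, older kernel: fields we do not know must be zero. */
    for (i = ksize; i < usize; i++)
        if (bytes[i] != 0)
            return -E2BIG;

    /* Older caller, newer kernel: zero-fill, then copy what was given. */
    memset(dst, 0, ksize);
    memcpy(dst, src, usize < ksize ? usize : ksize);
    return 0;
}
```

A real implementation would fetch the data with the usual userspace accessors;
``perf_copy_attr()`` remains the reference to read for the in-kernel version.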


Designing the API: Other Considerations
---------------------------------------

If your new system call allows userspace to refer to a kernel object, it
should use a file descriptor as the handle for that object -- don't invent a
new type of userspace object handle when the kernel already has mechanisms and
well-defined semantics for using file descriptors.

If your new :manpage:`xyzzy(2)` system call does return a new file descriptor,
then the flags argument should include a value that is equivalent to setting
``O_CLOEXEC`` on the new FD. This makes it possible for userspace to close
the timing window between ``xyzzy()`` and calling
``fcntl(fd, F_SETFD, FD_CLOEXEC)``, where an unexpected ``fork()`` and
``execve()`` in another thread could leak a descriptor to
the exec'ed program. (However, resist the temptation to re-use the actual value
of the ``O_CLOEXEC`` constant, as it is architecture-specific and is part of a
numbering space of ``O_*`` flags that is fairly full.)

If your system call returns a new file descriptor, you should also consider
what it means to use the :manpage:`poll(2)` family of system calls on that file
descriptor.
Making a file descriptor ready for reading or writing is the
normal way for the kernel to indicate to userspace that an event has
occurred on the corresponding kernel object.

If your new :manpage:`xyzzy(2)` system call involves a filename argument::

    int sys_xyzzy(const char __user *path, ..., unsigned int flags);

you should also consider whether an :manpage:`xyzzyat(2)` version is more appropriate::

    int sys_xyzzyat(int dfd, const char __user *path, ..., unsigned int flags);

This allows more flexibility for how userspace specifies the file in question;
in particular it allows userspace to request the functionality for an
already-opened file descriptor using the ``AT_EMPTY_PATH`` flag, effectively
giving an :manpage:`fxyzzy(3)` operation for free::

 - xyzzyat(AT_FDCWD, path, ..., 0) is equivalent to xyzzy(path,...)
 - xyzzyat(fd, "", ..., AT_EMPTY_PATH) is equivalent to fxyzzy(fd, ...)

(For more details on the rationale of the \*at() calls, see the
:manpage:`openat(2)` man page; for an example of AT_EMPTY_PATH, see the
:manpage:`fstatat(2)` man page.)
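
The two equivalences can be tried from userspace with an existing call.
The sketch below uses :manpage:`fstatat(2)` as a stand-in for the hypothetical
``xyzzyat()``; the helper names are made up for illustration, and the fallback
definition of ``AT_EMPTY_PATH`` only matters if the header did not expose it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes AT_EMPTY_PATH in <fcntl.h> */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000   /* Linux value, in case the header hid it */
#endif

/* Path form: with AT_FDCWD this behaves like the plain path-based call. */
static int stat_by_path(const char *path, struct stat *st)
{
    return fstatat(AT_FDCWD, path, st, 0);
}

/* FD form: empty path plus AT_EMPTY_PATH means "the file fd refers to". */
static int stat_by_fd(int fd, struct stat *st)
{
    return fstatat(fd, "", st, AT_EMPTY_PATH);
}

/* Returns 1 if both forms name the same file, 0 if not, -1 on error. */
static int same_file_both_ways(const char *path)
{
    struct stat by_path, by_fd;
    int fd, ret = -1;

    assert(path != NULL);
    /* O_CLOEXEC at open time avoids the fork/exec leak window. */
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    if (stat_by_path(path, &by_path) == 0 && stat_by_fd(fd, &by_fd) == 0)
        ret = by_path.st_dev == by_fd.st_dev &&
              by_path.st_ino == by_fd.st_ino;
    close(fd);
    return ret;
}
```

Calling ``same_file_both_ways("/")`` should report that both forms reach the
same inode, which is the property a new ``xyzzyat()`` would be expected to keep.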

If your new :manpage:`xyzzy(2)` system call involves a parameter describing an
offset within a file, make its type ``loff_t`` so that 64-bit offsets can be
supported even on 32-bit architectures.

If your new :manpage:`xyzzy(2)` system call involves privileged functionality,
it needs to be governed by the appropriate Linux capability bit (checked with
a call to ``capable()``), as described in the :manpage:`capabilities(7)` man
page. Choose an existing capability bit that governs related functionality,
but try to avoid combining lots of only vaguely related functions together
under the same bit, as this goes against capabilities' purpose of splitting
the power of root. In particular, avoid adding new uses of the already
overly-general ``CAP_SYS_ADMIN`` capability.

If your new :manpage:`xyzzy(2)` system call manipulates a process other than
the calling process, it should be restricted (using a call to
``ptrace_may_access()``) so that only a calling process with the same
permissions as the target process, or with the necessary capabilities, can
manipulate the target process.

Finally, be aware that some non-x86 architectures have an easier time if
system call parameters that are explicitly 64-bit fall on odd-numbered
arguments (i.e. parameter 1, 3, 5), to allow use of contiguous pairs of 32-bit
registers. (This concern does not apply if the arguments are part of a
structure that's passed in by pointer.)


Proposing the API
-----------------

To make new system calls easy to review, it's best to divide up the patchset
into separate chunks. These should include at least the following items as
distinct commits (each of which is described further below):

 - The core implementation of the system call, together with prototypes,
   generic numbering, Kconfig changes and fallback stub implementation.
 - Wiring up of the new system call for one particular architecture, usually
   x86 (including all of x86_64, x86_32 and x32).
 - A demonstration of the use of the new system call in userspace via a
   selftest in ``tools/testing/selftests/``.
 - A draft man-page for the new system call, either as plain text in the
   cover letter, or as a patch to the (separate) man-pages repository.

New system call proposals, like any change to the kernel's API, should always
be cc'ed to linux-api@vger.kernel.org.


Generic System Call Implementation
----------------------------------

The main entry point for your new :manpage:`xyzzy(2)` system call will be called
``sys_xyzzy()``, but you add this entry point with the appropriate
``SYSCALL_DEFINEn()`` macro rather than explicitly. The 'n' indicates the
number of arguments to the system call, and the macro takes the system call name
followed by the (type, name) pairs for the parameters as arguments. Using
this macro allows metadata about the new system call to be made available for
other tools.

The new entry point also needs a corresponding function prototype, in
``include/linux/syscalls.h``, marked as asmlinkage to match the way that system
calls are invoked::

    asmlinkage long sys_xyzzy(...);

Some architectures (e.g. x86) have their own architecture-specific syscall
tables, but several other architectures share a generic syscall table.
Add your
new system call to the generic list by adding an entry to the list in
``include/uapi/asm-generic/unistd.h``::

    #define __NR_xyzzy 292
    __SYSCALL(__NR_xyzzy, sys_xyzzy)

Also update the ``__NR_syscalls`` count to reflect the additional system call, and
note that if multiple new system calls are added in the same merge window,
your new syscall number may get adjusted to resolve conflicts.

The file ``kernel/sys_ni.c`` provides a fallback stub implementation of each
system call, returning ``-ENOSYS``. Add your new system call here too::

    COND_SYSCALL(xyzzy);

Your new kernel functionality, and the system call that controls it, should
normally be optional, so add a ``CONFIG`` option (typically to
``init/Kconfig``) for it. As usual for new ``CONFIG`` options:

 - Include a description of the new functionality and system call controlled
   by the option.
 - Make the option depend on EXPERT if it should be hidden from normal users.
 - Make any new source files implementing the function dependent on the CONFIG
   option in the Makefile (e.g. ``obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.o``).
 - Double check that the kernel still builds with the new CONFIG option turned
   off.

To summarize, you need a commit that includes:

 - ``CONFIG`` option for the new function, normally in ``init/Kconfig``
 - ``SYSCALL_DEFINEn(xyzzy, ...)`` for the entry point
 - corresponding prototype in ``include/linux/syscalls.h``
 - generic table entry in ``include/uapi/asm-generic/unistd.h``
 - fallback stub in ``kernel/sys_ni.c``


x86 System Call Implementation
------------------------------

To wire up your new system call for x86 platforms, you need to update the
master syscall tables. Assuming your new system call isn't special in some
way (see below), this involves a "common" entry (for x86_64 and x32) in
``arch/x86/entry/syscalls/syscall_64.tbl``::

    333   common   xyzzy     sys_xyzzy

and an "i386" entry in ``arch/x86/entry/syscalls/syscall_32.tbl``::

    380   i386     xyzzy     sys_xyzzy

Again, these numbers are liable to be changed if there are conflicts in the
relevant merge window.
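
Once a table entry exists, userspace reaches the call purely by its number,
since no libc wrapper will be available yet. A minimal sketch of that pattern
follows; the ``xyzzy()`` wrapper is hypothetical and is compiled only if the
installed headers happen to define ``__NR_xyzzy``, so the same pattern is
shown with ``getpid``, which exists everywhere.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for the syscall() declaration */
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifdef __NR_xyzzy
/* Hypothetical wrapper: the number comes from the installed kernel headers. */
static long xyzzy(const char *path, unsigned int flags)
{
    return syscall(__NR_xyzzy, path, flags);
}
#endif

/* Same pattern with a call that is always present, so the sketch can run. */
static long raw_getpid(void)
{
    long pid = syscall(SYS_getpid);

    assert(pid > 0);           /* getpid cannot fail */
    return pid;
}
```

Because the numbers may move before the merge, it is safer for test code to
take ``__NR_xyzzy`` from the headers than to hard-code a value such as 333.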


Compatibility System Calls (Generic)
------------------------------------

For most system calls the same 64-bit implementation can be invoked even when
the userspace program is itself 32-bit; even if the system call's parameters
include an explicit pointer, this is handled transparently.

However, there are a couple of situations where a compatibility layer is
needed to cope with size differences between 32-bit and 64-bit.

The first is if the 64-bit kernel also supports 32-bit userspace programs, and
so needs to parse areas of (``__user``) memory that could hold either 32-bit or
64-bit values. In particular, this is needed whenever a system call argument
is:

 - a pointer to a pointer
 - a pointer to a struct containing a pointer (e.g. ``struct iovec __user *``)
 - a pointer to a varying sized integral type (``time_t``, ``off_t``,
   ``long``, ...)
 - a pointer to a struct containing a varying sized integral type.

The second situation that requires a compatibility layer is if one of the
system call's arguments has a type that is explicitly 64-bit even on a 32-bit
architecture, for example ``loff_t`` or ``__u64``.
In this case, a value that
arrives at a 64-bit kernel from a 32-bit application will be split into two
32-bit values, which then need to be re-assembled in the compatibility layer.

(Note that a system call argument that's a pointer to an explicit 64-bit type
does **not** need a compatibility layer; for example, :manpage:`splice(2)`'s arguments of
type ``loff_t __user *`` do not trigger the need for a ``compat_`` system call.)

The compatibility version of the system call is called ``compat_sys_xyzzy()``,
and is added with the ``COMPAT_SYSCALL_DEFINEn()`` macro, analogously to
``SYSCALL_DEFINEn()``. This version of the implementation runs as part of a 64-bit
kernel, but expects to receive 32-bit parameter values and does whatever is
needed to deal with them. (Typically, the ``compat_sys_`` version converts the
values to 64-bit versions and either calls on to the ``sys_`` version, or both of
them call a common inner implementation function.)
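
The re-assembly of a split 64-bit argument can be sketched in plain C. All
names here (``join_halves()``, ``do_xyzzy()``, ``compat_xyzzy()``) are
hypothetical, ordinary integers stand in for register values, and which half
arrives in which argument depends on the architecture, so the low-then-high
order below is just an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rebuild one 64-bit value from the two 32-bit halves that were passed. */
static uint64_t join_halves(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* What the 32-bit caller's ABI effectively does before the call. */
static void split_value(uint64_t v, uint32_t *lo, uint32_t *hi)
{
    assert(lo != NULL && hi != NULL);
    *lo = (uint32_t)v;
    *hi = (uint32_t)(v >> 32);
}

/* Hypothetical inner implementation shared by both entry points. */
static int64_t do_xyzzy(int fd, int64_t offset)
{
    (void)fd;
    return offset;             /* echo the offset so the sketch is checkable */
}

/* Shape of a compat entry point: rebuild the offset, then share the logic. */
static int64_t compat_xyzzy(int fd, uint32_t off_lo, uint32_t off_hi)
{
    return do_xyzzy(fd, (int64_t)join_halves(off_lo, off_hi));
}
```

Keeping the real work in one inner function means the native and compat entry
points cannot drift apart in behaviour.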

The compat entry point also needs a corresponding function prototype, in
``include/linux/compat.h``, marked as asmlinkage to match the way that system
calls are invoked::

    asmlinkage long compat_sys_xyzzy(...);

If the system call involves a structure that is laid out differently on 32-bit
and 64-bit systems, say ``struct xyzzy_args``, then the ``include/linux/compat.h``
header file should also include a compat version of the structure (``struct
compat_xyzzy_args``) where each variable-size field has the appropriate
``compat_`` type that corresponds to the type in ``struct xyzzy_args``. The
``compat_sys_xyzzy()`` routine can then use this ``compat_`` structure to
parse the arguments from a 32-bit invocation.

For example, if there are fields::

    struct xyzzy_args {
        const char __user *ptr;
        __kernel_long_t varying_val;
        u64 fixed_val;
        /* ... */
    };

in ``struct xyzzy_args``, then ``struct compat_xyzzy_args`` would have::

    struct compat_xyzzy_args {
        compat_uptr_t ptr;
        compat_long_t varying_val;
        u64 fixed_val;
        /* ... */
    };

The generic system call list also needs adjusting to allow for the compat
version; the entry in ``include/uapi/asm-generic/unistd.h`` should use
``__SC_COMP`` rather than ``__SYSCALL``::

    #define __NR_xyzzy 292
    __SC_COMP(__NR_xyzzy, sys_xyzzy, compat_sys_xyzzy)

To summarize, you need:

 - a ``COMPAT_SYSCALL_DEFINEn(xyzzy, ...)`` for the compat entry point
 - corresponding prototype in ``include/linux/compat.h``
 - (if needed) 32-bit mapping struct in ``include/linux/compat.h``
 - instance of ``__SC_COMP`` not ``__SYSCALL`` in
   ``include/uapi/asm-generic/unistd.h``


Compatibility System Calls (x86)
--------------------------------

To wire up the x86 architecture of a system call with a compatibility version,
the entries in the syscall tables need
to be adjusted.

First, the entry in ``arch/x86/entry/syscalls/syscall_32.tbl`` gets an extra
column to indicate that a 32-bit userspace program running on a 64-bit kernel
should hit the compat entry point::

    380   i386     xyzzy     sys_xyzzy    __ia32_compat_sys_xyzzy

Second, you need to figure out what should happen for the x32 ABI version of
the new system call. There's a choice here: the layout of the arguments
should either match the 64-bit version or the 32-bit version.

If there's a pointer-to-a-pointer involved, the decision is easy: x32 is
ILP32, so the layout should match the 32-bit version, and the entry in
``arch/x86/entry/syscalls/syscall_64.tbl`` is split so that x32 programs hit
the compatibility wrapper::

    333   64       xyzzy     sys_xyzzy
    ...
    555   x32      xyzzy     __x32_compat_sys_xyzzy

If no pointers are involved, then it is preferable to re-use the 64-bit system
call for the x32 ABI (and consequently the entry in
``arch/x86/entry/syscalls/syscall_64.tbl`` is unchanged).
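
One way to see whether two ABIs can share an argument layout is to compare
sizes and field offsets directly. The sketch below models the 64-bit and
32-bit views of the example structure from the generic compat section using
fixed-width stand-ins; the struct and function names are hypothetical, and the
numbers in the comments assume a typical x86 build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit (LP64) view: pointer and long are both 8 bytes wide. */
struct xyzzy_args_lp64 {
    uint64_t ptr;          /* stands in for const char __user * */
    int64_t  varying_val;  /* stands in for __kernel_long_t     */
    uint64_t fixed_val;
};

/* 32-bit (ILP32) view: pointer and long shrink to 4 bytes each. */
struct xyzzy_args_ilp32 {
    uint32_t ptr;          /* stands in for compat_uptr_t */
    int32_t  varying_val;  /* stands in for compat_long_t */
    uint64_t fixed_val;
};

/* Returns 1 only if size and every field offset agree between the views. */
static int layouts_match(void)
{
    return sizeof(struct xyzzy_args_lp64) == sizeof(struct xyzzy_args_ilp32) &&
           offsetof(struct xyzzy_args_lp64, varying_val) ==
           offsetof(struct xyzzy_args_ilp32, varying_val) &&
           offsetof(struct xyzzy_args_lp64, fixed_val) ==
           offsetof(struct xyzzy_args_ilp32, fixed_val);
}
```

Here the two views disagree (24 bytes against 16, with ``fixed_val`` at offset
16 against 8), which is why a structure like this needs the compat path.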

In either case, you should check that the types involved in your argument
layout do indeed map exactly from x32 (-mx32) to either the 32-bit (-m32) or
64-bit (-m64) equivalents.


System Calls Returning Elsewhere
--------------------------------

For most system calls, once the system call is complete the user program
continues exactly where it left off -- at the next instruction, with the
stack the same and most of the registers the same as before the system call,
and with the same virtual memory space.

However, a few system calls do things differently. They might return to a
different location (``rt_sigreturn``) or change the memory space
(``fork``/``vfork``/``clone``) or even architecture (``execve``/``execveat``)
of the program.

To allow for this, the kernel implementation of the system call may need to
save and restore additional registers to the kernel stack, allowing complete
control of where and how execution continues after the system call.

This is arch-specific, but typically involves defining assembly entry points
that save/restore additional registers and invoke the real system call entry
point.

For x86_64, this is implemented as a ``stub_xyzzy`` entry point in
``arch/x86/entry/entry_64.S``, and the entry in the syscall table
(``arch/x86/entry/syscalls/syscall_64.tbl``) is adjusted to match::

    333   common   xyzzy     stub_xyzzy

The equivalent for 32-bit programs running on a 64-bit kernel is normally
called ``stub32_xyzzy`` and implemented in ``arch/x86/entry/entry_64_compat.S``,
with the corresponding syscall table adjustment in
``arch/x86/entry/syscalls/syscall_32.tbl``::

    380   i386     xyzzy     sys_xyzzy    stub32_xyzzy

If the system call needs a compatibility layer (as in the previous section)
then the ``stub32_`` version needs to call on to the ``compat_sys_`` version
of the system call rather than the native 64-bit version. Also, if the x32 ABI
implementation is not common with the x86_64 version, then its syscall
table will also need to invoke a stub that calls on to the ``compat_sys_``
version.

For completeness, it's also nice to set up a mapping so that user-mode Linux
still works -- its syscall table will reference ``stub_xyzzy``, but the UML build
doesn't include the ``arch/x86/entry/entry_64.S`` implementation (because UML
simulates registers etc).
Fixing this is as simple as adding a #define to
``arch/x86/um/sys_call_table_64.c``::

    #define stub_xyzzy sys_xyzzy


Other Details
-------------

Most of the kernel treats system calls in a generic way, but there is the
occasional exception that may need updating for your particular system call.

The audit subsystem is one such special case; it includes (arch-specific)
functions that classify some special types of system call -- specifically
file open (``open``/``openat``), program execution (``execve``/``execveat``) or
socket multiplexor (``socketcall``) operations. If your new system call is
analogous to one of these, then the audit system should be updated.

More generally, if there is an existing system call that is analogous to your
new system call, it's worth doing a kernel-wide grep for the existing system
call to check there are no other special cases.


Testing
-------

A new system call should obviously be tested; it is also useful to provide
reviewers with a demonstration of how user space programs will use the system
call.
A good way to combine these aims is to include a simple self-test
program in a new directory under ``tools/testing/selftests/``.

For a new system call, there will obviously be no libc wrapper function and so
the test will need to invoke it using ``syscall()``; also, if the system call
involves a new userspace-visible structure, the corresponding header will need
to be installed to compile the test.

Make sure the selftest runs successfully on all supported architectures.  For
example, check that it works when compiled as an x86_64 (-m64), x86_32 (-m32)
and x32 (-mx32) ABI program.

For more extensive and thorough testing of new functionality, you should also
consider adding tests to the Linux Test Project, or to the xfstests project
for filesystem-related changes.

 - https://linux-test-project.github.io/
 - git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git


Man Page
--------

All new system calls should come with a complete man page, ideally using groff
markup, but plain text will do.
If groff is used, it's helpful to include a
pre-rendered ASCII version of the man page in the cover email for the
patchset, for the convenience of reviewers.

The man page should be cc'ed to [email protected]
For more details, see https://www.kernel.org/doc/man-pages/patches.html


Do not call System Calls in the Kernel
--------------------------------------

System calls are, as stated above, interaction points between userspace and
the kernel.  Therefore, system call functions such as ``sys_xyzzy()`` or
``compat_sys_xyzzy()`` should only be called from userspace via the syscall
table, but not from elsewhere in the kernel.  If the syscall functionality is
useful within the kernel, needs to be shared between an old and a
new syscall, or needs to be shared between a syscall and its compatibility
variant, it should be implemented by means of a "helper" function (such as
``ksys_xyzzy()``).  This kernel function may then be called within the
syscall stub (``sys_xyzzy()``), the compatibility syscall stub
(``compat_sys_xyzzy()``), and/or other kernel code.

At least on 64-bit x86, it will be a hard requirement from v4.17 onwards to not
call system call functions in the kernel.
64-bit x86 uses a different calling
convention for system calls where ``struct pt_regs`` is decoded on-the-fly in a
syscall wrapper which then hands processing over to the actual syscall function.
This means that only those parameters which are actually needed for a specific
syscall are passed on during syscall entry, instead of filling in six CPU
registers with random user space content all the time (which may cause serious
trouble down the call chain).

Moreover, rules on how data may be accessed may differ between kernel data and
user data.  This is another reason why calling ``sys_xyzzy()`` is generally a
bad idea.

Exceptions to this rule are only allowed in architecture-specific overrides,
architecture-specific compatibility wrappers, or other code in ``arch/``.
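
As a rough illustration of the helper pattern described in this section, the
following userspace C model keeps all of the logic in ``ksys_xyzzy()`` and
leaves ``sys_xyzzy()`` and ``compat_sys_xyzzy()`` as thin wrappers.  It is a
sketch, not kernel code: the argument structures, the flag mask and the return
value are hypothetical, and in a real kernel the two entry points would be
generated with ``SYSCALL_DEFINEn()`` and ``COMPAT_SYSCALL_DEFINEn()`` rather
than written as ordinary functions.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical native argument layout for the placeholder xyzzy() call. */
struct xyzzy_args {
	uint64_t value;
	uint32_t flags;
};

/* Hypothetical 32-bit layout: the 64-bit value arrives as two halves. */
struct compat_xyzzy_args {
	uint32_t value_lo;
	uint32_t value_hi;
	uint32_t flags;
};

/* Hypothetical mask of the flag bits this call understands. */
#define XYZZY_VALID_FLAGS 0x1u

/*
 * Shared helper: validation and the actual work live here, so every
 * caller (native stub, compat stub, other kernel code) behaves the same.
 */
long ksys_xyzzy(const struct xyzzy_args *args)
{
	if (args->flags & ~XYZZY_VALID_FLAGS)
		return -EINVAL;		/* reject unknown flags */
	return (long)(args->value & 0xffff);	/* stand-in for real work */
}

/* Native entry point: nothing but a call into the helper. */
long sys_xyzzy(const struct xyzzy_args *args)
{
	return ksys_xyzzy(args);
}

/* Compat entry point: convert the 32-bit layout, then reuse the helper. */
long compat_sys_xyzzy(const struct compat_xyzzy_args *cargs)
{
	struct xyzzy_args args = {
		.value = ((uint64_t)cargs->value_hi << 32) | cargs->value_lo,
		.flags = cargs->flags,
	};

	return ksys_xyzzy(&args);
}
```

In this model, any other in-kernel user would call ``ksys_xyzzy()`` directly
instead of either entry point, so the argument checks exist in exactly one
place.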


References and Sources
----------------------

 - LWN article from Michael Kerrisk on use of flags argument in system calls:
   https://lwn.net/Articles/585415/
 - LWN article from Michael Kerrisk on how to handle unknown flags in a system
   call: https://lwn.net/Articles/588444/
 - LWN article from Jake Edge describing constraints on 64-bit system call
   arguments: https://lwn.net/Articles/311630/
 - Pair of LWN articles from David Drysdale that describe the system call
   implementation paths in detail for v3.14:

    - https://lwn.net/Articles/604287/
    - https://lwn.net/Articles/604515/

 - Architecture-specific requirements for system calls are discussed in the
   :manpage:`syscall(2)` man-page:
   http://man7.org/linux/man-pages/man2/syscall.2.html#NOTES
 - Collated emails from Linus Torvalds discussing the problems with ``ioctl()``:
   https://yarchive.net/comp/linux/ioctl.html
 - "How to not invent kernel interfaces", Arnd Bergmann,
   https://www.ukuug.org/events/linux2007/2007/papers/Bergmann.pdf
 - LWN article from Michael Kerrisk on avoiding new uses of CAP_SYS_ADMIN:
   https://lwn.net/Articles/486306/
 - Recommendation from Andrew Morton that all related information for a new
   system call should come in the same email thread:
   https://lore.kernel.org/r/[email protected]
 - Recommendation from Michael Kerrisk that a new system call should come with
   a man page: https://lore.kernel.org/r/CAKgNAkgMA39AfoSoA5Pe1r9N+ZzfYQNvNPvcRN7tOvRb8+v06Q@mail.gmail.com
 - Suggestion from Thomas Gleixner that x86 wire-up should be in a separate
   commit: https://lore.kernel.org/r/alpine.DEB.2.11.1411191249560.3909@nanos
 - Suggestion from Greg Kroah-Hartman that it's good for new system calls to
   come with a man-page & selftest: https://lore.kernel.org/r/[email protected]
 - Discussion from Michael Kerrisk of new system call vs.
   :manpage:`prctl(2)` extension:
   https://lore.kernel.org/r/CAHO5Pa3F2MjfTtfNxa8LbnkeeU8=YJ+9tDqxZpw7Gz59E-4AUg@mail.gmail.com
 - Suggestion from Ingo Molnar that system calls that involve multiple
   arguments should encapsulate those arguments in a struct, which includes a
   size field for future extensibility: https://lore.kernel.org/r/[email protected]
 - Numbering oddities arising from (re-)use of O_* numbering space flags:

    - commit 75069f2b5bfb ("vfs: renumber FMODE_NONOTIFY and add to uniqueness
      check")
    - commit 12ed2e36c98a ("fanotify: FMODE_NONOTIFY and __O_SYNC in sparc
      conflict")
    - commit bb458c644a59 ("Safer ABI for O_TMPFILE")

 - Discussion from Matthew Wilcox about restrictions on 64-bit arguments:
   https://lore.kernel.org/r/[email protected]
 - Recommendation from Greg Kroah-Hartman that unknown flags should be
   policed: https://lore.kernel.org/r/[email protected]
 - Recommendation from Linus Torvalds that x32 system calls should prefer
   compatibility with 64-bit versions rather than 32-bit versions:
   https://lore.kernel.org/r/CA+55aFxfmwfB7jbbrXxa=K7VBYPfAvmu3XOkGrLbB1UFjX1+Ew@mail.gmail.com