==============================
General notification mechanism
==============================

The general notification mechanism is built on top of the standard pipe driver
whereby it effectively splices notification messages from the kernel into pipes
opened by userspace.  This can be used in conjunction with:

  * Key/keyring notifications


The notification buffers can be enabled by:

	"General setup"/"General notification queue"
	(CONFIG_WATCH_QUEUE)

This document has the following sections:

.. contents:: :local:


Overview
========

This facility appears as a pipe that is opened in a special mode.  The pipe's
internal ring buffer is used to hold messages that are generated by the kernel.
These messages are then read out by read().  Splice and similar are disabled on
such pipes because, under some circumstances, they want to revert their
additions to the ring, and those additions might by then be interleaved with
notification messages.

The owner of the pipe has to tell the kernel which sources it would like to
watch through that pipe.  Only sources that have been connected to a pipe will
insert messages into it.  Note that a source may be bound to multiple pipes and
insert messages into all of them simultaneously.

Filters may also be emplaced on a pipe so that certain source types and
subevents can be ignored if they're not of interest.

A message will be discarded if there isn't a slot available in the ring or if
no preallocated message buffer is available.  In both of these cases, read()
will insert a WATCH_META_LOSS_NOTIFICATION message into the output buffer after
the last message currently in the buffer has been read.

Note that when producing a notification, the kernel does not wait for the
consumers to collect it, but rather just continues on.  This means that
notifications can be generated whilst spinlocks are held and also protects the
kernel from being held up indefinitely by a userspace malfunction.


Message Structure
=================

Notification messages begin with a short header::

	struct watch_notification {
		__u32	type:24;
		__u32	subtype:8;
		__u32	info;
	};

"type" indicates the source of the notification record and "subtype" indicates
the type of record from that source (see the Watch Sources section below).  The
type may also be "WATCH_TYPE_META".  This is a special record type generated
internally by the watch queue itself.  There are two subtypes:

  * WATCH_META_REMOVAL_NOTIFICATION
  * WATCH_META_LOSS_NOTIFICATION

The first indicates that an object on which a watch was installed was removed
or destroyed and the second indicates that some messages have been lost.

"info" encodes several things, including:

  * The length of the message in bytes, including the header (mask with
    WATCH_INFO_LENGTH and shift by WATCH_INFO_LENGTH__SHIFT).  This indicates
    the size of the record, which may be between 8 and 127 bytes.

  * The watch ID (mask with WATCH_INFO_ID and shift by WATCH_INFO_ID__SHIFT).
    This indicates the caller's ID of the watch, which may be between 0
    and 255.  Multiple watches may share a queue, and this provides a means to
    distinguish them.

  * A type-specific field (WATCH_INFO_TYPE_INFO).  This is set by the
    notification producer to indicate some meaning specific to the type and
    subtype.

Everything in info apart from the length can be used for filtering.

The header can be followed by supplementary information.  The format of this is
defined by the type and subtype.

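As an illustration of the masking and shifting described above, the following
minimal sketch decodes the three parts of ``info``.  The mask and shift values
are written out locally so that the fragment stands alone; they are assumed to
mirror include/uapi/linux/watch_queue.h, which real code should include
instead.  The ``ex_`` helper names are hypothetical and not part of any API.

```c
#include <stdint.h>

/*
 * Assumed to mirror <linux/watch_queue.h>; real code should include that
 * header rather than restating the values.
 */
#ifndef WATCH_INFO_LENGTH
#define WATCH_INFO_LENGTH		0x0000007f
#define WATCH_INFO_LENGTH__SHIFT	0
#define WATCH_INFO_ID			0x0000ff00
#define WATCH_INFO_ID__SHIFT		8
#define WATCH_INFO_TYPE_INFO		0xffff0000
#define WATCH_INFO_TYPE_INFO__SHIFT	16
#endif

/* Length of the whole record in bytes, header included. */
static inline unsigned int ex_info_length(uint32_t info)
{
	return (info & WATCH_INFO_LENGTH) >> WATCH_INFO_LENGTH__SHIFT;
}

/* Caller-chosen watch ID (0 to 255). */
static inline unsigned int ex_info_id(uint32_t info)
{
	return (info & WATCH_INFO_ID) >> WATCH_INFO_ID__SHIFT;
}

/* Type-specific bits set by the notification producer. */
static inline unsigned int ex_info_type_info(uint32_t info)
{
	return (info & WATCH_INFO_TYPE_INFO) >> WATCH_INFO_TYPE_INFO__SHIFT;
}
```

A consumer would typically call ``ex_info_length()`` first to find the record
boundary and only then look at the ID and type-specific bits.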

Watch List (Notification Source) API
====================================

A "watch list" is a list of watchers that are subscribed to a source of
notifications.  A list may be attached to an object (say a key or a superblock)
or may be global (say for device events).  From a userspace perspective, a
non-global watch list is typically referred to by reference to the object it
belongs to (such as using KEYCTL_WATCH_KEY and giving it a key serial number to
watch that specific key).

To manage a watch list, the following functions are provided:

  * ::

	void init_watch_list(struct watch_list *wlist,
			     void (*release_watch)(struct watch *watch));

    Initialise a watch list.  If ``release_watch`` is not NULL, then this
    indicates a function that should be called when the watch_list object is
    destroyed to discard any references the watch list holds on the watched
    object.

  * ``void remove_watch_list(struct watch_list *wlist);``

    This removes all of the watches subscribed to a watch_list and frees them
    and then destroys the watch_list object itself.


Watch Queue (Notification Output) API
=====================================

A "watch queue" is the buffer allocated by an application that notification
records will be written into.  The workings of this are hidden entirely inside
of the pipe device driver, but it is necessary to gain a reference to it to set
a watch.  These can be managed with:

  * ``struct watch_queue *get_watch_queue(int fd);``

    Since watch queues are indicated to the kernel by the fd of the pipe that
    implements the buffer, userspace must hand that fd through a system call.
    This can be used to look up an opaque pointer to the watch queue from the
    system call.

  * ``void put_watch_queue(struct watch_queue *wqueue);``

    This discards the reference obtained from ``get_watch_queue()``.


Watch Subscription API
======================

A "watch" is a subscription on a watch list, indicating the watch queue, and
thus the buffer, into which notification records should be written.  The watch
queue object may also carry filtering rules for that object, as set by
userspace.  Some parts of the watch struct can be set by the driver::

	struct watch {
		union {
			u32		info_id;	/* ID to be OR'd in to info field */
			...
		};
		void			*private;	/* Private data for the watched object */
		u64			id;		/* Internal identifier */
		...
	};

The ``info_id`` value should be an 8-bit number obtained from userspace and
shifted by WATCH_INFO_ID__SHIFT.  This is OR'd into the WATCH_INFO_ID field of
struct watch_notification::info when and if the notification is written into
the associated watch queue buffer.

The ``private`` field is the driver's data associated with the watch_list and
is cleaned up by the ``watch_list::release_watch()`` method.

The ``id`` field is the source's ID.  Notifications that are posted with a
different ID are ignored.

The following functions are provided to manage watches:

  * ``void init_watch(struct watch *watch, struct watch_queue *wqueue);``

    Initialise a watch object, setting its pointer to the watch queue, using
    appropriate barriering to avoid lockdep complaints.

  * ``int add_watch_to_object(struct watch *watch, struct watch_list *wlist);``

    Subscribe a watch to a watch list (notification source).  The
    driver-settable fields in the watch struct must have been set before this
    is called.

  * ::

	int remove_watch_from_object(struct watch_list *wlist,
				     struct watch_queue *wqueue,
				     u64 id, false);

    Remove a watch from a watch list, where the watch must match the specified
    watch queue (``wqueue``) and object identifier (``id``).  A notification
    (``WATCH_META_REMOVAL_NOTIFICATION``) is sent to the watch queue to
    indicate that the watch got removed.

  * ``int remove_watch_from_object(struct watch_list *wlist, NULL, 0, true);``

    Remove all the watches from a watch list.  It is expected that this will be
    called preparatory to destruction and that the watch list will be
    inaccessible to new watches by this point.  A notification
    (``WATCH_META_REMOVAL_NOTIFICATION``) is sent to the watch queue of each
    subscribed watch to indicate that the watch got removed.


Notification Posting API
========================

To post a notification to a watch list so that the subscribed watches can see
it, the following function should be used::

	void post_watch_notification(struct watch_list *wlist,
				     struct watch_notification *n,
				     const struct cred *cred,
				     u64 id);

The notification should be preformatted and a pointer to the header (``n``)
should be passed in.  The notification may be larger than this and the size in
bytes is noted in ``n->info & WATCH_INFO_LENGTH``.

The ``cred`` struct indicates the credentials of the source (subject) and is
passed to the LSMs, such as SELinux, to allow or suppress the recording of the
note in each individual queue according to the credentials of that queue
(object).

The ``id`` is the ID of the source object (such as the serial number on a key).
Only watches that have the same ID set in them will see this notification.


Watch Sources
=============

Any particular buffer can be fed from multiple sources.  Sources include:

  * WATCH_TYPE_KEY_NOTIFY

    Notifications of this type indicate changes to keys and keyrings, including
    the changes of keyring contents or the attributes of keys.

    See Documentation/security/keys/core.rst for more information.


Event Filtering
===============

Once a watch queue has been created, a set of filters can be applied to limit
the events that are received using::

	struct watch_notification_filter filter = {
		...
	};
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);

The filter description is a variable of type::

	struct watch_notification_filter {
		__u32	nr_filters;
		__u32	__reserved;
		struct watch_notification_type_filter filters[];
	};

Where "nr_filters" is the number of filters in filters[] and "__reserved"
should be 0.  The "filters" array has elements of the following type::

	struct watch_notification_type_filter {
		__u32	type;
		__u32	info_filter;
		__u32	info_mask;
		__u32	subtype_filter[8];
	};

Where:

  * ``type`` is the event type to filter for and should be something like
    "WATCH_TYPE_KEY_NOTIFY".

  * ``info_filter`` and ``info_mask`` act as a filter on the info field of the
    notification record.  The notification is only written into the buffer if::

	(watch.info & info_mask) == info_filter

    This could be used, for example, to ignore events that are not exactly on
    the watched point in a mount tree.

  * ``subtype_filter`` is a bitmask indicating the subtypes that are of
    interest.  Bit 0 of subtype_filter[0] corresponds to subtype 0, bit 1 to
    subtype 1, and so on.

If the argument to the ioctl() is NULL, then the filters will be removed and
all events from the watched sources will come through.

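The matching rules above can be modelled in userspace as a quick sanity check
before a filter is handed to the kernel.  The sketch below restates the
per-type filter layout locally under the hypothetical name
``struct ex_type_filter``, and the hypothetical helpers
``ex_filter_add_subtype()`` and ``ex_filter_matches()`` apply the type,
subtype-bitmask and info tests as this section describes them.  It assumes the
bitmask simply continues into the following array elements (subtype 32 is bit
0 of subtype_filter[1]).  It illustrates the documented rules and is not the
kernel's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in with the same fields as watch_notification_type_filter. */
struct ex_type_filter {
	uint32_t type;
	uint32_t info_filter;
	uint32_t info_mask;
	uint32_t subtype_filter[8];	/* 256 bits, one per subtype */
};

/* Mark a subtype as being of interest. */
static inline void ex_filter_add_subtype(struct ex_type_filter *f,
					 uint8_t subtype)
{
	f->subtype_filter[subtype / 32] |= UINT32_C(1) << (subtype % 32);
}

/* Would a record with this type, subtype and info pass this filter? */
static inline bool ex_filter_matches(const struct ex_type_filter *f,
				     uint32_t type, uint8_t subtype,
				     uint32_t info)
{
	if (type != f->type)
		return false;
	if (!(f->subtype_filter[subtype / 32] &
	      (UINT32_C(1) << (subtype % 32))))
		return false;
	return (info & f->info_mask) == f->info_filter;
}
```

For instance, a filter with ``info_mask`` set to the watch ID bits and
``info_filter`` set to one shifted ID would, under this model, pass only the
records that carry that watch ID.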

Userspace Code Example
======================

A buffer is created with something like the following::

	pipe2(fds, O_NOTIFICATION_PIPE);
	ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, 256);

It can then be set to receive keyring change notifications::

	keyctl(KEYCTL_WATCH_KEY, KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);

The notifications can then be consumed by something like the following::

	static void consumer(int rfd)
	{
		unsigned char buffer[128];
		ssize_t buf_len;

		while ((buf_len = read(rfd, buffer, sizeof(buffer))) > 0) {
			unsigned char *p = buffer;
			unsigned char *end = buffer + buf_len;

			while (p < end) {
				union {
					struct watch_notification n;
					unsigned char buf1[128];
				} n;
				size_t largest, len;

				/* Copy out so that the header is aligned. */
				largest = end - p;
				if (largest > 128)
					largest = 128;
				if (largest < sizeof(n.n))
					return;
				memcpy(&n, p, largest);

				len = (n.n.info & WATCH_INFO_LENGTH) >>
					WATCH_INFO_LENGTH__SHIFT;
				/* Stop on a malformed or truncated record. */
				if (len == 0 || len > largest)
					return;

				switch (n.n.type) {
				case WATCH_TYPE_META:
					got_meta(&n.n);
					break;
				case WATCH_TYPE_KEY_NOTIFY:
					saw_key_change(&n.n);
					break;
				}

				p += len;
			}
		}
	}

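The consumer example calls ``got_meta()`` without defining it.  A minimal
sketch of such a handler, assuming it only needs to tell the two meta subtypes
apart and report them, might look like the following.  The header struct and
constants are restated locally under hypothetical ``ex_``/``EX_`` names so
that the fragment is self-contained; the numeric values are assumptions taken
to mirror include/uapi/linux/watch_queue.h, and real code should use that
header and the real names.  It also assumes that a removal record carries the
watch ID in ``info`` like any other record.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed values; real code takes these from <linux/watch_queue.h>. */
enum {
	EX_META_REMOVAL_NOTIFICATION	= 0,
	EX_META_LOSS_NOTIFICATION	= 1,
};
#define EX_INFO_ID		0x0000ff00
#define EX_INFO_ID__SHIFT	8

/* Local stand-in for struct watch_notification. */
struct ex_notification {
	uint32_t	type:24;
	uint32_t	subtype:8;
	uint32_t	info;
};

/* Report a meta record; returns 1 if it says messages were lost, else 0. */
static int ex_got_meta(const struct ex_notification *n)
{
	unsigned int id = (n->info & EX_INFO_ID) >> EX_INFO_ID__SHIFT;

	switch (n->subtype) {
	case EX_META_REMOVAL_NOTIFICATION:
		printf("watch %u: watched object was removed\n", id);
		return 0;
	case EX_META_LOSS_NOTIFICATION:
		printf("some notifications were lost\n");
		return 1;
	default:
		printf("unknown meta subtype %u\n", (unsigned int)n->subtype);
		return 0;
	}
}
```

Returning a loss indication lets the caller decide whether to resynchronise
its view of the watched objects, since the lost records themselves cannot be
recovered from the pipe.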