=====================
The errseq_t datatype
=====================

An errseq_t is a way of recording errors in one place, and allowing any
number of "subscribers" to tell whether it has changed since a previous
point where it was sampled.

The initial use case for this is tracking errors for file
synchronization syscalls (fsync, fdatasync, msync and sync_file_range),
but it may be usable in other situations.

It's implemented as an unsigned 32-bit value.  The low order bits are
designated to hold an error code (between 1 and MAX_ERRNO).  The upper bits
are used as a counter.  This is done with atomics instead of locking so that
these functions can be called from any context.

Note that there is a risk of collisions if new errors are being recorded
frequently, since we have so few bits to use as a counter.

To mitigate this, the bit between the error value and counter is used as
a flag to tell whether the value has been sampled since a new value was
recorded.  That allows us to avoid bumping the counter if no one has
sampled it since the last time an error was recorded.

Thus we end up with a value that looks something like this:

+--------------------------------------+----+------------------------+
| 31..13                               | 12 | 11..0                  |
+--------------------------------------+----+------------------------+
| counter                              | SF | errno                  |
+--------------------------------------+----+------------------------+
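To make the layout in the table concrete, here is a minimal userspace sketch
that splits a 32-bit value into its three fields.  It assumes MAX_ERRNO is
4095, which is what a 12-bit errno field implies.  The ``MODEL_*`` macros and
``model_*`` helpers are hypothetical names invented for this illustration and
are not part of the errseq_t API.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative userspace constants; the names are not kernel identifiers. */
#define MODEL_MAX_ERRNO 4095u        /* bits 11..0: errno */
#define MODEL_SEEN      (1u << 12)   /* bit 12: "seen" flag (SF) */
#define MODEL_CTR_SHIFT 13           /* bits 31..13: counter */

static inline unsigned int model_errno(uint32_t v)
{
	return v & MODEL_MAX_ERRNO;
}

static inline int model_seen(uint32_t v)
{
	return (v & MODEL_SEEN) != 0;
}

static inline uint32_t model_counter(uint32_t v)
{
	return v >> MODEL_CTR_SHIFT;
}

/* Build a value from its three fields (for illustration only). */
static inline uint32_t model_pack(uint32_t counter, int seen, unsigned int err)
{
	return (counter << MODEL_CTR_SHIFT) |
	       (seen ? MODEL_SEEN : 0u) |
	       (err & MODEL_MAX_ERRNO);
}

static inline void model_layout_demo(void)
{
	uint32_t v = model_pack(3, 1, 5);	/* counter 3, seen, errno 5 */

	assert(v == 0x7005u);
	assert(model_counter(v) == 3);
	assert(model_seen(v));
	assert(model_errno(v) == 5);
}
```

Real users should treat an errseq_t as opaque and go through the API
functions; the decoding here is only meant to show which bits mean what.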

The general idea is for "watchers" to sample an errseq_t value and keep
it as a running cursor.  That value can later be used to tell whether
any new errors have occurred since that sampling was done, and atomically
record the state at the time that it was checked.  This allows us to
record errors in one place, and then have a number of "watchers" that
can tell whether the value has changed since they last checked it.

A new errseq_t should always be zeroed out.  An errseq_t value of all zeroes
is the special (but common) case where there has never been an error. An all
zero value thus serves as the "epoch" if one wishes to know whether there
has ever been an error set since it was first initialized.
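The role of the "seen" flag described earlier can be sketched with a
simplified, single-threaded model of recording an error.  This is not the
kernel implementation (which uses atomic compare-and-exchange); it assumes
the bit layout shown above, and ``model_set`` and the ``MODEL_*`` macros are
hypothetical names used only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative constants, assuming a 12-bit errno field. */
#define MODEL_MAX_ERRNO 4095u
#define MODEL_SEEN      (1u << 12)
#define MODEL_CTR_INC   (1u << 13)

/*
 * Single-threaded model of recording an error such as -EIO.
 * The counter is bumped only if the previous value was seen.
 */
static inline void model_set(uint32_t *eseq, int err)
{
	uint32_t old = *eseq;
	uint32_t new_val;

	if (err >= 0 || err < -(int)MODEL_MAX_ERRNO)
		return;		/* not a valid negative errno */

	/* Replace the errno bits and clear the seen flag. */
	new_val = (old & ~(MODEL_MAX_ERRNO | MODEL_SEEN)) | (uint32_t)(-err);
	if (old & MODEL_SEEN)
		new_val += MODEL_CTR_INC;
	*eseq = new_val;
}

static inline void model_set_demo(void)
{
	uint32_t e = 0;			/* zeroed: the "epoch" */
	uint32_t seen_val;

	model_set(&e, -EIO);
	model_set(&e, -EIO);		/* nobody looked: counter not bumped */
	assert(e == (uint32_t)EIO);

	e |= MODEL_SEEN;		/* pretend a watcher observed it */
	seen_val = e;
	model_set(&e, -EIO);		/* same errno, but a distinct value */
	assert(e != seen_val);
	assert((e >> 13) == 1);
	assert(!(e & MODEL_SEEN));
	assert((e & MODEL_MAX_ERRNO) == (uint32_t)EIO);
}
```

The point of the demo is that repeated errors only consume counter space
when someone has actually observed the previous value.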

API usage
=========

Let me tell you a story about a worker drone.  Now, he's a good worker
overall, but the company is a little...management heavy.  He has to
report to 77 supervisors today, and tomorrow the "big boss" is coming in
from out of town and he's sure to test the poor fellow too.

They're all handing him work to do -- so much he can't keep track of who
handed him what, but that's not really a big problem.  The supervisors
just want to know when he's finished all of the work they've handed him so
far and whether he made any mistakes since they last asked.

He might have made the mistake on work they didn't actually hand him,
but he can't keep track of things at that level of detail; all he can
remember is the most recent mistake that he made.

Here's our worker_drone representation::

        struct worker_drone {
                errseq_t        wd_err; /* for recording errors */
        };

Every day, the worker_drone starts out with a blank slate::

        struct worker_drone wd;

        wd.wd_err = (errseq_t)0;

The supervisors come in and get an initial read for the day.  They
don't care about anything that happened before their watch begins::

        struct supervisor {
                errseq_t        s_wd_err; /* private "cursor" for wd_err */
                spinlock_t      s_wd_err_lock; /* protects s_wd_err */
        };

        struct supervisor       su;

        su.s_wd_err = errseq_sample(&wd.wd_err);
        spin_lock_init(&su.s_wd_err_lock);

Now they start handing him tasks to do.  Every few minutes they ask him to
finish up all of the work they've handed him so far.  Then they ask him
whether he made any mistakes on any of it::

        spin_lock(&su.s_wd_err_lock);
        err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err);
        spin_unlock(&su.s_wd_err_lock);

Up to this point, that just keeps returning 0.

Now, the owners of this company are quite miserly and have given him
substandard equipment with which to do his job. Occasionally it
glitches and he makes a mistake.  He sighs a heavy sigh, and marks it
down::

        errseq_set(&wd.wd_err, -EIO);

...and then gets back to work.  The supervisors eventually poll again
and they each get the error when they next check.  Subsequent calls will
return 0, until another error is recorded, at which point it's reported
to each of them once.

Note that the supervisors can't tell how many mistakes he made, only
whether one was made since they last checked, and the latest value
recorded.
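The "reported to each of them once" behaviour can be illustrated with a
simplified, single-threaded userspace model of two supervisors.  It is a
sketch under stated assumptions, not the kernel code: it assumes the bit
layout shown earlier, uses no atomics, and every ``model_*`` and ``MODEL_*``
name is hypothetical.  The real sampling function also treats a recorded but
not yet seen error specially; this sketch leaves that out and only samples a
zeroed value.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095u
#define MODEL_SEEN      (1u << 12)
#define MODEL_CTR_INC   (1u << 13)

/* Record a negative errno; bump the counter only if the old value was seen. */
static inline void model_set(uint32_t *eseq, int err)
{
	uint32_t old = *eseq;
	uint32_t new_val;

	if (err >= 0 || err < -(int)MODEL_MAX_ERRNO)
		return;
	new_val = (old & ~(MODEL_MAX_ERRNO | MODEL_SEEN)) | (uint32_t)(-err);
	if (old & MODEL_SEEN)
		new_val += MODEL_CTR_INC;
	*eseq = new_val;
}

/* Simplified sampling: only used here while the value is still zero. */
static inline uint32_t model_sample(const uint32_t *eseq)
{
	return *eseq;
}

/* Report a change once per cursor, mark it seen, and advance the cursor. */
static inline int model_check_and_advance(uint32_t *eseq, uint32_t *since)
{
	uint32_t cur = *eseq;

	if (cur == *since)
		return 0;
	cur |= MODEL_SEEN;
	*eseq = cur;
	*since = cur;
	return -(int)(cur & MODEL_MAX_ERRNO);
}

static inline void model_watchers_demo(void)
{
	uint32_t wd_err = 0;
	uint32_t su1 = model_sample(&wd_err);
	uint32_t su2 = model_sample(&wd_err);

	assert(model_check_and_advance(&wd_err, &su1) == 0);

	model_set(&wd_err, -EIO);
	assert(model_check_and_advance(&wd_err, &su1) == -EIO);
	assert(model_check_and_advance(&wd_err, &su1) == 0);	/* once only */
	assert(model_check_and_advance(&wd_err, &su2) == -EIO);
	assert(model_check_and_advance(&wd_err, &su2) == 0);

	model_set(&wd_err, -EIO);	/* a second mistake, same errno */
	assert(model_check_and_advance(&wd_err, &su1) == -EIO);
}
```

Each cursor sees the error exactly once, and a repeat of the same errno is
still noticed because the counter moved after the first one was seen.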

Occasionally the big boss comes in for a spot check and asks the worker
to do a one-off job for him. He's not really watching the worker
full-time like the supervisors, but he does need to know whether a
mistake occurred while his job was processing.

He can just sample the current errseq_t in the worker, and then use that
to tell whether an error has occurred later::

        errseq_t since = errseq_sample(&wd.wd_err);
        /* submit some work and wait for it to complete */
        err = errseq_check(&wd.wd_err, since);

Since he's just going to discard "since" after that point, he doesn't
need to advance it here. He also doesn't need any locking since it's
not usable by anyone else.
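The difference between a plain check and check-and-advance can be seen in a
small single-threaded model of the big boss's spot check.  As with any such
sketch, this assumes the bit layout shown earlier and is not the kernel
implementation; ``model_set``, ``model_sample`` and ``model_check`` are
hypothetical stand-ins, and the sampling helper deliberately ignores the
special handling the real function gives to unseen errors.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095u
#define MODEL_SEEN      (1u << 12)
#define MODEL_CTR_INC   (1u << 13)

/* Record a negative errno; bump the counter only if the old value was seen. */
static inline void model_set(uint32_t *eseq, int err)
{
	uint32_t old = *eseq;
	uint32_t new_val;

	if (err >= 0 || err < -(int)MODEL_MAX_ERRNO)
		return;
	new_val = (old & ~(MODEL_MAX_ERRNO | MODEL_SEEN)) | (uint32_t)(-err);
	if (old & MODEL_SEEN)
		new_val += MODEL_CTR_INC;
	*eseq = new_val;
}

/* Simplified sampling: only used here while the value is still zero. */
static inline uint32_t model_sample(const uint32_t *eseq)
{
	return *eseq;
}

/* Compare against a sampled value without modifying anything. */
static inline int model_check(const uint32_t *eseq, uint32_t since)
{
	uint32_t cur = *eseq;

	return cur == since ? 0 : -(int)(cur & MODEL_MAX_ERRNO);
}

static inline void model_oneoff_demo(void)
{
	uint32_t wd_err = 0;
	uint32_t since = model_sample(&wd_err);

	assert(model_check(&wd_err, since) == 0);	/* nothing yet */

	model_set(&wd_err, -EIO);			/* job hits an error */
	assert(model_check(&wd_err, since) == -EIO);

	/* "since" was not advanced, so asking again gives the same answer. */
	assert(model_check(&wd_err, since) == -EIO);
}
```

Because the check writes to neither the shared value nor the local copy, it
needs no lock, which matches the reasoning in the text.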

Serializing errseq_t cursor updates
===================================

Note that the errseq_t API does not protect the errseq_t cursor during a
check_and_advance operation. Only the canonical error code is handled
atomically.  In a situation where more than one task might be using the
same errseq_t cursor at the same time, it's important to serialize
updates to that cursor.

If that's not done, then it's possible for the cursor to go backward,
in which case the same error could be reported more than once.

Because of this, it's often advantageous to first do an errseq_check to
see if anything has changed, and only later do an
errseq_check_and_advance after taking the lock. For example::

        if (errseq_check(&wd.wd_err, READ_ONCE(su.s_wd_err))) {
                /* su.s_wd_err is protected by s_wd_err_lock */
                spin_lock(&su.s_wd_err_lock);
                err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err);
                spin_unlock(&su.s_wd_err_lock);
        }

That avoids the spinlock in the common case where nothing has changed
since the last time it was checked.

Functions
=========

.. kernel-doc:: lib/errseq.c