.. _kernel_hacking_lock:

===========================
Unreliable Guide To Locking
===========================

:Author: Rusty Russell

Introduction
============

Welcome to Rusty's Remarkably Unreliable Guide to Kernel Locking
issues. This document describes the locking systems in the Linux Kernel
in 2.6.

With the wide availability of HyperThreading, and preemption in the
Linux Kernel, everyone hacking on the kernel needs to know the
fundamentals of concurrency and locking for SMP.

The Problem With Concurrency
============================

(Skip this if you know what a Race Condition is).

In a normal program, you can increment a counter like so:

::

          very_important_count++;


This is what you would expect to happen:


.. table:: Expected Results

  +------------------------------------+------------------------------------+
  | Instance 1                         | Instance 2                         |
  +====================================+====================================+
  | read very_important_count (5)      |                                    |
  +------------------------------------+------------------------------------+
  | add 1 (6)                          |                                    |
  +------------------------------------+------------------------------------+
  | write very_important_count (6)     |                                    |
  +------------------------------------+------------------------------------+
  |                                    | read very_important_count (6)      |
  +------------------------------------+------------------------------------+
  |                                    | add 1 (7)                          |
  +------------------------------------+------------------------------------+
  |                                    | write very_important_count (7)     |
  +------------------------------------+------------------------------------+

This is what might happen:

.. table:: Possible Results

  +------------------------------------+------------------------------------+
  | Instance 1                         | Instance 2                         |
  +====================================+====================================+
  | read very_important_count (5)      |                                    |
  +------------------------------------+------------------------------------+
  |                                    | read very_important_count (5)      |
  +------------------------------------+------------------------------------+
  | add 1 (6)                          |                                    |
  +------------------------------------+------------------------------------+
  |                                    | add 1 (6)                          |
  +------------------------------------+------------------------------------+
  | write very_important_count (6)     |                                    |
  +------------------------------------+------------------------------------+
  |                                    | write very_important_count (6)     |
  +------------------------------------+------------------------------------+


Race Conditions and Critical Regions
------------------------------------

This overlap, where the result depends on the relative timing of
multiple tasks, is called a race condition. The piece of code containing
the concurrency issue is called a critical region. Especially since
Linux started running on SMP machines, race conditions have become one
of the major issues in kernel design and implementation.
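
The lost update from the "Possible Results" table above can be
reproduced deterministically by writing out the read/add/write steps by
hand and replaying both orderings. This is a userspace C sketch, not
kernel code; the helper names (run_serialized(), run_interleaved()) are
invented for illustration::

    #include <assert.h>
    #include <stdio.h>

    /* The shared counter from the tables above. */
    static int very_important_count;

    static int read_count(void)    { return very_important_count; }
    static void write_count(int v) { very_important_count = v; }

    /* Expected ordering: instance 1 finishes before instance 2 starts. */
    static int run_serialized(void)
    {
            int r1, r2;

            very_important_count = 5;
            r1 = read_count();      /* instance 1 reads 5 */
            write_count(r1 + 1);    /* instance 1 writes 6 */
            r2 = read_count();      /* instance 2 reads 6 */
            write_count(r2 + 1);    /* instance 2 writes 7 */
            return very_important_count;
    }

    /* Racy ordering: both instances read before either writes. */
    static int run_interleaved(void)
    {
            int r1, r2;

            very_important_count = 5;
            r1 = read_count();      /* instance 1 reads 5 */
            r2 = read_count();      /* instance 2 also reads 5 */
            write_count(r1 + 1);    /* instance 1 writes 6 */
            write_count(r2 + 1);    /* instance 2 overwrites with 6 */
            return very_important_count;
    }

    int main(void)
    {
            assert(run_serialized() == 7);
            assert(run_interleaved() == 6);  /* one increment is lost */
            printf("serialized=7 interleaved=6\n");
            return 0;
    }

On real hardware the interleaving is chosen by timing, not by you, which
is exactly why the bug is so hard to reproduce.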

Preemption can have the same effect, even if there is only one CPU: by
preempting one task during the critical region, we have exactly the same
race condition. In this case the thread which preempts might run the
critical region itself.

The solution is to recognize when these simultaneous accesses occur, and
use locks to make sure that only one instance can enter the critical
region at any time. There are many friendly primitives in the Linux
kernel to help you do this. And then there are the unfriendly
primitives, but I'll pretend they don't exist.

Locking in the Linux Kernel
===========================

If I could give you one piece of advice on locking: **keep it simple**.

Be reluctant to introduce new locks.

Two Main Types of Kernel Locks: Spinlocks and Mutexes
-----------------------------------------------------

There are two main types of kernel locks. The fundamental type is the
spinlock (``include/asm/spinlock.h``), which is a very simple
single-holder lock: if you can't get the spinlock, you keep trying
(spinning) until you can. Spinlocks are very small and fast, and can be
used anywhere.

The second type is a mutex (``include/linux/mutex.h``): it is like a
spinlock, but you may block holding a mutex. If you can't lock a mutex,
your task will suspend itself, and be woken up when the mutex is
released. This means the CPU can do something else while you are
waiting. There are many cases when you simply can't sleep (see
`What Functions Are Safe To Call From Interrupts?`_),
and so have to use a spinlock instead.
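
As a userspace analogy (POSIX threads, not the kernel API), a sleeping
lock such as ``pthread_mutex_t`` plays the role of the kernel mutex:
with the critical region protected, the counter always ends up with the
expected value no matter how the threads interleave. This is a sketch,
not kernel code::

    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NITERS   100000

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long very_important_count;

    static void *worker(void *unused)
    {
            for (int i = 0; i < NITERS; i++) {
                    pthread_mutex_lock(&lock);   /* may sleep, like a kernel mutex */
                    very_important_count++;      /* critical region */
                    pthread_mutex_unlock(&lock);
            }
            return NULL;
    }

    long run_counter_demo(void)
    {
            pthread_t t[NTHREADS];

            very_important_count = 0;
            for (int i = 0; i < NTHREADS; i++)
                    pthread_create(&t[i], NULL, worker, NULL);
            for (int i = 0; i < NTHREADS; i++)
                    pthread_join(t[i], NULL);
            return very_important_count;
    }

    int main(void)
    {
            long n = run_counter_demo();

            assert(n == (long)NTHREADS * NITERS);  /* no lost updates */
            printf("count=%ld\n", n);
            return 0;
    }

Remove the lock/unlock pair and the final count will usually come up
short, which is the lost-update race from the tables above.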

Neither type of lock is recursive: see
`Deadlock: Simple and Advanced`_.

Locks and Uniprocessor Kernels
------------------------------

For kernels compiled without ``CONFIG_SMP``, and without
``CONFIG_PREEMPT``, spinlocks do not exist at all. This is an excellent
design decision: when no-one else can run at the same time, there is no
reason to have a lock.

If the kernel is compiled without ``CONFIG_SMP``, but ``CONFIG_PREEMPT``
is set, then spinlocks simply disable preemption, which is sufficient to
prevent any races. For most purposes, we can think of preemption as
equivalent to SMP, and not worry about it separately.

You should always test your locking code with ``CONFIG_SMP`` and
``CONFIG_PREEMPT`` enabled, even if you don't have an SMP test box,
because it will still catch some kinds of locking bugs.

Mutexes still exist, because they are required for synchronization
between user contexts, as we will see below.

Locking Only In User Context
----------------------------

If you have a data structure which is only ever accessed from user
context, then you can use a simple mutex (``include/linux/mutex.h``) to
protect it. This is the most trivial case: you initialize the mutex.
Then you can call mutex_lock_interruptible() to grab the
mutex, and mutex_unlock() to release it. There is also a
mutex_lock(), which should be avoided, because it will
not return if a signal is received.

Example: ``net/netfilter/nf_sockopt.c`` allows registration of new
setsockopt() and getsockopt() calls, with
nf_register_sockopt(). Registration and de-registration
are only done on module load and unload (and boot time, where there is
no concurrency), and the list of registrations is only consulted for an
unknown setsockopt() or getsockopt() system
call. The ``nf_sockopt_mutex`` is perfect to protect this, especially
since the setsockopt and getsockopt calls may well sleep.
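
The registration pattern can be sketched in userspace. Everything below
is a hypothetical analogy (the struct and function names are invented,
not the real nf_sockopt interface): a list that is only ever touched
from sleepable context, so nothing stronger than a mutex is needed::

    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_OPS 8

    struct sockopt_ops {
            const char *name;   /* hypothetical stand-in for real ops */
    };

    /* Protects ops_list and ops_count; only used from sleepable context. */
    static pthread_mutex_t ops_mutex = PTHREAD_MUTEX_INITIALIZER;
    static struct sockopt_ops *ops_list[MAX_OPS];
    static int ops_count;

    int register_ops(struct sockopt_ops *ops)
    {
            int ret = -1;

            pthread_mutex_lock(&ops_mutex);   /* may sleep: fine here */
            if (ops_count < MAX_OPS) {
                    ops_list[ops_count++] = ops;
                    ret = 0;
            }
            pthread_mutex_unlock(&ops_mutex);
            return ret;
    }

    struct sockopt_ops *find_ops(const char *name)
    {
            struct sockopt_ops *found = NULL;

            pthread_mutex_lock(&ops_mutex);
            for (int i = 0; i < ops_count; i++)
                    if (strcmp(ops_list[i]->name, name) == 0)
                            found = ops_list[i];
            pthread_mutex_unlock(&ops_mutex);
            return found;
    }

    int main(void)
    {
            static struct sockopt_ops my_ops = { .name = "my_sockopt" };

            assert(register_ops(&my_ops) == 0);
            assert(find_ops("my_sockopt") == &my_ops);
            assert(find_ops("unknown") == NULL);
            printf("ok\n");
            return 0;
    }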

Locking Between User Context and Softirqs
-----------------------------------------

If a softirq shares data with user context, you have two problems.
Firstly, the current user context can be interrupted by a softirq, and
secondly, the critical region could be entered from another CPU. This is
where spin_lock_bh() (``include/linux/spinlock.h``) is
used. It disables softirqs on that CPU, then grabs the lock.
spin_unlock_bh() does the reverse. (The '_bh' suffix is
a historical reference to "Bottom Halves", the old name for software
interrupts. It should really be called spin_lock_softirq() in a
perfect world).

Note that you can also use spin_lock_irq() or
spin_lock_irqsave() here, which stop hardware interrupts
as well: see `Hard IRQ Context`_.

This works perfectly for UP as well: the spin lock vanishes, and this
macro simply becomes local_bh_disable()
(``include/linux/interrupt.h``), which protects you from the softirq
being run.

Locking Between User Context and Tasklets
-----------------------------------------

This is exactly the same as above, because tasklets are actually run
from a softirq.

Locking Between User Context and Timers
---------------------------------------

This, too, is exactly the same as above, because timers are actually run
from a softirq. From a locking point of view, tasklets and timers are
identical.

Locking Between Tasklets/Timers
-------------------------------

Sometimes a tasklet or timer might want to share data with another
tasklet or timer.

The Same Tasklet/Timer
~~~~~~~~~~~~~~~~~~~~~~

Since a tasklet is never run on two CPUs at once, you don't need to
worry about your tasklet being reentrant (running twice at once), even
on SMP.

Different Tasklets/Timers
~~~~~~~~~~~~~~~~~~~~~~~~~

If another tasklet/timer wants to share data with your tasklet or
timer, you will both need to use spin_lock() and
spin_unlock() calls. spin_lock_bh() is
unnecessary here, as you are already in a tasklet, and none will be run
on the same CPU.

Locking Between Softirqs
------------------------

Often a softirq might want to share data with itself or a tasklet/timer.

The Same Softirq
~~~~~~~~~~~~~~~~

The same softirq can run on the other CPUs: you can use a per-CPU array
(see `Per-CPU Data`_) for better performance. If you're
going so far as to use a softirq, you probably care about scalable
performance enough to justify the extra complexity.

You'll need to use spin_lock() and
spin_unlock() for shared data.

Different Softirqs
~~~~~~~~~~~~~~~~~~

You'll need to use spin_lock() and
spin_unlock() for shared data, whether it be a timer,
tasklet, different softirq or the same or another softirq: any of them
could be running on a different CPU.

Hard IRQ Context
================

Hardware interrupts usually communicate with a tasklet or softirq.
Frequently this involves putting work in a queue, which the softirq will
take out.

Locking Between Hard IRQ and Softirqs/Tasklets
----------------------------------------------

If a hardware irq handler shares data with a softirq, you have two
concerns. Firstly, the softirq processing can be interrupted by a
hardware interrupt, and secondly, the critical region could be entered
by a hardware interrupt on another CPU. This is where
spin_lock_irq() is used. It is defined to disable
interrupts on that cpu, then grab the lock.
spin_unlock_irq() does the reverse.

The irq handler does not need to use spin_lock_irq(), because
the softirq cannot run while the irq handler is running: it can use
spin_lock(), which is slightly faster. The only exception
would be if a different hardware irq handler uses the same lock:
spin_lock_irq() will stop that from interrupting us.

This works perfectly for UP as well: the spin lock vanishes, and this
macro simply becomes local_irq_disable()
(``include/asm/smp.h``), which protects you from the softirq/tasklet/BH
being run.

spin_lock_irqsave() (``include/linux/spinlock.h``) is a
variant which saves whether interrupts were on or off in a flags word,
which is passed to spin_unlock_irqrestore(). This means
that the same code can be used inside a hard irq handler (where
interrupts are already off) and in softirqs (where the irq disabling is
required).

Note that softirqs (and hence tasklets and timers) are run on return
from hardware interrupts, so spin_lock_irq() also stops
these. In that sense, spin_lock_irqsave() is the most
general and powerful locking function.
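
The save-and-restore idea behind spin_lock_irqsave() can be sketched in
userspace with signal masks standing in for the interrupt state:
sigprocmask() saves the old mask the way the flags word saves the
interrupt state, so the inner of two nested critical regions restores
"still disabled" rather than unconditionally re-enabling. This is only
an analogy, not kernel code, and the helper names are invented::

    #include <assert.h>
    #include <signal.h>
    #include <stdio.h>

    /* Block SIGUSR1 and remember whether it was already blocked, the
     * way spin_lock_irqsave() remembers interrupt state in 'flags'. */
    static void enter_critical(sigset_t *flags)
    {
            sigset_t block;

            sigemptyset(&block);
            sigaddset(&block, SIGUSR1);
            sigprocmask(SIG_BLOCK, &block, flags);  /* save old mask */
    }

    static void exit_critical(const sigset_t *flags)
    {
            sigprocmask(SIG_SETMASK, flags, NULL);  /* restore saved state */
    }

    int sigusr1_blocked(void)
    {
            sigset_t cur;

            sigprocmask(SIG_BLOCK, NULL, &cur);     /* query current mask */
            return sigismember(&cur, SIGUSR1);
    }

    int main(void)
    {
            sigset_t flags, flags2;

            assert(!sigusr1_blocked());
            enter_critical(&flags);
            assert(sigusr1_blocked());      /* "interrupts" are off */

            /* Nesting works: each level restores what it saved. */
            enter_critical(&flags2);
            exit_critical(&flags2);
            assert(sigusr1_blocked());      /* still off: outer level owns it */

            exit_critical(&flags);
            assert(!sigusr1_blocked());     /* back to the original state */
            printf("ok\n");
            return 0;
    }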

Locking Between Two Hard IRQ Handlers
-------------------------------------

It is rare to have to share data between two IRQ handlers, but if you
do, spin_lock_irqsave() should be used: it is
architecture-specific whether all interrupts are disabled inside irq
handlers themselves.

Cheat Sheet For Locking
=======================

Pete Zaitcev gives the following summary:

-  If you are in a process context (any syscall) and want to lock other
   processes out, use a mutex. You can take a mutex and sleep
   (``copy_from_user()`` or ``kmalloc(x,GFP_KERNEL)``).

-  Otherwise (== data can be touched in an interrupt), use
   spin_lock_irqsave() and
   spin_unlock_irqrestore().

-  Avoid holding a spinlock for more than 5 lines of code and across any
   function call (except accessors like readb()).

Table of Minimum Requirements
-----------------------------

The following table lists the **minimum** locking requirements between
various contexts. In some cases, the same context can only be running on
one CPU at a time, so no locking is required for that context (eg. a
particular thread can only run on one CPU at a time, but if it needs to
share data with another thread, locking is required).

Remember the advice above: you can always use
spin_lock_irqsave(), which is a superset of all other
spinlock primitives.

============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ==============
.              IRQ Handler A IRQ Handler B Softirq A Softirq B Tasklet A Tasklet B Timer A Timer B User Context A User Context B
============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ==============
IRQ Handler A  None
IRQ Handler B  SLIS          None
Softirq A      SLI           SLI           SL
Softirq B      SLI           SLI           SL        SL
Tasklet A      SLI           SLI           SL        SL        None
Tasklet B      SLI           SLI           SL        SL        SL        None
Timer A        SLI           SLI           SL        SL        SL        SL        None
Timer B        SLI           SLI           SL        SL        SL        SL        SL      None
User Context A SLI           SLI           SLBH      SLBH      SLBH      SLBH      SLBH    SLBH    None
User Context B SLI           SLI           SLBH      SLBH      SLBH      SLBH      SLBH    SLBH    MLI            None
============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ==============

Table: Table of Locking Requirements

+--------+----------------------------+
| SLIS   | spin_lock_irqsave          |
+--------+----------------------------+
| SLI    | spin_lock_irq              |
+--------+----------------------------+
| SL     | spin_lock                  |
+--------+----------------------------+
| SLBH   | spin_lock_bh               |
+--------+----------------------------+
| MLI    | mutex_lock_interruptible   |
+--------+----------------------------+

Table: Legend for Locking Requirements Table

The trylock Functions
=====================

There are functions that try to acquire a lock only once and immediately
return a value telling about success or failure to acquire the lock.
They can be used if you don't need to access the data protected by the
lock while some other thread holds it; if you later need that data, you
can acquire the lock in the normal way.

spin_trylock() does not spin but returns non-zero if it
acquires the spinlock on the first try or 0 if not. This function can be
used in all contexts like spin_lock(): you must have
disabled the contexts that might interrupt you and acquire the spin
lock.

mutex_trylock() does not suspend your task but returns
non-zero if it could lock the mutex on the first try or 0 if not. This
function cannot be safely used in hardware or software interrupt
contexts despite not sleeping.
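
A userspace sketch of the trylock pattern using POSIX
pthread_mutex_trylock() (an analogy, not the kernel mutex_trylock() API;
the try_grab() wrapper is invented here to mirror the non-zero/zero
success convention described above)::

    #include <assert.h>
    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t cache_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* Returns 1 if the lock was acquired, 0 if it is already held,
     * mirroring the non-zero/zero convention of mutex_trylock(). */
    int try_grab(pthread_mutex_t *m)
    {
            return pthread_mutex_trylock(m) == 0;
    }

    int main(void)
    {
            assert(try_grab(&cache_mutex) == 1);  /* uncontended: success */

            /* A second attempt fails immediately instead of blocking:
             * POSIX specifies EBUSY for an already-locked mutex. */
            assert(pthread_mutex_trylock(&cache_mutex) == EBUSY);

            pthread_mutex_unlock(&cache_mutex);
            assert(try_grab(&cache_mutex) == 1);  /* free again: success */
            pthread_mutex_unlock(&cache_mutex);
            printf("ok\n");
            return 0;
    }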

Common Examples
===============

Let's step through a simple example: a cache of number to name mappings.
The cache keeps a count of how often each of the objects is used, and
when it gets full, throws out the least used one.

All In User Context
-------------------

For our first example, we assume that all operations are in user context
(ie. from system calls), so we can sleep. This means we can use a mutex
to protect the cache and all the objects within it. Here's the code::

    #include <linux/list.h>
    #include <linux/slab.h>
    #include <linux/string.h>
    #include <linux/mutex.h>
    #include <asm/errno.h>

    struct object
    {
            struct list_head list;
            int id;
            char name[32];
            int popularity;
    };

    /* Protects the cache, cache_num, and the objects within it */
    static DEFINE_MUTEX(cache_lock);
    static LIST_HEAD(cache);
    static unsigned int cache_num = 0;
    #define MAX_CACHE_SIZE 10

    /* Must be holding cache_lock */
    static struct object *__cache_find(int id)
    {
            struct object *i;

            list_for_each_entry(i, &cache, list)
                    if (i->id == id) {
                            i->popularity++;
                            return i;
                    }
            return NULL;
    }

    /* Must be holding cache_lock */
    static void __cache_delete(struct object *obj)
    {
            BUG_ON(!obj);
            list_del(&obj->list);
            kfree(obj);
            cache_num--;
    }

    /* Must be holding cache_lock */
    static void __cache_add(struct object *obj)
    {
            list_add(&obj->list, &cache);
            if (++cache_num > MAX_CACHE_SIZE) {
                    struct object *i, *outcast = NULL;
                    list_for_each_entry(i, &cache, list) {
                            if (!outcast || i->popularity < outcast->popularity)
                                    outcast = i;
                    }
                    __cache_delete(outcast);
            }
    }

    int cache_add(int id, const char *name)
    {
            struct object *obj;

            if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
                    return -ENOMEM;

            strscpy(obj->name, name, sizeof(obj->name));
            obj->id = id;
            obj->popularity = 0;

            mutex_lock(&cache_lock);
            __cache_add(obj);
            mutex_unlock(&cache_lock);
            return 0;
    }

    void cache_delete(int id)
    {
            mutex_lock(&cache_lock);
            __cache_delete(__cache_find(id));
            mutex_unlock(&cache_lock);
    }

    int cache_find(int id, char *name)
    {
            struct object *obj;
            int ret = -ENOENT;

            mutex_lock(&cache_lock);
            obj = __cache_find(id);
            if (obj) {
                    ret = 0;
                    strcpy(name, obj->name);
            }
            mutex_unlock(&cache_lock);
            return ret;
    }

Note that we always make sure we have the cache_lock when we add,
delete, or look up the cache: both the cache infrastructure itself and
the contents of the objects are protected by the lock. In this case it's
483e548cdefSMauro Carvalho Chehabeasy, since we copy the data for the user, and never let them access the
484e548cdefSMauro Carvalho Chehabobjects directly.
485e548cdefSMauro Carvalho Chehab
486e548cdefSMauro Carvalho ChehabThere is a slight (and common) optimization here: in
487b1735296SStephen Boydcache_add() we set up the fields of the object before
488e548cdefSMauro Carvalho Chehabgrabbing the lock. This is safe, as no-one else can access it until we
489e548cdefSMauro Carvalho Chehabput it in cache.
490e548cdefSMauro Carvalho Chehab
491e548cdefSMauro Carvalho ChehabAccessing From Interrupt Context
492e548cdefSMauro Carvalho Chehab--------------------------------
493e548cdefSMauro Carvalho Chehab
494b1735296SStephen BoydNow consider the case where cache_find() can be called
495e548cdefSMauro Carvalho Chehabfrom interrupt context: either a hardware interrupt or a softirq. An
example would be a timer which deletes objects from the cache.

497e548cdefSMauro Carvalho Chehab
498e548cdefSMauro Carvalho ChehabThe change is shown below, in standard patch format: the ``-`` are lines
499e548cdefSMauro Carvalho Chehabwhich are taken away, and the ``+`` are lines which are added.
500e548cdefSMauro Carvalho Chehab
501e548cdefSMauro Carvalho Chehab::
502e548cdefSMauro Carvalho Chehab
503e548cdefSMauro Carvalho Chehab    --- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100
504e548cdefSMauro Carvalho Chehab    +++ cache.c.interrupt   2003-12-09 14:07:49.000000000 +1100
505e548cdefSMauro Carvalho Chehab    @@ -12,7 +12,7 @@
506e548cdefSMauro Carvalho Chehab             int popularity;
507e548cdefSMauro Carvalho Chehab     };
508e548cdefSMauro Carvalho Chehab
509e548cdefSMauro Carvalho Chehab    -static DEFINE_MUTEX(cache_lock);
510e548cdefSMauro Carvalho Chehab    +static DEFINE_SPINLOCK(cache_lock);
511e548cdefSMauro Carvalho Chehab     static LIST_HEAD(cache);
512e548cdefSMauro Carvalho Chehab     static unsigned int cache_num = 0;
513e548cdefSMauro Carvalho Chehab     #define MAX_CACHE_SIZE 10
514e548cdefSMauro Carvalho Chehab    @@ -55,6 +55,7 @@
515e548cdefSMauro Carvalho Chehab     int cache_add(int id, const char *name)
516e548cdefSMauro Carvalho Chehab     {
517e548cdefSMauro Carvalho Chehab             struct object *obj;
518e548cdefSMauro Carvalho Chehab    +        unsigned long flags;
519e548cdefSMauro Carvalho Chehab
520e548cdefSMauro Carvalho Chehab             if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
521e548cdefSMauro Carvalho Chehab                     return -ENOMEM;
522e548cdefSMauro Carvalho Chehab    @@ -63,30 +64,33 @@
523e548cdefSMauro Carvalho Chehab             obj->id = id;
524e548cdefSMauro Carvalho Chehab             obj->popularity = 0;
525e548cdefSMauro Carvalho Chehab
526e548cdefSMauro Carvalho Chehab    -        mutex_lock(&cache_lock);
527e548cdefSMauro Carvalho Chehab    +        spin_lock_irqsave(&cache_lock, flags);
528e548cdefSMauro Carvalho Chehab             __cache_add(obj);
529e548cdefSMauro Carvalho Chehab    -        mutex_unlock(&cache_lock);
530e548cdefSMauro Carvalho Chehab    +        spin_unlock_irqrestore(&cache_lock, flags);
531e548cdefSMauro Carvalho Chehab             return 0;
532e548cdefSMauro Carvalho Chehab     }
533e548cdefSMauro Carvalho Chehab
534e548cdefSMauro Carvalho Chehab     void cache_delete(int id)
535e548cdefSMauro Carvalho Chehab     {
536e548cdefSMauro Carvalho Chehab    -        mutex_lock(&cache_lock);
537e548cdefSMauro Carvalho Chehab    +        unsigned long flags;
538e548cdefSMauro Carvalho Chehab    +
539e548cdefSMauro Carvalho Chehab    +        spin_lock_irqsave(&cache_lock, flags);
540e548cdefSMauro Carvalho Chehab             __cache_delete(__cache_find(id));
541e548cdefSMauro Carvalho Chehab    -        mutex_unlock(&cache_lock);
542e548cdefSMauro Carvalho Chehab    +        spin_unlock_irqrestore(&cache_lock, flags);
543e548cdefSMauro Carvalho Chehab     }
544e548cdefSMauro Carvalho Chehab
545e548cdefSMauro Carvalho Chehab     int cache_find(int id, char *name)
546e548cdefSMauro Carvalho Chehab     {
547e548cdefSMauro Carvalho Chehab             struct object *obj;
548e548cdefSMauro Carvalho Chehab             int ret = -ENOENT;
549e548cdefSMauro Carvalho Chehab    +        unsigned long flags;
550e548cdefSMauro Carvalho Chehab
551e548cdefSMauro Carvalho Chehab    -        mutex_lock(&cache_lock);
552e548cdefSMauro Carvalho Chehab    +        spin_lock_irqsave(&cache_lock, flags);
553e548cdefSMauro Carvalho Chehab             obj = __cache_find(id);
554e548cdefSMauro Carvalho Chehab             if (obj) {
555e548cdefSMauro Carvalho Chehab                     ret = 0;
556e548cdefSMauro Carvalho Chehab                     strcpy(name, obj->name);
557e548cdefSMauro Carvalho Chehab             }
558e548cdefSMauro Carvalho Chehab    -        mutex_unlock(&cache_lock);
559e548cdefSMauro Carvalho Chehab    +        spin_unlock_irqrestore(&cache_lock, flags);
560e548cdefSMauro Carvalho Chehab             return ret;
561e548cdefSMauro Carvalho Chehab     }
562e548cdefSMauro Carvalho Chehab
Note that spin_lock_irqsave() will turn off interrupts
if they are on, and does nothing if they are already off (say, because
we are already in an interrupt handler); hence these functions are safe
to call from any context.
567e548cdefSMauro Carvalho Chehab
568b1735296SStephen BoydUnfortunately, cache_add() calls kmalloc()
569e548cdefSMauro Carvalho Chehabwith the ``GFP_KERNEL`` flag, which is only legal in user context. I
570b1735296SStephen Boydhave assumed that cache_add() is still only called in
571e548cdefSMauro Carvalho Chehabuser context, otherwise this should become a parameter to
572b1735296SStephen Boydcache_add().
573e548cdefSMauro Carvalho Chehab
574e548cdefSMauro Carvalho ChehabExposing Objects Outside This File
575e548cdefSMauro Carvalho Chehab----------------------------------
576e548cdefSMauro Carvalho Chehab
577e548cdefSMauro Carvalho ChehabIf our objects contained more information, it might not be sufficient to
578e548cdefSMauro Carvalho Chehabcopy the information in and out: other parts of the code might want to
579e548cdefSMauro Carvalho Chehabkeep pointers to these objects, for example, rather than looking up the
580e548cdefSMauro Carvalho Chehabid every time. This produces two problems.
581e548cdefSMauro Carvalho Chehab
582e548cdefSMauro Carvalho ChehabThe first problem is that we use the ``cache_lock`` to protect objects:
583e548cdefSMauro Carvalho Chehabwe'd need to make this non-static so the rest of the code can use it.
584e548cdefSMauro Carvalho ChehabThis makes locking trickier, as it is no longer all in one place.
585e548cdefSMauro Carvalho Chehab
586e548cdefSMauro Carvalho ChehabThe second problem is the lifetime problem: if another structure keeps a
587e548cdefSMauro Carvalho Chehabpointer to an object, it presumably expects that pointer to remain
588e548cdefSMauro Carvalho Chehabvalid. Unfortunately, this is only guaranteed while you hold the lock,
589b1735296SStephen Boydotherwise someone might call cache_delete() and even
590e548cdefSMauro Carvalho Chehabworse, add another object, re-using the same address.
591e548cdefSMauro Carvalho Chehab
592e548cdefSMauro Carvalho ChehabAs there is only one lock, you can't hold it forever: no-one else would
593e548cdefSMauro Carvalho Chehabget any work done.
594e548cdefSMauro Carvalho Chehab
595e548cdefSMauro Carvalho ChehabThe solution to this problem is to use a reference count: everyone who
596e548cdefSMauro Carvalho Chehabhas a pointer to the object increases it when they first get the object,
597e548cdefSMauro Carvalho Chehaband drops the reference count when they're finished with it. Whoever
598e548cdefSMauro Carvalho Chehabdrops it to zero knows it is unused, and can actually delete it.
599e548cdefSMauro Carvalho Chehab
600e548cdefSMauro Carvalho ChehabHere is the code::
601e548cdefSMauro Carvalho Chehab
602e548cdefSMauro Carvalho Chehab    --- cache.c.interrupt   2003-12-09 14:25:43.000000000 +1100
603e548cdefSMauro Carvalho Chehab    +++ cache.c.refcnt  2003-12-09 14:33:05.000000000 +1100
604e548cdefSMauro Carvalho Chehab    @@ -7,6 +7,7 @@
605e548cdefSMauro Carvalho Chehab     struct object
606e548cdefSMauro Carvalho Chehab     {
607e548cdefSMauro Carvalho Chehab             struct list_head list;
608e548cdefSMauro Carvalho Chehab    +        unsigned int refcnt;
609e548cdefSMauro Carvalho Chehab             int id;
610e548cdefSMauro Carvalho Chehab             char name[32];
611e548cdefSMauro Carvalho Chehab             int popularity;
612e548cdefSMauro Carvalho Chehab    @@ -17,6 +18,35 @@
613e548cdefSMauro Carvalho Chehab     static unsigned int cache_num = 0;
614e548cdefSMauro Carvalho Chehab     #define MAX_CACHE_SIZE 10
615e548cdefSMauro Carvalho Chehab
616e548cdefSMauro Carvalho Chehab    +static void __object_put(struct object *obj)
617e548cdefSMauro Carvalho Chehab    +{
618e548cdefSMauro Carvalho Chehab    +        if (--obj->refcnt == 0)
619e548cdefSMauro Carvalho Chehab    +                kfree(obj);
620e548cdefSMauro Carvalho Chehab    +}
621e548cdefSMauro Carvalho Chehab    +
622e548cdefSMauro Carvalho Chehab    +static void __object_get(struct object *obj)
623e548cdefSMauro Carvalho Chehab    +{
624e548cdefSMauro Carvalho Chehab    +        obj->refcnt++;
625e548cdefSMauro Carvalho Chehab    +}
626e548cdefSMauro Carvalho Chehab    +
627e548cdefSMauro Carvalho Chehab    +void object_put(struct object *obj)
628e548cdefSMauro Carvalho Chehab    +{
629e548cdefSMauro Carvalho Chehab    +        unsigned long flags;
630e548cdefSMauro Carvalho Chehab    +
631e548cdefSMauro Carvalho Chehab    +        spin_lock_irqsave(&cache_lock, flags);
632e548cdefSMauro Carvalho Chehab    +        __object_put(obj);
633e548cdefSMauro Carvalho Chehab    +        spin_unlock_irqrestore(&cache_lock, flags);
634e548cdefSMauro Carvalho Chehab    +}
635e548cdefSMauro Carvalho Chehab    +
636e548cdefSMauro Carvalho Chehab    +void object_get(struct object *obj)
637e548cdefSMauro Carvalho Chehab    +{
638e548cdefSMauro Carvalho Chehab    +        unsigned long flags;
639e548cdefSMauro Carvalho Chehab    +
640e548cdefSMauro Carvalho Chehab    +        spin_lock_irqsave(&cache_lock, flags);
641e548cdefSMauro Carvalho Chehab    +        __object_get(obj);
642e548cdefSMauro Carvalho Chehab    +        spin_unlock_irqrestore(&cache_lock, flags);
643e548cdefSMauro Carvalho Chehab    +}
644e548cdefSMauro Carvalho Chehab    +
645e548cdefSMauro Carvalho Chehab     /* Must be holding cache_lock */
646e548cdefSMauro Carvalho Chehab     static struct object *__cache_find(int id)
647e548cdefSMauro Carvalho Chehab     {
648e548cdefSMauro Carvalho Chehab    @@ -35,6 +65,7 @@
649e548cdefSMauro Carvalho Chehab     {
650e548cdefSMauro Carvalho Chehab             BUG_ON(!obj);
651e548cdefSMauro Carvalho Chehab             list_del(&obj->list);
652e548cdefSMauro Carvalho Chehab    +        __object_put(obj);
653e548cdefSMauro Carvalho Chehab             cache_num--;
654e548cdefSMauro Carvalho Chehab     }
655e548cdefSMauro Carvalho Chehab
656e548cdefSMauro Carvalho Chehab    @@ -63,6 +94,7 @@
657220ee02aSStephen Kitt             strscpy(obj->name, name, sizeof(obj->name));
658e548cdefSMauro Carvalho Chehab             obj->id = id;
659e548cdefSMauro Carvalho Chehab             obj->popularity = 0;
660e548cdefSMauro Carvalho Chehab    +        obj->refcnt = 1; /* The cache holds a reference */
661e548cdefSMauro Carvalho Chehab
662e548cdefSMauro Carvalho Chehab             spin_lock_irqsave(&cache_lock, flags);
663e548cdefSMauro Carvalho Chehab             __cache_add(obj);
664e548cdefSMauro Carvalho Chehab    @@ -79,18 +111,15 @@
665e548cdefSMauro Carvalho Chehab             spin_unlock_irqrestore(&cache_lock, flags);
666e548cdefSMauro Carvalho Chehab     }
667e548cdefSMauro Carvalho Chehab
668e548cdefSMauro Carvalho Chehab    -int cache_find(int id, char *name)
669e548cdefSMauro Carvalho Chehab    +struct object *cache_find(int id)
670e548cdefSMauro Carvalho Chehab     {
671e548cdefSMauro Carvalho Chehab             struct object *obj;
672e548cdefSMauro Carvalho Chehab    -        int ret = -ENOENT;
673e548cdefSMauro Carvalho Chehab             unsigned long flags;
674e548cdefSMauro Carvalho Chehab
675e548cdefSMauro Carvalho Chehab             spin_lock_irqsave(&cache_lock, flags);
676e548cdefSMauro Carvalho Chehab             obj = __cache_find(id);
677e548cdefSMauro Carvalho Chehab    -        if (obj) {
678e548cdefSMauro Carvalho Chehab    -                ret = 0;
679e548cdefSMauro Carvalho Chehab    -                strcpy(name, obj->name);
680e548cdefSMauro Carvalho Chehab    -        }
681e548cdefSMauro Carvalho Chehab    +        if (obj)
682e548cdefSMauro Carvalho Chehab    +                __object_get(obj);
683e548cdefSMauro Carvalho Chehab             spin_unlock_irqrestore(&cache_lock, flags);
684e548cdefSMauro Carvalho Chehab    -        return ret;
685e548cdefSMauro Carvalho Chehab    +        return obj;
686e548cdefSMauro Carvalho Chehab     }
687e548cdefSMauro Carvalho Chehab
688e548cdefSMauro Carvalho ChehabWe encapsulate the reference counting in the standard 'get' and 'put'
689e548cdefSMauro Carvalho Chehabfunctions. Now we can return the object itself from
cache_find() which has the advantage that the user can
now sleep holding the object (eg. to copy_to_user() the name
to userspace).
693e548cdefSMauro Carvalho Chehab
694e548cdefSMauro Carvalho ChehabThe other point to note is that I said a reference should be held for
695e548cdefSMauro Carvalho Chehabevery pointer to the object: thus the reference count is 1 when first
696e548cdefSMauro Carvalho Chehabinserted into the cache. In some versions the framework does not hold a
697e548cdefSMauro Carvalho Chehabreference count, but they are more complicated.
698e548cdefSMauro Carvalho Chehab
699e548cdefSMauro Carvalho ChehabUsing Atomic Operations For The Reference Count
700e548cdefSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
701e548cdefSMauro Carvalho Chehab
702dc89fca9SMauro Carvalho ChehabIn practice, :c:type:`atomic_t` would usually be used for refcnt. There are a
number of atomic operations defined in ``include/linux/atomic.h``: these
704e548cdefSMauro Carvalho Chehabare guaranteed to be seen atomically from all CPUs in the system, so no
705e548cdefSMauro Carvalho Chehablock is required. In this case, it is simpler than using spinlocks,
706e548cdefSMauro Carvalho Chehabalthough for anything non-trivial using spinlocks is clearer. The
707b1735296SStephen Boydatomic_inc() and atomic_dec_and_test()
708e548cdefSMauro Carvalho Chehabare used instead of the standard increment and decrement operators, and
709e548cdefSMauro Carvalho Chehabthe lock is no longer used to protect the reference count itself.
710e548cdefSMauro Carvalho Chehab
711e548cdefSMauro Carvalho Chehab::
712e548cdefSMauro Carvalho Chehab
713e548cdefSMauro Carvalho Chehab    --- cache.c.refcnt  2003-12-09 15:00:35.000000000 +1100
714e548cdefSMauro Carvalho Chehab    +++ cache.c.refcnt-atomic   2003-12-11 15:49:42.000000000 +1100
715e548cdefSMauro Carvalho Chehab    @@ -7,7 +7,7 @@
716e548cdefSMauro Carvalho Chehab     struct object
717e548cdefSMauro Carvalho Chehab     {
718e548cdefSMauro Carvalho Chehab             struct list_head list;
719e548cdefSMauro Carvalho Chehab    -        unsigned int refcnt;
720e548cdefSMauro Carvalho Chehab    +        atomic_t refcnt;
721e548cdefSMauro Carvalho Chehab             int id;
722e548cdefSMauro Carvalho Chehab             char name[32];
723e548cdefSMauro Carvalho Chehab             int popularity;
724e548cdefSMauro Carvalho Chehab    @@ -18,33 +18,15 @@
725e548cdefSMauro Carvalho Chehab     static unsigned int cache_num = 0;
726e548cdefSMauro Carvalho Chehab     #define MAX_CACHE_SIZE 10
727e548cdefSMauro Carvalho Chehab
728e548cdefSMauro Carvalho Chehab    -static void __object_put(struct object *obj)
729e548cdefSMauro Carvalho Chehab    -{
730e548cdefSMauro Carvalho Chehab    -        if (--obj->refcnt == 0)
731e548cdefSMauro Carvalho Chehab    -                kfree(obj);
732e548cdefSMauro Carvalho Chehab    -}
733e548cdefSMauro Carvalho Chehab    -
734e548cdefSMauro Carvalho Chehab    -static void __object_get(struct object *obj)
735e548cdefSMauro Carvalho Chehab    -{
736e548cdefSMauro Carvalho Chehab    -        obj->refcnt++;
737e548cdefSMauro Carvalho Chehab    -}
738e548cdefSMauro Carvalho Chehab    -
739e548cdefSMauro Carvalho Chehab     void object_put(struct object *obj)
740e548cdefSMauro Carvalho Chehab     {
741e548cdefSMauro Carvalho Chehab    -        unsigned long flags;
742e548cdefSMauro Carvalho Chehab    -
743e548cdefSMauro Carvalho Chehab    -        spin_lock_irqsave(&cache_lock, flags);
744e548cdefSMauro Carvalho Chehab    -        __object_put(obj);
745e548cdefSMauro Carvalho Chehab    -        spin_unlock_irqrestore(&cache_lock, flags);
746e548cdefSMauro Carvalho Chehab    +        if (atomic_dec_and_test(&obj->refcnt))
747e548cdefSMauro Carvalho Chehab    +                kfree(obj);
748e548cdefSMauro Carvalho Chehab     }
749e548cdefSMauro Carvalho Chehab
750e548cdefSMauro Carvalho Chehab     void object_get(struct object *obj)
751e548cdefSMauro Carvalho Chehab     {
752e548cdefSMauro Carvalho Chehab    -        unsigned long flags;
753e548cdefSMauro Carvalho Chehab    -
754e548cdefSMauro Carvalho Chehab    -        spin_lock_irqsave(&cache_lock, flags);
755e548cdefSMauro Carvalho Chehab    -        __object_get(obj);
756e548cdefSMauro Carvalho Chehab    -        spin_unlock_irqrestore(&cache_lock, flags);
757e548cdefSMauro Carvalho Chehab    +        atomic_inc(&obj->refcnt);
758e548cdefSMauro Carvalho Chehab     }
759e548cdefSMauro Carvalho Chehab
760e548cdefSMauro Carvalho Chehab     /* Must be holding cache_lock */
761e548cdefSMauro Carvalho Chehab    @@ -65,7 +47,7 @@
762e548cdefSMauro Carvalho Chehab     {
763e548cdefSMauro Carvalho Chehab             BUG_ON(!obj);
764e548cdefSMauro Carvalho Chehab             list_del(&obj->list);
765e548cdefSMauro Carvalho Chehab    -        __object_put(obj);
766e548cdefSMauro Carvalho Chehab    +        object_put(obj);
767e548cdefSMauro Carvalho Chehab             cache_num--;
768e548cdefSMauro Carvalho Chehab     }
769e548cdefSMauro Carvalho Chehab
770e548cdefSMauro Carvalho Chehab    @@ -94,7 +76,7 @@
771220ee02aSStephen Kitt             strscpy(obj->name, name, sizeof(obj->name));
772e548cdefSMauro Carvalho Chehab             obj->id = id;
773e548cdefSMauro Carvalho Chehab             obj->popularity = 0;
774e548cdefSMauro Carvalho Chehab    -        obj->refcnt = 1; /* The cache holds a reference */
775e548cdefSMauro Carvalho Chehab    +        atomic_set(&obj->refcnt, 1); /* The cache holds a reference */
776e548cdefSMauro Carvalho Chehab
777e548cdefSMauro Carvalho Chehab             spin_lock_irqsave(&cache_lock, flags);
778e548cdefSMauro Carvalho Chehab             __cache_add(obj);
779e548cdefSMauro Carvalho Chehab    @@ -119,7 +101,7 @@
780e548cdefSMauro Carvalho Chehab             spin_lock_irqsave(&cache_lock, flags);
781e548cdefSMauro Carvalho Chehab             obj = __cache_find(id);
782e548cdefSMauro Carvalho Chehab             if (obj)
783e548cdefSMauro Carvalho Chehab    -                __object_get(obj);
784e548cdefSMauro Carvalho Chehab    +                object_get(obj);
785e548cdefSMauro Carvalho Chehab             spin_unlock_irqrestore(&cache_lock, flags);
786e548cdefSMauro Carvalho Chehab             return obj;
787e548cdefSMauro Carvalho Chehab     }
788e548cdefSMauro Carvalho Chehab
789e548cdefSMauro Carvalho ChehabProtecting The Objects Themselves
790e548cdefSMauro Carvalho Chehab---------------------------------
791e548cdefSMauro Carvalho Chehab
792e548cdefSMauro Carvalho ChehabIn these examples, we assumed that the objects (except the reference
793e548cdefSMauro Carvalho Chehabcounts) never changed once they are created. If we wanted to allow the
794e548cdefSMauro Carvalho Chehabname to change, there are three possibilities:
795e548cdefSMauro Carvalho Chehab
796e548cdefSMauro Carvalho Chehab-  You can make ``cache_lock`` non-static, and tell people to grab that
797e548cdefSMauro Carvalho Chehab   lock before changing the name in any object.
798e548cdefSMauro Carvalho Chehab
799b1735296SStephen Boyd-  You can provide a cache_obj_rename() which grabs this
800e548cdefSMauro Carvalho Chehab   lock and changes the name for the caller, and tell everyone to use
801e548cdefSMauro Carvalho Chehab   that function.
802e548cdefSMauro Carvalho Chehab
803e548cdefSMauro Carvalho Chehab-  You can make the ``cache_lock`` protect only the cache itself, and
804e548cdefSMauro Carvalho Chehab   use another lock to protect the name.
805e548cdefSMauro Carvalho Chehab
806e548cdefSMauro Carvalho ChehabTheoretically, you can make the locks as fine-grained as one lock for
807e548cdefSMauro Carvalho Chehabevery field, for every object. In practice, the most common variants
808e548cdefSMauro Carvalho Chehabare:
809e548cdefSMauro Carvalho Chehab
810e548cdefSMauro Carvalho Chehab-  One lock which protects the infrastructure (the ``cache`` list in
811e548cdefSMauro Carvalho Chehab   this example) and all the objects. This is what we have done so far.
812e548cdefSMauro Carvalho Chehab
813e548cdefSMauro Carvalho Chehab-  One lock which protects the infrastructure (including the list
814e548cdefSMauro Carvalho Chehab   pointers inside the objects), and one lock inside the object which
815e548cdefSMauro Carvalho Chehab   protects the rest of that object.
816e548cdefSMauro Carvalho Chehab
817e548cdefSMauro Carvalho Chehab-  Multiple locks to protect the infrastructure (eg. one lock per hash
818e548cdefSMauro Carvalho Chehab   chain), possibly with a separate per-object lock.
819e548cdefSMauro Carvalho Chehab
820e548cdefSMauro Carvalho ChehabHere is the "lock-per-object" implementation:
821e548cdefSMauro Carvalho Chehab
822e548cdefSMauro Carvalho Chehab::
823e548cdefSMauro Carvalho Chehab
824e548cdefSMauro Carvalho Chehab    --- cache.c.refcnt-atomic   2003-12-11 15:50:54.000000000 +1100
825e548cdefSMauro Carvalho Chehab    +++ cache.c.perobjectlock   2003-12-11 17:15:03.000000000 +1100
826e548cdefSMauro Carvalho Chehab    @@ -6,11 +6,17 @@
827e548cdefSMauro Carvalho Chehab
828e548cdefSMauro Carvalho Chehab     struct object
829e548cdefSMauro Carvalho Chehab     {
830e548cdefSMauro Carvalho Chehab    +        /* These two protected by cache_lock. */
831e548cdefSMauro Carvalho Chehab             struct list_head list;
832e548cdefSMauro Carvalho Chehab    +        int popularity;
833e548cdefSMauro Carvalho Chehab    +
834e548cdefSMauro Carvalho Chehab             atomic_t refcnt;
835e548cdefSMauro Carvalho Chehab    +
836e548cdefSMauro Carvalho Chehab    +        /* Doesn't change once created. */
837e548cdefSMauro Carvalho Chehab             int id;
838e548cdefSMauro Carvalho Chehab    +
839e548cdefSMauro Carvalho Chehab    +        spinlock_t lock; /* Protects the name */
840e548cdefSMauro Carvalho Chehab             char name[32];
841e548cdefSMauro Carvalho Chehab    -        int popularity;
842e548cdefSMauro Carvalho Chehab     };
843e548cdefSMauro Carvalho Chehab
844e548cdefSMauro Carvalho Chehab     static DEFINE_SPINLOCK(cache_lock);
845e548cdefSMauro Carvalho Chehab    @@ -77,6 +84,7 @@
846e548cdefSMauro Carvalho Chehab             obj->id = id;
847e548cdefSMauro Carvalho Chehab             obj->popularity = 0;
848e548cdefSMauro Carvalho Chehab             atomic_set(&obj->refcnt, 1); /* The cache holds a reference */
849e548cdefSMauro Carvalho Chehab    +        spin_lock_init(&obj->lock);
850e548cdefSMauro Carvalho Chehab
851e548cdefSMauro Carvalho Chehab             spin_lock_irqsave(&cache_lock, flags);
852e548cdefSMauro Carvalho Chehab             __cache_add(obj);
853e548cdefSMauro Carvalho Chehab
Note that I decided that the popularity count should be protected by the
855e548cdefSMauro Carvalho Chehab``cache_lock`` rather than the per-object lock: this is because it (like
856e548cdefSMauro Carvalho Chehabthe :c:type:`struct list_head <list_head>` inside the object)
857e548cdefSMauro Carvalho Chehabis logically part of the infrastructure. This way, I don't need to grab
858b1735296SStephen Boydthe lock of every object in __cache_add() when seeking
859e548cdefSMauro Carvalho Chehabthe least popular.
860e548cdefSMauro Carvalho Chehab
861e548cdefSMauro Carvalho ChehabI also decided that the id member is unchangeable, so I don't need to
862b1735296SStephen Boydgrab each object lock in __cache_find() to examine the
863e548cdefSMauro Carvalho Chehabid: the object lock is only used by a caller who wants to read or write
864e548cdefSMauro Carvalho Chehabthe name field.
865e548cdefSMauro Carvalho Chehab
866e548cdefSMauro Carvalho ChehabNote also that I added a comment describing what data was protected by
867e548cdefSMauro Carvalho Chehabwhich locks. This is extremely important, as it describes the runtime
behavior of the code, and can be hard to glean just from reading. And as
869e548cdefSMauro Carvalho ChehabAlan Cox says, “Lock data, not code”.
870e548cdefSMauro Carvalho Chehab
871e548cdefSMauro Carvalho ChehabCommon Problems
872e548cdefSMauro Carvalho Chehab===============
873e548cdefSMauro Carvalho Chehab
874e548cdefSMauro Carvalho ChehabDeadlock: Simple and Advanced
875e548cdefSMauro Carvalho Chehab-----------------------------
876e548cdefSMauro Carvalho Chehab
877e548cdefSMauro Carvalho ChehabThere is a coding bug where a piece of code tries to grab a spinlock
878e548cdefSMauro Carvalho Chehabtwice: it will spin forever, waiting for the lock to be released
879e548cdefSMauro Carvalho Chehab(spinlocks, rwlocks and mutexes are not recursive in Linux). This is
880e548cdefSMauro Carvalho Chehabtrivial to diagnose: not a
881e548cdefSMauro Carvalho Chehabstay-up-five-nights-talk-to-fluffy-code-bunnies kind of problem.
882e548cdefSMauro Carvalho Chehab
883e548cdefSMauro Carvalho ChehabFor a slightly more complex case, imagine you have a region shared by a
884b1735296SStephen Boydsoftirq and user context. If you use a spin_lock() call
885e548cdefSMauro Carvalho Chehabto protect it, it is possible that the user context will be interrupted
886e548cdefSMauro Carvalho Chehabby the softirq while it holds the lock, and the softirq will then spin
887e548cdefSMauro Carvalho Chehabforever trying to get the same lock.
888e548cdefSMauro Carvalho Chehab
Both of these are called deadlock, and as shown above, it can occur even
with a single CPU (although not on UP compiles, since spinlocks vanish
on kernel compiles with ``CONFIG_SMP``\ =n; you'll still get data
corruption in the second example).
893e548cdefSMauro Carvalho Chehab
894e548cdefSMauro Carvalho ChehabThis complete lockup is easy to diagnose: on SMP boxes the watchdog
895e548cdefSMauro Carvalho Chehabtimer or compiling with ``DEBUG_SPINLOCK`` set
896e548cdefSMauro Carvalho Chehab(``include/linux/spinlock.h``) will show this up immediately when it
897e548cdefSMauro Carvalho Chehabhappens.
898e548cdefSMauro Carvalho Chehab
899e548cdefSMauro Carvalho ChehabA more complex problem is the so-called 'deadly embrace', involving two
900e548cdefSMauro Carvalho Chehabor more locks. Say you have a hash table: each entry in the table is a
901e548cdefSMauro Carvalho Chehabspinlock, and a chain of hashed objects. Inside a softirq handler, you
902e548cdefSMauro Carvalho Chehabsometimes want to alter an object from one place in the hash to another:
903e548cdefSMauro Carvalho Chehabyou grab the spinlock of the old hash chain and the spinlock of the new
904e548cdefSMauro Carvalho Chehabhash chain, and delete the object from the old one, and insert it in the
905e548cdefSMauro Carvalho Chehabnew one.
906e548cdefSMauro Carvalho Chehab
907e548cdefSMauro Carvalho ChehabThere are two problems here. First, if your code ever tries to move the
908e548cdefSMauro Carvalho Chehabobject to the same chain, it will deadlock with itself as it tries to
909e548cdefSMauro Carvalho Chehablock it twice. Secondly, if the same softirq on another CPU is trying to
910e548cdefSMauro Carvalho Chehabmove another object in the reverse direction, the following could
911e548cdefSMauro Carvalho Chehabhappen:
912e548cdefSMauro Carvalho Chehab
913e548cdefSMauro Carvalho Chehab+-----------------------+-----------------------+
914e548cdefSMauro Carvalho Chehab| CPU 1                 | CPU 2                 |
915e548cdefSMauro Carvalho Chehab+=======================+=======================+
916e548cdefSMauro Carvalho Chehab| Grab lock A -> OK     | Grab lock B -> OK     |
917e548cdefSMauro Carvalho Chehab+-----------------------+-----------------------+
918e548cdefSMauro Carvalho Chehab| Grab lock B -> spin   | Grab lock A -> spin   |
919e548cdefSMauro Carvalho Chehab+-----------------------+-----------------------+
920e548cdefSMauro Carvalho Chehab
921e548cdefSMauro Carvalho ChehabTable: Consequences

The two CPUs will spin forever, waiting for the other to give up its
lock. It will look, smell, and feel like a crash.

Preventing Deadlock
-------------------

Textbooks will tell you that if you always lock in the same order, you
will never get this kind of deadlock. Practice will tell you that this
approach doesn't scale: when I create a new lock, I don't understand
enough of the kernel to figure out where in the 5000 lock hierarchy it
will fit.
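
When the two locks are of the same kind, though, the ordering rule can
at least be applied locally. A hedged sketch of the hash-chain move
above, ordering the two chain locks by address (the helper name is
illustrative, and real code would also need spin_lock_nested() to keep
lockdep happy)::

            static void move_object(spinlock_t *old_lock, spinlock_t *new_lock)
            {
                    if (old_lock == new_lock) {
                            /* Same chain: take the one lock only once. */
                            spin_lock(old_lock);
                    } else if (old_lock < new_lock) {
                            spin_lock(old_lock);
                            spin_lock(new_lock);
                    } else {
                            spin_lock(new_lock);
                            spin_lock(old_lock);
                    }

                    /* ... unlink from the old chain, link into the new ... */

                    if (old_lock != new_lock)
                            spin_unlock(new_lock);
                    spin_unlock(old_lock);
            }

Because both CPUs always acquire the lower-addressed lock first, the
ABBA interleaving in the table above cannot occur.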

The best locks are encapsulated: they never get exposed in headers, and
are never held around calls to non-trivial functions outside the same
file. You can read through this code and see that it will never
deadlock, because it never tries to grab another lock while it has that
one. People using your code don't even need to know you are using a
lock.

A classic problem here is when you provide callbacks or hooks: if you
call these with the lock held, you risk simple deadlock, or a deadly
embrace (who knows what the callback will do?).

Overzealous Prevention Of Deadlocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Deadlocks are problematic, but not as bad as data corruption. Code which
grabs a read lock, searches a list, fails to find what it wants, drops
the read lock, grabs a write lock and inserts the object has a race
condition.
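
The safe pattern is to re-check under the write lock, since another CPU
may have inserted the object in the window where no lock was held. A
sketch, with assumed helper names (``__find_object``,
``__insert_object`` are illustrative, not from this document's cache
example)::

            read_lock(&list_lock);
            obj = __find_object(id);
            read_unlock(&list_lock);

            if (!obj) {
                    write_lock(&list_lock);
                    /* Re-check: someone may have beaten us to it. */
                    obj = __find_object(id);
                    if (!obj)
                            __insert_object(new);
                    write_unlock(&list_lock);
            }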

Racing Timers: A Kernel Pastime
-------------------------------

Timers can produce their own special problems with races. Consider a
collection of objects (list, hash, etc) where each object has a timer
which is due to destroy it.

If you want to destroy the entire collection (say on module removal),
you might do the following::

            /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE
               HUNGARIAN NOTATION */
            spin_lock_bh(&list_lock);

            while (list) {
                    struct foo *next = list->next;
                    timer_delete(&list->timer);
                    kfree(list);
                    list = next;
            }

            spin_unlock_bh(&list_lock);


Sooner or later, this will crash on SMP, because a timer can have just
gone off before the spin_lock_bh(), and it will only get
the lock after we spin_unlock_bh(), and then try to free
the element (which has already been freed!).

This can be avoided by checking the result of
timer_delete(): if it returns 1, the timer has been deleted.
If 0, it means (in this case) that it is currently running, so we can
do::

            retry:
                    spin_lock_bh(&list_lock);

                    while (list) {
                            struct foo *next = list->next;
                            if (!timer_delete(&list->timer)) {
                                    /* Give timer a chance to delete this */
                                    spin_unlock_bh(&list_lock);
                                    goto retry;
                            }
                            kfree(list);
                            list = next;
                    }

                    spin_unlock_bh(&list_lock);


Another common problem is deleting timers which restart themselves (by
calling add_timer() at the end of their timer function).
Because this is a fairly common case which is prone to races, you should
use timer_delete_sync() (``include/linux/timer.h``) to handle this case.

Before freeing a timer, call timer_shutdown() or timer_shutdown_sync(),
which will keep it from being rearmed. Any subsequent attempt to rearm
the timer will be silently ignored by the core code.
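
A hedged sketch of the teardown order, assuming a timer embedded in an
object called ``obj`` (names illustrative)::

            /* After this returns, the callback is not running and the
               timer can never be rearmed, so the free is safe. */
            timer_shutdown_sync(&obj->timer);
            kfree(obj);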
1013a31323beSSteven Rostedt (Google)
1014a31323beSSteven Rostedt (Google)
1015e548cdefSMauro Carvalho ChehabLocking Speed
1016e548cdefSMauro Carvalho Chehab=============
1017e548cdefSMauro Carvalho Chehab
1018e548cdefSMauro Carvalho ChehabThere are three main things to worry about when considering speed of
1019e548cdefSMauro Carvalho Chehabsome code which does locking. First is concurrency: how many things are
1020e548cdefSMauro Carvalho Chehabgoing to be waiting while someone else is holding a lock. Second is the
1021e548cdefSMauro Carvalho Chehabtime taken to actually acquire and release an uncontended lock. Third is
1022e548cdefSMauro Carvalho Chehabusing fewer, or smarter locks. I'm assuming that the lock is used fairly
1023e548cdefSMauro Carvalho Chehaboften: otherwise, you wouldn't be concerned about efficiency.
1024e548cdefSMauro Carvalho Chehab
1025e548cdefSMauro Carvalho ChehabConcurrency depends on how long the lock is usually held: you should
1026e548cdefSMauro Carvalho Chehabhold the lock for as long as needed, but no longer. In the cache
1027e548cdefSMauro Carvalho Chehabexample, we always create the object without the lock held, and then
1028e548cdefSMauro Carvalho Chehabgrab the lock only when we are ready to insert it in the list.
1029e548cdefSMauro Carvalho Chehab
1030e548cdefSMauro Carvalho ChehabAcquisition times depend on how much damage the lock operations do to
1031e548cdefSMauro Carvalho Chehabthe pipeline (pipeline stalls) and how likely it is that this CPU was
1032e548cdefSMauro Carvalho Chehabthe last one to grab the lock (ie. is the lock cache-hot for this CPU):
1033e548cdefSMauro Carvalho Chehabon a machine with more CPUs, this likelihood drops fast. Consider a
1034e548cdefSMauro Carvalho Chehab700MHz Intel Pentium III: an instruction takes about 0.7ns, an atomic
1035e548cdefSMauro Carvalho Chehabincrement takes about 58ns, a lock which is cache-hot on this CPU takes
1036e548cdefSMauro Carvalho Chehab160ns, and a cacheline transfer from another CPU takes an additional 170
1037e548cdefSMauro Carvalho Chehabto 360ns. (These figures from Paul McKenney's `Linux Journal RCU
1038e548cdefSMauro Carvalho Chehabarticle <http://www.linuxjournal.com/article.php?sid=6993>`__).
1039e548cdefSMauro Carvalho Chehab
1040e548cdefSMauro Carvalho ChehabThese two aims conflict: holding a lock for a short time might be done
1041e548cdefSMauro Carvalho Chehabby splitting locks into parts (such as in our final per-object-lock
1042e548cdefSMauro Carvalho Chehabexample), but this increases the number of lock acquisitions, and the
1043e548cdefSMauro Carvalho Chehabresults are often slower than having a single lock. This is another
1044e548cdefSMauro Carvalho Chehabreason to advocate locking simplicity.
1045e548cdefSMauro Carvalho Chehab
1046e548cdefSMauro Carvalho ChehabThe third concern is addressed below: there are some methods to reduce
1047e548cdefSMauro Carvalho Chehabthe amount of locking which needs to be done.
1048e548cdefSMauro Carvalho Chehab
1049e548cdefSMauro Carvalho ChehabRead/Write Lock Variants
1050e548cdefSMauro Carvalho Chehab------------------------
1051e548cdefSMauro Carvalho Chehab
1052e548cdefSMauro Carvalho ChehabBoth spinlocks and mutexes have read/write variants: ``rwlock_t`` and
1053e548cdefSMauro Carvalho Chehab:c:type:`struct rw_semaphore <rw_semaphore>`. These divide
1054e548cdefSMauro Carvalho Chehabusers into two classes: the readers and the writers. If you are only
1055e548cdefSMauro Carvalho Chehabreading the data, you can get a read lock, but to write to the data you
1056e548cdefSMauro Carvalho Chehabneed the write lock. Many people can hold a read lock, but a writer must
1057e548cdefSMauro Carvalho Chehabbe sole holder.
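
A minimal sketch of the read/write spinlock API (the lock name is
illustrative)::

            static DEFINE_RWLOCK(my_rwlock);

            /* Readers: many can hold the lock at once. */
            read_lock(&my_rwlock);
            /* ... read the shared data ... */
            read_unlock(&my_rwlock);

            /* Writer: excludes all readers and other writers. */
            write_lock(&my_rwlock);
            /* ... modify the shared data ... */
            write_unlock(&my_rwlock);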

If your code divides neatly along reader/writer lines (as our cache code
does), and the lock is held by readers for significant lengths of time,
using these locks can help. They are slightly slower than the normal
locks though, so in practice ``rwlock_t`` is not usually worthwhile.

Avoiding Locks: Read Copy Update
--------------------------------

There is a special method of read/write locking called Read Copy Update.
Using RCU, the readers can avoid taking a lock altogether: as we expect
our cache to be read more often than updated (otherwise the cache is a
waste of time), it is a candidate for this optimization.

How do we get rid of read locks? Getting rid of read locks means that
writers may be changing the list underneath the readers. That is
actually quite simple: we can read a linked list while an element is
being added if the writer adds the element very carefully. For example,
adding ``new`` to a single linked list called ``list``::

            new->next = list->next;
            wmb();
            list->next = new;


The wmb() is a write memory barrier. It ensures that the
first operation (setting the new element's ``next`` pointer) is complete
and will be seen by all CPUs, before the second operation is (putting
the new element into the list). This is important, since modern
compilers and modern CPUs can both reorder instructions unless told
otherwise: we want a reader to either not see the new element at all, or
see the new element with the ``next`` pointer correctly pointing at the
rest of the list.

Fortunately, there is a function to do this for standard
:c:type:`struct list_head <list_head>` lists:
list_add_rcu() (``include/linux/list.h``).

Removing an element from the list is even simpler: we replace the
pointer to the old element with a pointer to its successor, and readers
will either see it, or skip over it.

::

            list->next = old->next;


There is list_del_rcu() (``include/linux/list.h``) which
does this (the normal version poisons the old object, which we don't
want).

The reader must also be careful: some CPUs can look through the ``next``
pointer to start reading the contents of the next element early, but
don't realize that the pre-fetched contents is wrong when the ``next``
pointer changes underneath them. Once again, there is a
list_for_each_entry_rcu() (``include/linux/list.h``)
to help you. Of course, writers can just use
list_for_each_entry(), since there cannot be two
simultaneous writers.

Our final dilemma is this: when can we actually destroy the removed
element? Remember, a reader might be stepping through this element in
the list right now: if we free this element and the ``next`` pointer
changes, the reader will jump off into garbage and crash. We need to
wait until we know that all the readers who were traversing the list
when we deleted the element are finished. We use
call_rcu() to register a callback which will actually
destroy the object once all pre-existing readers are finished.
Alternatively, synchronize_rcu() may be used to block
until all pre-existing readers are finished.

But how does Read Copy Update know when the readers are finished? The
method is this: firstly, the readers always traverse the list inside
rcu_read_lock()/rcu_read_unlock() pairs:
these simply disable preemption so the reader won't go to sleep while
reading the list.
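
A reader's traversal therefore looks like this sketch (using the
``cache`` list from the example below)::

            struct object *i;

            rcu_read_lock();
            list_for_each_entry_rcu(i, &cache, list) {
                    /* Examine i here, but never sleep or block. */
            }
            rcu_read_unlock();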

RCU then waits until every other CPU has slept at least once: since
readers cannot sleep, we know that any readers which were traversing the
list during the deletion are finished, and the callback is triggered.
The real Read Copy Update code is a little more optimized than this, but
this is the fundamental idea.

::

    --- cache.c.perobjectlock   2003-12-11 17:15:03.000000000 +1100
    +++ cache.c.rcupdate    2003-12-11 17:55:14.000000000 +1100
    @@ -1,15 +1,18 @@
     #include <linux/list.h>
     #include <linux/slab.h>
     #include <linux/string.h>
    +#include <linux/rcupdate.h>
     #include <linux/mutex.h>
     #include <asm/errno.h>

     struct object
     {
    -        /* These two protected by cache_lock. */
    +        /* This is protected by RCU */
             struct list_head list;
             int popularity;

    +        struct rcu_head rcu;
    +
             atomic_t refcnt;

             /* Doesn't change once created. */
    @@ -40,7 +43,7 @@
     {
             struct object *i;

    -        list_for_each_entry(i, &cache, list) {
    +        list_for_each_entry_rcu(i, &cache, list) {
                     if (i->id == id) {
                             i->popularity++;
                             return i;
    @@ -49,19 +52,25 @@
             return NULL;
     }

    +/* Final discard done once we know no readers are looking. */
    +static void cache_delete_rcu(void *arg)
    +{
    +        object_put(arg);
    +}
    +
     /* Must be holding cache_lock */
     static void __cache_delete(struct object *obj)
     {
             BUG_ON(!obj);
    -        list_del(&obj->list);
    -        object_put(obj);
    +        list_del_rcu(&obj->list);
             cache_num--;
    +        call_rcu(&obj->rcu, cache_delete_rcu);
     }

     /* Must be holding cache_lock */
     static void __cache_add(struct object *obj)
     {
    -        list_add(&obj->list, &cache);
    +        list_add_rcu(&obj->list, &cache);
             if (++cache_num > MAX_CACHE_SIZE) {
                     struct object *i, *outcast = NULL;
                     list_for_each_entry(i, &cache, list) {
    @@ -104,12 +114,11 @@
     struct object *cache_find(int id)
     {
             struct object *obj;
    -        unsigned long flags;

    -        spin_lock_irqsave(&cache_lock, flags);
    +        rcu_read_lock();
             obj = __cache_find(id);
             if (obj)
                     object_get(obj);
    -        spin_unlock_irqrestore(&cache_lock, flags);
    +        rcu_read_unlock();
             return obj;
     }

Note that the reader will alter the popularity member in
__cache_find(), and now it doesn't hold a lock. One
solution would be to make it an ``atomic_t``, but for this usage, we
don't really care about races: an approximate result is good enough, so
I didn't change it.

The result is that cache_find() requires no
synchronization with any other functions, so is almost as fast on SMP as
it would be on UP.

There is a further optimization possible here: remember our original
cache code, where there were no reference counts and the caller simply
held the lock whenever using the object? This is still possible: if you
hold the lock, no one can delete the object, so you don't need to get
and put the reference count.

Now, because the 'read lock' in RCU is simply disabling preemption, a
caller which always has preemption disabled between calling
cache_find() and object_put() does not
need to actually get and put the reference count: we could expose
__cache_find() by making it non-static, and such
callers could simply call that.

The benefit here is that the reference count is not written to: the
object is not altered in any way, which is much faster on SMP machines
due to caching.

Per-CPU Data
------------

Another technique for avoiding locking which is used fairly widely is to
duplicate information for each CPU. For example, if you wanted to keep a
count of a common condition, you could use a spin lock and a single
counter. Nice and simple.

If that was too slow (it's usually not, but if you've got a really big
machine to test on and can show that it is), you could instead use a
counter for each CPU, then none of them need an exclusive lock. See
DEFINE_PER_CPU(), get_cpu_var() and
put_cpu_var() (``include/linux/percpu.h``).

Of particular use for simple per-cpu counters is the ``local_t`` type,
and the cpu_local_inc() and related functions, which are
more efficient than simple code on some architectures
(``include/asm/local.h``).

Note that there is no simple, reliable way of getting an exact value of
such a counter, without introducing more locks. This is not a problem
for some uses.
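
A hedged sketch of such a per-CPU counter (the counter name is
illustrative)::

            static DEFINE_PER_CPU(unsigned long, my_counter);

            /* Preemption is disabled between get_cpu_var() and
               put_cpu_var(), so this CPU's copy can't change hands. */
            get_cpu_var(my_counter)++;
            put_cpu_var(my_counter);

            /* An approximate total: sum the per-CPU copies. */
            unsigned long total = 0;
            int cpu;

            for_each_possible_cpu(cpu)
                    total += per_cpu(my_counter, cpu);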

Data Which Is Mostly Used By An IRQ Handler
-------------------------------------------

If data is always accessed from within the same IRQ handler, you don't
need a lock at all: the kernel already guarantees that the irq handler
will not run simultaneously on multiple CPUs.

Manfred Spraul points out that you can still do this, even if the data
is very occasionally accessed in user context or softirqs/tasklets. The
irq handler doesn't use a lock, and all other accesses are done as so::

        mutex_lock(&lock);
        disable_irq(irq);
        ...
        enable_irq(irq);
        mutex_unlock(&lock);

The disable_irq() prevents the irq handler from running
(and waits for it to finish if it's currently running on other CPUs).
The mutex prevents any other accesses happening at the same time.
Naturally, this is slower than just a spin_lock_irq()
call, so it only makes sense if this type of access happens extremely
rarely.

What Functions Are Safe To Call From Interrupts?
================================================

Many functions in the kernel sleep (ie. call schedule()) directly or
indirectly: you can never call them while holding a spinlock, or with
preemption disabled. This also means you need to be in user context:
calling them from an interrupt is illegal.

Some Functions Which Sleep
--------------------------

The most common ones are listed below, but you usually have to read the
code to find out if other calls are safe. If everyone else who calls it
can sleep, you probably need to be able to sleep, too. In particular,
registration and deregistration functions usually expect to be called
from user context, and can sleep.

-  Accesses to userspace:

   -  copy_from_user()

   -  copy_to_user()

   -  get_user()

   -  put_user()

-  kmalloc(GFP_KERNEL)
1321e548cdefSMauro Carvalho Chehab
1322b1735296SStephen Boyd-  mutex_lock_interruptible() and
1323b1735296SStephen Boyd   mutex_lock()
1324e548cdefSMauro Carvalho Chehab
1325b1735296SStephen Boyd   There is a mutex_trylock() which does not sleep.
1326e548cdefSMauro Carvalho Chehab   Still, it must not be used inside interrupt context since its
1327b1735296SStephen Boyd   implementation is not safe for that. mutex_unlock()
1328e548cdefSMauro Carvalho Chehab   will also never sleep. It cannot be used in interrupt context either
1329e548cdefSMauro Carvalho Chehab   since a mutex must be released by the same task that acquired it.
1330e548cdefSMauro Carvalho Chehab
1331e548cdefSMauro Carvalho ChehabSome Functions Which Don't Sleep
1332e548cdefSMauro Carvalho Chehab--------------------------------
1333e548cdefSMauro Carvalho Chehab
1334e548cdefSMauro Carvalho ChehabSome functions are safe to call from any context, or holding almost any
1335e548cdefSMauro Carvalho Chehablock.
1336e548cdefSMauro Carvalho Chehab
1337b1735296SStephen Boyd-  printk()
1338e548cdefSMauro Carvalho Chehab
1339b1735296SStephen Boyd-  kfree()
1340e548cdefSMauro Carvalho Chehab
134187bdd932SThomas Gleixner-  add_timer() and timer_delete()
1342e548cdefSMauro Carvalho Chehab
1343e548cdefSMauro Carvalho ChehabMutex API reference
1344e548cdefSMauro Carvalho Chehab===================
1345e548cdefSMauro Carvalho Chehab
1346e548cdefSMauro Carvalho Chehab.. kernel-doc:: include/linux/mutex.h
1347e548cdefSMauro Carvalho Chehab   :internal:
1348e548cdefSMauro Carvalho Chehab
1349e548cdefSMauro Carvalho Chehab.. kernel-doc:: kernel/locking/mutex.c
1350e548cdefSMauro Carvalho Chehab   :export:
1351e548cdefSMauro Carvalho Chehab
1352e548cdefSMauro Carvalho ChehabFutex API reference
1353e548cdefSMauro Carvalho Chehab===================
1354e548cdefSMauro Carvalho Chehab
1355bc67f1c4SAndré Almeida.. kernel-doc:: kernel/futex/core.c
1356bc67f1c4SAndré Almeida   :internal:
1357bc67f1c4SAndré Almeida
1358bc67f1c4SAndré Almeida.. kernel-doc:: kernel/futex/futex.h
1359bc67f1c4SAndré Almeida   :internal:
1360bc67f1c4SAndré Almeida
1361bc67f1c4SAndré Almeida.. kernel-doc:: kernel/futex/pi.c
1362bc67f1c4SAndré Almeida   :internal:
1363bc67f1c4SAndré Almeida
1364bc67f1c4SAndré Almeida.. kernel-doc:: kernel/futex/requeue.c
1365bc67f1c4SAndré Almeida   :internal:
1366bc67f1c4SAndré Almeida
1367bc67f1c4SAndré Almeida.. kernel-doc:: kernel/futex/waitwake.c
1368e548cdefSMauro Carvalho Chehab   :internal:
1369e548cdefSMauro Carvalho Chehab
1370e548cdefSMauro Carvalho ChehabFurther reading
1371e548cdefSMauro Carvalho Chehab===============
1372e548cdefSMauro Carvalho Chehab
1373387b1468SMauro Carvalho Chehab-  ``Documentation/locking/spinlocks.rst``: Linus Torvalds' spinlocking
1374e548cdefSMauro Carvalho Chehab   tutorial in the kernel sources.
1375e548cdefSMauro Carvalho Chehab
1376e548cdefSMauro Carvalho Chehab-  Unix Systems for Modern Architectures: Symmetric Multiprocessing and
1377e548cdefSMauro Carvalho Chehab   Caching for Kernel Programmers:
1378e548cdefSMauro Carvalho Chehab
1379e548cdefSMauro Carvalho Chehab   Curt Schimmel's very good introduction to kernel level locking (not
1380e548cdefSMauro Carvalho Chehab   written for Linux, but nearly everything applies). The book is
1381e548cdefSMauro Carvalho Chehab   expensive, but really worth every penny to understand SMP locking.
1382e548cdefSMauro Carvalho Chehab   [ISBN: 0201633388]
1383e548cdefSMauro Carvalho Chehab
Thanks
======

Thanks to Telsa Gwynne for DocBooking, neatening and adding style.

Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul Mackerras,
Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim Waugh, Pete Zaitcev,
James Morris, Robert Love, Paul McKenney, John Ashby for proofreading,
correcting, flaming, commenting.

Thanks to the cabal for having no influence on this document.

Glossary
========

preemption
  Prior to 2.5, or when ``CONFIG_PREEMPT`` is unset, processes in user
  context inside the kernel would not preempt each other (ie. you had that
  CPU until you gave it up, except for interrupts). With the addition of
  ``CONFIG_PREEMPT`` in 2.5.4, this changed: when in user context, higher
  priority tasks can "cut in": spinlocks were changed to disable
  preemption, even on UP.

bh
  Bottom Half: for historical reasons, functions with '_bh' in their names
  now often refer to any software interrupt, e.g. spin_lock_bh()
  blocks any software interrupt on the current CPU. Bottom halves are
  deprecated, and will eventually be replaced by tasklets. Only one bottom
  half will be running at any time.

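  As a minimal sketch of the spin_lock_bh() call mentioned above: this is
  kernel-context code only (it will not build in userspace), and the
  ``counter`` data and function names are hypothetical::

          #include <linux/spinlock.h>

          static DEFINE_SPINLOCK(counter_lock);
          static int counter;     /* also touched by a softirq handler */

          /* Called in user context: spin_lock_bh() takes the lock and
           * blocks software interrupts on this CPU, so the softirq
           * handler cannot run here and deadlock against us. */
          static void counter_inc(void)
          {
                  spin_lock_bh(&counter_lock);
                  counter++;
                  spin_unlock_bh(&counter_lock);
          }
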
Hardware Interrupt / Hardware IRQ
  Hardware interrupt request. in_hardirq() returns true in a
  hardware interrupt handler.

Interrupt Context
  Not user context: processing a hardware irq or software irq. Indicated
  by the in_interrupt() macro returning true.

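  The context-detection macros above can be combined to report which
  context code is running in. A hedged sketch for kernel context only,
  assuming a kernel recent enough to provide in_hardirq(); the function
  name is hypothetical::

          #include <linux/preempt.h>
          #include <linux/printk.h>

          static void describe_context(void)
          {
                  if (in_hardirq())
                          pr_info("hardware interrupt context\n");
                  else if (in_softirq())
                          pr_info("software interrupt context\n");
                  else
                          pr_info("user context\n");
          }
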
SMP
  Symmetric Multi-Processor: kernels compiled for multiple-CPU machines.
  (``CONFIG_SMP=y``).

Software Interrupt / softirq
  Software interrupt handler. in_hardirq() returns false;
  in_softirq() returns true. Tasklets and softirqs both
  fall into the category of 'software interrupts'.

  Strictly speaking a softirq is one of up to 32 enumerated software
  interrupts which can run on multiple CPUs at once. Sometimes used to
  refer to tasklets as well (ie. all software interrupts).

tasklet
  A dynamically-registrable software interrupt, which is guaranteed to
  only run on one CPU at a time.

timer
  A dynamically-registrable software interrupt, which is run at (or close
  to) a given time. When running, it is just like a tasklet (in fact, they
  are called from the ``TIMER_SOFTIRQ``).

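  A registration sketch for such a timer, using the timer_setup() /
  mod_timer() interface of current kernels (older kernels used a
  different setup API); kernel context only, and the ``my_timer`` names
  are hypothetical::

          #include <linux/timer.h>
          #include <linux/jiffies.h>
          #include <linux/printk.h>

          static struct timer_list my_timer;

          /* Runs in software interrupt context, like a tasklet. */
          static void my_timer_fn(struct timer_list *t)
          {
                  pr_info("timer fired\n");
          }

          static void my_timer_start(void)
          {
                  timer_setup(&my_timer, my_timer_fn, 0);
                  mod_timer(&my_timer, jiffies + HZ);  /* about 1s from now */
          }
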
UP
  Uni-Processor: Non-SMP. (``CONFIG_SMP=n``).

User Context
  The kernel executing on behalf of a particular process (ie. a system
  call or trap) or kernel thread. You can tell which process with the
  ``current`` macro. Not to be confused with userspace. Can be
  interrupted by software or hardware interrupts.

Userspace
  A process executing its own code outside the kernel.