.. _kernel_hacking_lock:

===========================
Unreliable Guide To Locking
===========================

:Author: Rusty Russell

Introduction
============

Welcome, to Rusty's Remarkably Unreliable Guide to Kernel Locking
issues. This document describes the locking systems in the Linux Kernel
in 2.6.

With the wide availability of HyperThreading, and preemption in the
Linux Kernel, everyone hacking on the kernel needs to know the
fundamentals of concurrency and locking for SMP.

The Problem With Concurrency
============================

(Skip this if you know what a Race Condition is).

In a normal program, you can increment a counter like so:

::

          very_important_count++;


This is what you would expect to happen:

.. table:: Expected Results

  +------------------------------------+------------------------------------+
  | Instance 1                         | Instance 2                         |
  +====================================+====================================+
  | read very_important_count (5)      |                                    |
  +------------------------------------+------------------------------------+
  | add 1 (6)                          |                                    |
  +------------------------------------+------------------------------------+
  | write very_important_count (6)     |                                    |
  +------------------------------------+------------------------------------+
  |                                    | read very_important_count (6)      |
  +------------------------------------+------------------------------------+
  |                                    | add 1 (7)                          |
  +------------------------------------+------------------------------------+
  |                                    | write very_important_count (7)     |
  +------------------------------------+------------------------------------+

This is what might happen:

.. table:: Possible Results

  +------------------------------------+------------------------------------+
  | Instance 1                         | Instance 2                         |
  +====================================+====================================+
  | read very_important_count (5)      |                                    |
  +------------------------------------+------------------------------------+
  |                                    | read very_important_count (5)      |
  +------------------------------------+------------------------------------+
  | add 1 (6)                          |                                    |
  +------------------------------------+------------------------------------+
  |                                    | add 1 (6)                          |
  +------------------------------------+------------------------------------+
  | write very_important_count (6)     |                                    |
  +------------------------------------+------------------------------------+
  |                                    | write very_important_count (6)     |
  +------------------------------------+------------------------------------+


Race Conditions and Critical Regions
------------------------------------

This overlap, where the result depends on the relative timing of
multiple tasks, is called a race condition. The piece of code containing
the concurrency issue is called a critical region.
And especially since
Linux started running on SMP machines, race conditions have become one
of the major issues in kernel design and implementation.

Preemption can have the same effect, even if there is only one CPU: by
preempting one task during the critical region, we have exactly the same
race condition. In this case the thread which preempts might run the
critical region itself.

The solution is to recognize when these simultaneous accesses occur, and
use locks to make sure that only one instance can enter the critical
region at any time. There are many friendly primitives in the Linux
kernel to help you do this. And then there are the unfriendly
primitives, but I'll pretend they don't exist.

Locking in the Linux Kernel
===========================

If I could give you one piece of advice on locking: **keep it simple**.

Be reluctant to introduce new locks.

Two Main Types of Kernel Locks: Spinlocks and Mutexes
-----------------------------------------------------

There are two main types of kernel locks.
The fundamental type is the
spinlock (``include/asm/spinlock.h``), which is a very simple
single-holder lock: if you can't get the spinlock, you keep trying
(spinning) until you can. Spinlocks are very small and fast, and can be
used anywhere.

The second type is a mutex (``include/linux/mutex.h``): it is like a
spinlock, but you may block holding a mutex. If you can't lock a mutex,
your task will suspend itself, and be woken up when the mutex is
released. This means the CPU can do something else while you are
waiting. There are many cases when you simply can't sleep (see
`What Functions Are Safe To Call From Interrupts?`_),
and so have to use a spinlock instead.

Neither type of lock is recursive: see
`Deadlock: Simple and Advanced`_.

Locks and Uniprocessor Kernels
------------------------------

For kernels compiled without ``CONFIG_SMP``, and without
``CONFIG_PREEMPT``, spinlocks do not exist at all. This is an excellent
design decision: when no-one else can run at the same time, there is no
reason to have a lock.
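
A rough sketch of why this is safe, with hypothetical macros (these are
illustrations of the idea, not the kernel's real definitions)::

    /*
     * Sketch only: on a UP kernel there is no other CPU to spin
     * against, so a "spinlock" never needs to spin at all.
     */
    #ifdef CONFIG_PREEMPT
    # define up_spin_lock(lock)   preempt_disable() /* just stop preemption */
    # define up_spin_unlock(lock) preempt_enable()
    #else
    # define up_spin_lock(lock)   do { } while (0)  /* nothing can race us */
    # define up_spin_unlock(lock) do { } while (0)
    #endif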

If the kernel is compiled without ``CONFIG_SMP``, but ``CONFIG_PREEMPT``
is set, then spinlocks simply disable preemption, which is sufficient to
prevent any races. For most purposes, we can think of preemption as
equivalent to SMP, and not worry about it separately.

You should always test your locking code with ``CONFIG_SMP`` and
``CONFIG_PREEMPT`` enabled, even if you don't have an SMP test box,
because it will still catch some kinds of locking bugs.

Mutexes still exist, because they are required for synchronization
between user contexts, as we will see below.

Locking Only In User Context
----------------------------

If you have a data structure which is only ever accessed from user
context, then you can use a simple mutex (``include/linux/mutex.h``) to
protect it. This is the most trivial case: you initialize the mutex.
Then you can call mutex_lock_interruptible() to grab the
mutex, and mutex_unlock() to release it. There is also a
mutex_lock(), which should be avoided, because it will
not return if a signal is received.
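
As a minimal sketch (the lock and function names here are made up for
illustration), guarding a value that is only touched from system calls
looks like this::

    static DEFINE_MUTEX(cfg_lock);          /* hypothetical example lock */
    static int cfg_value;

    int set_cfg(int val)                    /* called from user context only */
    {
            if (mutex_lock_interruptible(&cfg_lock))
                    return -ERESTARTSYS;    /* a signal interrupted the wait */
            cfg_value = val;
            mutex_unlock(&cfg_lock);
            return 0;
    }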

Example: ``net/netfilter/nf_sockopt.c`` allows registration of new
setsockopt() and getsockopt() calls, with
nf_register_sockopt(). Registration and de-registration
are only done on module load and unload (and boot time, where there is
no concurrency), and the list of registrations is only consulted for an
unknown setsockopt() or getsockopt() system
call. The ``nf_sockopt_mutex`` is perfect to protect this, especially
since the setsockopt and getsockopt calls may well sleep.

Locking Between User Context and Softirqs
-----------------------------------------

If a softirq shares data with user context, you have two problems.
Firstly, the current user context can be interrupted by a softirq, and
secondly, the critical region could be entered from another CPU. This is
where spin_lock_bh() (``include/linux/spinlock.h``) is
used. It disables softirqs on that CPU, then grabs the lock.
spin_unlock_bh() does the reverse. (The '_bh' suffix is
a historical reference to "Bottom Halves", the old name for software
interrupts. It should really be called spin_lock_softirq() in a
perfect world).

Note that you can also use spin_lock_irq() or
spin_lock_irqsave() here, which stop hardware interrupts
as well: see `Hard IRQ Context`_.

This works perfectly for UP as well: the spin lock vanishes, and this
macro simply becomes local_bh_disable()
(``include/linux/interrupt.h``), which protects you from the softirq
being run.

Locking Between User Context and Tasklets
-----------------------------------------

This is exactly the same as above, because tasklets are actually run
from a softirq.

Locking Between User Context and Timers
---------------------------------------

This, too, is exactly the same as above, because timers are actually run
from a softirq. From a locking point of view, tasklets and timers are
identical.

Locking Between Tasklets/Timers
-------------------------------

Sometimes a tasklet or timer might want to share data with another
tasklet or timer.

The Same Tasklet/Timer
~~~~~~~~~~~~~~~~~~~~~~

Since a tasklet is never run on two CPUs at once, you don't need to
worry about your tasklet being reentrant (running twice at once), even
on SMP.
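
So state private to one tasklet can be touched without a lock, as in
this hypothetical sketch::

    /* Hypothetical: data owned by a single tasklet needs no locking,
     * because the core never runs the same tasklet twice in parallel. */
    struct my_state {
            unsigned int events;
    };

    static void my_tasklet_fn(unsigned long data)
    {
            struct my_state *s = (struct my_state *)data;

            s->events++;            /* safe: only this tasklet touches it */
    }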

Different Tasklets/Timers
~~~~~~~~~~~~~~~~~~~~~~~~~

If another tasklet/timer wants to share data with your tasklet or
timer, you will both need to use spin_lock() and
spin_unlock() calls. spin_lock_bh() is
unnecessary here, as you are already in a tasklet, and none will be run
on the same CPU.

Locking Between Softirqs
------------------------

Often a softirq might want to share data with itself or a tasklet/timer.

The Same Softirq
~~~~~~~~~~~~~~~~

The same softirq can run on the other CPUs: you can use a per-CPU array
(see `Per-CPU Data`_) for better performance. If you're
going so far as to use a softirq, you probably care about scalable
performance enough to justify the extra complexity.

You'll need to use spin_lock() and
spin_unlock() for shared data.
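
A hedged sketch of the per-CPU approach (hypothetical names; the
this_cpu accessors shown here are from current kernels)::

    /* Hypothetical per-CPU counter: each CPU's softirq instance updates
     * only its own copy, so the fast path needs no lock at all. */
    static DEFINE_PER_CPU(unsigned long, my_pkt_count);

    static void my_softirq_action(void)
    {
            __this_cpu_inc(my_pkt_count);   /* CPU-local: no other CPU
                                             * ever touches this copy */
    }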

Different Softirqs
~~~~~~~~~~~~~~~~~~

You'll need to use spin_lock() and
spin_unlock() for shared data, whether it be a timer,
tasklet, different softirq or the same or another softirq: any of them
could be running on a different CPU.

Hard IRQ Context
================

Hardware interrupts usually communicate with a tasklet or softirq.
Frequently this involves putting work in a queue, which the softirq will
take out.

Locking Between Hard IRQ and Softirqs/Tasklets
----------------------------------------------

If a hardware irq handler shares data with a softirq, you have two
concerns. Firstly, the softirq processing can be interrupted by a
hardware interrupt, and secondly, the critical region could be entered
by a hardware interrupt on another CPU. This is where
spin_lock_irq() is used. It is defined to disable
interrupts on that CPU, then grab the lock.
spin_unlock_irq() does the reverse.
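
For instance (hypothetical driver code, with ``struct my_work`` and
handle() standing in for the real work), the softirq side of a work list
shared with a hard IRQ handler might look like::

    static DEFINE_SPINLOCK(queue_lock);     /* hypothetical */
    static LIST_HEAD(pending);              /* filled by the IRQ handler */

    static void drain_pending(void)         /* runs in softirq context */
    {
            struct my_work *w;

            spin_lock_irq(&queue_lock);     /* keep the IRQ handler out */
            while (!list_empty(&pending)) {
                    w = list_first_entry(&pending, struct my_work, list);
                    list_del(&w->list);
                    spin_unlock_irq(&queue_lock);
                    handle(w);              /* do the slow part unlocked */
                    spin_lock_irq(&queue_lock);
            }
            spin_unlock_irq(&queue_lock);
    }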

The irq handler does not need to use spin_lock_irq(), because
the softirq cannot run while the irq handler is running: it can use
spin_lock(), which is slightly faster. The only exception
would be if a different hardware irq handler uses the same lock:
spin_lock_irq() will stop that from interrupting us.

This works perfectly for UP as well: the spin lock vanishes, and this
macro simply becomes local_irq_disable()
(``include/asm/smp.h``), which protects you from the softirq/tasklet/BH
being run.

spin_lock_irqsave() (``include/linux/spinlock.h``) is a
variant which saves whether interrupts were on or off in a flags word,
which is passed to spin_unlock_irqrestore(). This means
that the same code can be used inside a hard irq handler (where
interrupts are already off) and in softirqs (where the irq disabling is
required).

Note that softirqs (and hence tasklets and timers) are run on return
from hardware interrupts, so spin_lock_irq() also stops
these. In that sense, spin_lock_irqsave() is the most
general and powerful locking function.
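
A minimal sketch of that "works anywhere" property (the lock and counter
are hypothetical names for illustration)::

    static DEFINE_SPINLOCK(stats_lock);     /* hypothetical */
    static unsigned long error_count;

    void note_error(void)                   /* callable from any context */
    {
            unsigned long flags;

            spin_lock_irqsave(&stats_lock, flags);
            error_count++;
            spin_unlock_irqrestore(&stats_lock, flags);
    }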

Locking Between Two Hard IRQ Handlers
-------------------------------------

It is rare to have to share data between two IRQ handlers, but if you
do, spin_lock_irqsave() should be used: it is
architecture-specific whether all interrupts are disabled inside irq
handlers themselves.

Cheat Sheet For Locking
=======================

Pete Zaitcev gives the following summary:

- If you are in a process context (any syscall) and want to lock other
  processes out, use a mutex. You can take a mutex and sleep
  (``copy_from_user()`` or ``kmalloc(x,GFP_KERNEL)``).

- Otherwise (== data can be touched in an interrupt), use
  spin_lock_irqsave() and
  spin_unlock_irqrestore().

- Avoid holding a spinlock for more than 5 lines of code and across any
  function call (except accessors like readb()).

Table of Minimum Requirements
-----------------------------

The following table lists the **minimum** locking requirements between
various contexts.
In some cases, the same context can only be running on
one CPU at a time, so no locking is required for that context (eg. a
particular thread can only run on one CPU at a time, but if it needs to
share data with another thread, locking is required).

Remember the advice above: you can always use
spin_lock_irqsave(), which is a superset of all other
spinlock primitives.

============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ==============
.              IRQ Handler A IRQ Handler B Softirq A Softirq B Tasklet A Tasklet B Timer A Timer B User Context A User Context B
============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ==============
IRQ Handler A  None
IRQ Handler B  SLIS          None
Softirq A      SLI           SLI           SL
Softirq B      SLI           SLI           SL        SL
Tasklet A      SLI           SLI           SL        SL        None
Tasklet B      SLI           SLI           SL        SL        SL        None
Timer A        SLI           SLI           SL        SL        SL        SL        None
Timer B        SLI           SLI           SL        SL        SL        SL        SL      None
User Context A SLI           SLI           SLBH      SLBH      SLBH      SLBH      SLBH    SLBH    None
User Context B SLI           SLI           SLBH      SLBH      SLBH      SLBH      SLBH    SLBH    MLI            None
============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ==============

Table: Table of Locking Requirements

+--------+----------------------------+
| SLIS   | spin_lock_irqsave          |
+--------+----------------------------+
| SLI    | spin_lock_irq              |
+--------+----------------------------+
| SL     | spin_lock                  |
+--------+----------------------------+
| SLBH   | spin_lock_bh               |
+--------+----------------------------+
| MLI    | mutex_lock_interruptible   |
+--------+----------------------------+

Table: Legend for Locking Requirements Table

The trylock Functions
=====================

There are functions that try to acquire a lock only once and immediately
return a value telling about success or failure to acquire the lock.
They can be used if you need no access to the data protected with the
lock when some other thread is holding the lock. You should acquire the
lock later if you then need access to the data protected with the lock.

spin_trylock() does not spin but returns non-zero if it
acquires the spinlock on the first try or 0 if not.
This function can be
used in all contexts like spin_lock(): you must have
disabled the contexts that might interrupt you and acquire the spin
lock.

mutex_trylock() does not suspend your task but returns
non-zero if it could lock the mutex on the first try or 0 if not. This
function cannot be safely used in hardware or software interrupt
contexts despite not sleeping.

Common Examples
===============

Let's step through a simple example: a cache of number to name mappings.
The cache keeps a count of how often each of the objects is used, and
when it gets full, throws out the least used one.

All In User Context
-------------------

For our first example, we assume that all operations are in user context
(ie. from system calls), so we can sleep. This means we can use a mutex
to protect the cache and all the objects within it.
Here's the code::

    #include <linux/list.h>
    #include <linux/slab.h>
    #include <linux/string.h>
    #include <linux/mutex.h>
    #include <asm/errno.h>

    struct object
    {
            struct list_head list;
            int id;
            char name[32];
            int popularity;
    };

    /* Protects the cache, cache_num, and the objects within it */
    static DEFINE_MUTEX(cache_lock);
    static LIST_HEAD(cache);
    static unsigned int cache_num = 0;
    #define MAX_CACHE_SIZE 10

    /* Must be holding cache_lock */
    static struct object *__cache_find(int id)
    {
            struct object *i;

            list_for_each_entry(i, &cache, list)
                    if (i->id == id) {
                            i->popularity++;
                            return i;
                    }
            return NULL;
    }

    /* Must be holding cache_lock */
    static void __cache_delete(struct object *obj)
    {
            BUG_ON(!obj);
            list_del(&obj->list);
            kfree(obj);
            cache_num--;
    }

    /* Must be holding cache_lock */
    static void __cache_add(struct object *obj)
    {
            list_add(&obj->list, &cache);
            if (++cache_num > MAX_CACHE_SIZE) {
                    struct object *i, *outcast = NULL;
                    list_for_each_entry(i, &cache, list) {
                            if (!outcast || i->popularity < outcast->popularity)
                                    outcast = i;
                    }
                    __cache_delete(outcast);
            }
    }

    int cache_add(int id, const char *name)
    {
            struct object *obj;

            if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
                    return -ENOMEM;

            strscpy(obj->name, name, sizeof(obj->name));
            obj->id = id;
            obj->popularity = 0;

            mutex_lock(&cache_lock);
            __cache_add(obj);
            mutex_unlock(&cache_lock);
            return 0;
    }

    void cache_delete(int id)
    {
            mutex_lock(&cache_lock);
            __cache_delete(__cache_find(id));
            mutex_unlock(&cache_lock);
    }

    int cache_find(int id, char *name)
    {
            struct object *obj;
            int ret = -ENOENT;

            mutex_lock(&cache_lock);
            obj = __cache_find(id);
            if (obj) {
                    ret = 0;
                    strcpy(name, obj->name);
            }
            mutex_unlock(&cache_lock);
            return ret;
    }

Note that we always make sure we have the cache_lock when we add,
delete, or look up the cache: both the cache infrastructure itself and
the contents of the objects are protected by the lock. In this case it's
easy, since we copy the data for the user, and never let them access the
objects directly.

There is a slight (and common) optimization here: in
cache_add() we set up the fields of the object before
grabbing the lock. This is safe, as no-one else can access it until we
put it in cache.
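
The same one-big-lock structure can be exercised outside the kernel.
Here is a hedged userspace analogue using POSIX mutexes (an illustration
of the pattern with a fixed-size array instead of a list; it is not
kernel code)::

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    /* Userspace analogue of the cache above: one mutex protects both the
     * table and its entries, and lookups copy data out to the caller. */
    static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct { int id; char name[32]; } cache[10];
    static int cache_num;

    int cache_add(int id, const char *name)
    {
            int ret = -1;

            pthread_mutex_lock(&cache_lock);
            if (cache_num < 10) {
                    cache[cache_num].id = id;
                    snprintf(cache[cache_num].name,
                             sizeof(cache[cache_num].name), "%s", name);
                    cache_num++;
                    ret = 0;
            }
            pthread_mutex_unlock(&cache_lock);
            return ret;
    }

    int cache_find(int id, char *name)
    {
            int i, ret = -1;

            pthread_mutex_lock(&cache_lock);
            for (i = 0; i < cache_num; i++)
                    if (cache[i].id == id) {
                            strcpy(name, cache[i].name);
                            ret = 0;
                            break;
                    }
            pthread_mutex_unlock(&cache_lock);
            return ret;
    }

    int main(void)
    {
            char name[32];

            cache_add(42, "fortytwo");
            if (cache_find(42, name) == 0)
                    printf("found %s\n", name);
            return 0;
    }

As in the kernel version, the caller only ever sees a copy of the data,
so nothing can dereference an entry the lock no longer protects.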

Accessing From Interrupt Context
--------------------------------

Now consider the case where cache_find() can be called
from interrupt context: either a hardware interrupt or a softirq. An
example would be a timer which deletes objects from the cache.

The change is shown below, in standard patch format: the ``-`` are lines
which are taken away, and the ``+`` are lines which are added.

::

    --- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100
    +++ cache.c.interrupt   2003-12-09 14:07:49.000000000 +1100
    @@ -12,7 +12,7 @@
             int popularity;
     };

    -static DEFINE_MUTEX(cache_lock);
    +static DEFINE_SPINLOCK(cache_lock);
     static LIST_HEAD(cache);
     static unsigned int cache_num = 0;
     #define MAX_CACHE_SIZE 10
    @@ -55,6 +55,7 @@
     int cache_add(int id, const char *name)
     {
             struct object *obj;
    +        unsigned long flags;

             if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
                     return -ENOMEM;
    @@ -63,30 +64,33 @@
             obj->id = id;
             obj->popularity = 0;

    -        mutex_lock(&cache_lock);
    +        spin_lock_irqsave(&cache_lock, flags);
             __cache_add(obj);
    -        mutex_unlock(&cache_lock);
    +        spin_unlock_irqrestore(&cache_lock, flags);
             return 0;
     }

     void cache_delete(int id)
     {
    -        mutex_lock(&cache_lock);
    +        unsigned long flags;
    +
    +        spin_lock_irqsave(&cache_lock, flags);
             __cache_delete(__cache_find(id));
    -        mutex_unlock(&cache_lock);
    +        spin_unlock_irqrestore(&cache_lock, flags);
     }

     int cache_find(int id, char *name)
     {
             struct object *obj;
             int ret = -ENOENT;
    +        unsigned long flags;

    -        mutex_lock(&cache_lock);
    +        spin_lock_irqsave(&cache_lock, flags);
             obj = __cache_find(id);
             if (obj) {
                     ret = 0;
                     strcpy(name, obj->name);
             }
    -        mutex_unlock(&cache_lock);
    +        spin_unlock_irqrestore(&cache_lock, flags);
             return ret;
     }

Note that spin_lock_irqsave() will turn off interrupts
if they are on; otherwise (if we are already in an interrupt handler) it
does nothing. Hence these functions are safe to call from any
context.

Unfortunately, cache_add() calls kmalloc()
with the ``GFP_KERNEL`` flag, which is only legal in user context. I
have assumed that cache_add() is still only called in
user context, otherwise this should become a parameter to
cache_add().

Exposing Objects Outside This File
----------------------------------

If our objects contained more information, it might not be sufficient to
copy the information in and out: other parts of the code might want to
keep pointers to these objects, for example, rather than looking up the
id every time. This produces two problems.

The first problem is that we use the ``cache_lock`` to protect objects:
we'd need to make this non-static so the rest of the code can use it.
This makes locking trickier, as it is no longer all in one place.

The second problem is the lifetime problem: if another structure keeps a
pointer to an object, it presumably expects that pointer to remain
valid. Unfortunately, this is only guaranteed while you hold the lock,
otherwise someone might call cache_delete() and even
worse, add another object, re-using the same address.

As there is only one lock, you can't hold it forever: no-one else would
get any work done.

The solution to this problem is to use a reference count: everyone who
has a pointer to the object increases it when they first get the object,
and drops the reference count when they're finished with it. Whoever
drops it to zero knows it is unused, and can actually delete it.
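The get/put discipline itself is independent of the kernel: here is a userspace C11 sketch of the idea using ``<stdatomic.h>`` (a sketch only; the kernel code that follows protects the count with ``cache_lock`` first, and only later switches to atomics, and the ``refobj_*`` names here are invented):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct refobj {
        atomic_int refcnt;
        int payload;
};

/* Creator starts with one reference. */
struct refobj *refobj_new(int payload)
{
        struct refobj *obj = malloc(sizeof(*obj));

        if (obj) {
                atomic_init(&obj->refcnt, 1);
                obj->payload = payload;
        }
        return obj;
}

/* Take a reference before stashing a pointer to the object. */
void refobj_get(struct refobj *obj)
{
        atomic_fetch_add(&obj->refcnt, 1);
}

/* Drop a reference; whoever drops the count to zero frees the object.
 * Returns 1 if this put freed it. */
int refobj_put(struct refobj *obj)
{
        if (atomic_fetch_sub(&obj->refcnt, 1) == 1) {
                free(obj);
                return 1;
        }
        return 0;
}
```

Every pointer held corresponds to exactly one reference, so no holder can ever see the object freed out from under it.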

Here is the code::

    --- cache.c.interrupt   2003-12-09 14:25:43.000000000 +1100
    +++ cache.c.refcnt      2003-12-09 14:33:05.000000000 +1100
    @@ -7,6 +7,7 @@
     struct object
     {
             struct list_head list;
    +        unsigned int refcnt;
             int id;
             char name[32];
             int popularity;
    @@ -17,6 +18,35 @@
     static unsigned int cache_num = 0;
     #define MAX_CACHE_SIZE 10

    +static void __object_put(struct object *obj)
    +{
    +        if (--obj->refcnt == 0)
    +                kfree(obj);
    +}
    +
    +static void __object_get(struct object *obj)
    +{
    +        obj->refcnt++;
    +}
    +
    +void object_put(struct object *obj)
    +{
    +        unsigned long flags;
    +
    +        spin_lock_irqsave(&cache_lock, flags);
    +        __object_put(obj);
    +        spin_unlock_irqrestore(&cache_lock, flags);
    +}
    +
    +void object_get(struct object *obj)
    +{
    +        unsigned long flags;
    +
    +        spin_lock_irqsave(&cache_lock, flags);
    +        __object_get(obj);
    +        spin_unlock_irqrestore(&cache_lock, flags);
    +}
    +
     /* Must be holding cache_lock */
     static struct object *__cache_find(int id)
     {
    @@ -35,6 +65,7 @@
     {
             BUG_ON(!obj);
             list_del(&obj->list);
    +        __object_put(obj);
             cache_num--;
     }

    @@ -63,6 +94,7 @@
             strscpy(obj->name, name, sizeof(obj->name));
             obj->id = id;
             obj->popularity = 0;
    +        obj->refcnt = 1;  /* The cache holds a reference */

             spin_lock_irqsave(&cache_lock, flags);
             __cache_add(obj);
    @@ -79,18 +111,15 @@
             spin_unlock_irqrestore(&cache_lock, flags);
     }

    -int cache_find(int id, char *name)
    +struct object *cache_find(int id)
     {
             struct object *obj;
    -        int ret = -ENOENT;
             unsigned long flags;

             spin_lock_irqsave(&cache_lock, flags);
             obj = __cache_find(id);
    -        if (obj) {
    -                ret = 0;
    -                strcpy(name, obj->name);
    -        }
    +        if (obj)
    +                __object_get(obj);
             spin_unlock_irqrestore(&cache_lock, flags);
    -        return ret;
    +        return obj;
     }

We encapsulate the reference counting in the standard 'get' and 'put'
functions. Now we can return the object itself from
cache_find() which has the advantage that the user can
now sleep holding the object (eg. to copy_to_user() the
name to userspace).

The other point to note is that I said a reference should be held for
every pointer to the object: thus the reference count is 1 when first
inserted into the cache. In some versions the framework does not hold a
reference count, but they are more complicated.

Using Atomic Operations For The Reference Count
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In practice, :c:type:`atomic_t` would usually be used for refcnt.
There are a
number of atomic operations defined in ``include/asm/atomic.h``: these
are guaranteed to be seen atomically from all CPUs in the system, so no
lock is required. In this case, it is simpler than using spinlocks,
although for anything non-trivial using spinlocks is clearer. The
atomic_inc() and atomic_dec_and_test()
are used instead of the standard increment and decrement operators, and
the lock is no longer used to protect the reference count itself.

::

    --- cache.c.refcnt      2003-12-09 15:00:35.000000000 +1100
    +++ cache.c.refcnt-atomic       2003-12-11 15:49:42.000000000 +1100
    @@ -7,7 +7,7 @@
     struct object
     {
             struct list_head list;
    -        unsigned int refcnt;
    +        atomic_t refcnt;
             int id;
             char name[32];
             int popularity;
    @@ -18,33 +18,15 @@
     static unsigned int cache_num = 0;
     #define MAX_CACHE_SIZE 10

    -static void __object_put(struct object *obj)
    -{
    -        if (--obj->refcnt == 0)
    -                kfree(obj);
    -}
    -
    -static void
__object_get(struct object *obj)
    -{
    -        obj->refcnt++;
    -}
    -
     void object_put(struct object *obj)
     {
    -        unsigned long flags;
    -
    -        spin_lock_irqsave(&cache_lock, flags);
    -        __object_put(obj);
    -        spin_unlock_irqrestore(&cache_lock, flags);
    +        if (atomic_dec_and_test(&obj->refcnt))
    +                kfree(obj);
     }

     void object_get(struct object *obj)
     {
    -        unsigned long flags;
    -
    -        spin_lock_irqsave(&cache_lock, flags);
    -        __object_get(obj);
    -        spin_unlock_irqrestore(&cache_lock, flags);
    +        atomic_inc(&obj->refcnt);
     }

     /* Must be holding cache_lock */
    @@ -65,7 +47,7 @@
     {
             BUG_ON(!obj);
             list_del(&obj->list);
    -        __object_put(obj);
    +        object_put(obj);
             cache_num--;
     }

    @@ -94,7 +76,7 @@
             strscpy(obj->name, name, sizeof(obj->name));
             obj->id = id;
             obj->popularity = 0;
    -        obj->refcnt = 1;  /* The cache holds a reference */
    +        atomic_set(&obj->refcnt, 1);  /* The cache holds a reference */

             spin_lock_irqsave(&cache_lock, flags);
             __cache_add(obj);
    @@ -119,7 +101,7 @@
             spin_lock_irqsave(&cache_lock, flags);
             obj = __cache_find(id);
             if (obj)
    -                __object_get(obj);
    +                object_get(obj);
             spin_unlock_irqrestore(&cache_lock, flags);
             return obj;
     }

Protecting The Objects Themselves
---------------------------------

In these examples, we assumed that the objects (except the reference
counts) never changed once they are created. If we wanted to allow the
name to change, there are three possibilities:

- You can make ``cache_lock`` non-static, and tell people to grab that
  lock before changing the name in any object.

- You can provide a cache_obj_rename() which grabs this
  lock and changes the name for the caller, and tell everyone to use
  that function.

- You can make the ``cache_lock`` protect only the cache itself, and
  use another lock to protect the name.

Theoretically, you can make the locks as fine-grained as one lock for
every field, for every object. In practice, the most common variants
are:

- One lock which protects the infrastructure (the ``cache`` list in
  this example) and all the objects. This is what we have done so far.

- One lock which protects the infrastructure (including the list
  pointers inside the objects), and one lock inside the object which
  protects the rest of that object.

- Multiple locks to protect the infrastructure (eg. one lock per hash
  chain), possibly with a separate per-object lock.

Here is the "lock-per-object" implementation:

::

    --- cache.c.refcnt-atomic       2003-12-11 15:50:54.000000000 +1100
    +++ cache.c.perobjectlock       2003-12-11 17:15:03.000000000 +1100
    @@ -6,11 +6,17 @@

     struct object
     {
    +        /* These two protected by cache_lock. */
             struct list_head list;
    +        int popularity;
    +
             atomic_t refcnt;
    +
    +        /* Doesn't change once created. */
             int id;
    +
    +        spinlock_t lock;  /* Protects the name */
             char name[32];
    -        int popularity;
     };

     static DEFINE_SPINLOCK(cache_lock);
    @@ -77,6 +84,7 @@
             obj->id = id;
             obj->popularity = 0;
             atomic_set(&obj->refcnt, 1);  /* The cache holds a reference */
    +        spin_lock_init(&obj->lock);

             spin_lock_irqsave(&cache_lock, flags);
             __cache_add(obj);

Note that I decided that the popularity count should be protected by the
``cache_lock`` rather than the per-object lock: this is because it (like
the :c:type:`struct list_head <list_head>` inside the object)
is logically part of the infrastructure. This way, I don't need to grab
the lock of every object in __cache_add() when seeking
the least popular.
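In userspace terms the same split looks roughly like this: a sketch only, with pthread primitives and C11 atomics standing in for the kernel ones, a plain ``next`` pointer standing in for list_head, and the helper name ``pobject_rename`` invented for illustration:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

struct pobject {
        /* These two would be protected by the global cache lock. */
        struct pobject *next;
        int popularity;

        atomic_int refcnt;

        /* Doesn't change once created. */
        int id;

        pthread_mutex_t lock;   /* Protects the name */
        char name[32];
};

/* Renaming needs only the per-object lock: the cache lock is never
 * touched, so the infrastructure stays available to other CPUs. */
void pobject_rename(struct pobject *obj, const char *name)
{
        pthread_mutex_lock(&obj->lock);
        snprintf(obj->name, sizeof(obj->name), "%s", name);
        pthread_mutex_unlock(&obj->lock);
}
```

The layout mirrors the diff above: list membership and popularity under the global lock, the immutable id under no lock, and the mutable name under the object's own lock.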

I also decided that the id member is unchangeable, so I don't need to
grab each object lock in __cache_find() to examine the
id: the object lock is only used by a caller who wants to read or write
the name field.

Note also that I added a comment describing what data was protected by
which locks. This is extremely important, as it describes the runtime
behavior of the code, and can be hard to gain from just reading. And as
Alan Cox says, “Lock data, not code”.

Common Problems
===============

Deadlock: Simple and Advanced
-----------------------------

There is a coding bug where a piece of code tries to grab a spinlock
twice: it will spin forever, waiting for the lock to be released
(spinlocks, rwlocks and mutexes are not recursive in Linux). This is
trivial to diagnose: not a
stay-up-five-nights-talk-to-fluffy-code-bunnies kind of problem.

For a slightly more complex case, imagine you have a region shared by a
softirq and user context.
If you use a spin_lock() call
to protect it, it is possible that the user context will be interrupted
by the softirq while it holds the lock, and the softirq will then spin
forever trying to get the same lock.

Both of these are called deadlock, and as shown above, it can occur even
with a single CPU (although not on UP compiles, since spinlocks vanish
on kernel compiles with ``CONFIG_SMP``\ =n. You'll still get data
corruption in the second example).

This complete lockup is easy to diagnose: on SMP boxes the watchdog
timer or compiling with ``DEBUG_SPINLOCK`` set
(``include/linux/spinlock.h``) will show this up immediately when it
happens.

A more complex problem is the so-called 'deadly embrace', involving two
or more locks. Say you have a hash table: each entry in the table is a
spinlock, and a chain of hashed objects. Inside a softirq handler, you
sometimes want to alter an object from one place in the hash to another:
you grab the spinlock of the old hash chain and the spinlock of the new
hash chain, and delete the object from the old one, and insert it in the
new one.

There are two problems here.
First, if your code ever tries to move the
object to the same chain, it will deadlock with itself as it tries to
lock it twice. Secondly, if the same softirq on another CPU is trying to
move another object in the reverse direction, the following could
happen:

+-----------------------+-----------------------+
| CPU 1                 | CPU 2                 |
+=======================+=======================+
| Grab lock A -> OK     | Grab lock B -> OK     |
+-----------------------+-----------------------+
| Grab lock B -> spin   | Grab lock A -> spin   |
+-----------------------+-----------------------+

Table: Consequences

The two CPUs will spin forever, waiting for the other to give up their
lock. It will look, smell, and feel like a crash.

Preventing Deadlock
-------------------

Textbooks will tell you that if you always lock in the same order, you
will never get this kind of deadlock. Practice will tell you that this
approach doesn't scale: when I create a new lock, I don't understand
enough of the kernel to figure out where in the 5000 lock hierarchy it
will fit.
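Locally, though, ordering a pair of same-type locks by address does work for cases like the hash-chain move above, where both locks come from the same array and neither is "naturally" first. Here is a userspace sketch with pthread mutexes; the helper names are invented, and the same-lock check guards against the move-to-same-chain self-deadlock described earlier:

```c
#include <assert.h>
#include <pthread.h>

/* Acquire two mutexes in a globally consistent (address) order, so two
 * CPUs moving objects in opposite directions always contend in the same
 * order and cannot deadly-embrace. */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
        if (a == b) {
                /* Same chain: taking the lock twice would self-deadlock. */
                pthread_mutex_lock(a);
                return;
        }
        if (a > b) {
                pthread_mutex_t *tmp = a;
                a = b;
                b = tmp;
        }
        pthread_mutex_lock(a);
        pthread_mutex_lock(b);
}

void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
        pthread_mutex_unlock(a);
        if (a != b)
                pthread_mutex_unlock(b);
}
```

Callers may pass the two chain locks in either order; the helper normalizes the acquisition order internally, so the convention cannot be violated by accident.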

The best locks are encapsulated: they never get exposed in headers, and
are never held around calls to non-trivial functions outside the same
file. You can read through this code and see that it will never
deadlock, because it never tries to grab another lock while it has that
one. People using your code don't even need to know you are using a
lock.

A classic problem here is when you provide callbacks or hooks: if you
call these with the lock held, you risk simple deadlock, or a deadly
embrace (who knows what the callback will do?).

Overzealous Prevention Of Deadlocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Deadlocks are problematic, but not as bad as data corruption. Code which
grabs a read lock, searches a list, fails to find what it wants, drops
the read lock, grabs a write lock and inserts the object has a race
condition.

Racing Timers: A Kernel Pastime
-------------------------------

Timers can produce their own special problems with races. Consider a
collection of objects (list, hash, etc) where each object has a timer
which is due to destroy it.

If you want to destroy the entire collection (say on module removal),
you might do the following::

    /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE
       HUNGARIAN NOTATION */
    spin_lock_bh(&list_lock);

    while (list) {
            struct foo *next = list->next;
            timer_delete(&list->timer);
            kfree(list);
            list = next;
    }

    spin_unlock_bh(&list_lock);


Sooner or later, this will crash on SMP, because a timer can have just
gone off before the spin_lock_bh(), and it will only get
the lock after we spin_unlock_bh(), and then try to free
the element (which has already been freed!).

This can be avoided by checking the result of
timer_delete(): if it returns 1, the timer has been deleted.
985e548cdefSMauro Carvalho ChehabIf 0, it means (in this case) that it is currently running, so we can 986e548cdefSMauro Carvalho Chehabdo:: 987e548cdefSMauro Carvalho Chehab 988e548cdefSMauro Carvalho Chehab retry: 989e548cdefSMauro Carvalho Chehab spin_lock_bh(&list_lock); 990e548cdefSMauro Carvalho Chehab 991e548cdefSMauro Carvalho Chehab while (list) { 992e548cdefSMauro Carvalho Chehab struct foo *next = list->next; 99387bdd932SThomas Gleixner if (!timer_delete(&list->timer)) { 994e548cdefSMauro Carvalho Chehab /* Give timer a chance to delete this */ 995e548cdefSMauro Carvalho Chehab spin_unlock_bh(&list_lock); 996e548cdefSMauro Carvalho Chehab goto retry; 997e548cdefSMauro Carvalho Chehab } 998e548cdefSMauro Carvalho Chehab kfree(list); 999e548cdefSMauro Carvalho Chehab list = next; 1000e548cdefSMauro Carvalho Chehab } 1001e548cdefSMauro Carvalho Chehab 1002e548cdefSMauro Carvalho Chehab spin_unlock_bh(&list_lock); 1003e548cdefSMauro Carvalho Chehab 1004e548cdefSMauro Carvalho Chehab 1005e548cdefSMauro Carvalho ChehabAnother common problem is deleting timers which restart themselves (by 1006b1735296SStephen Boydcalling add_timer() at the end of their timer function). 1007e548cdefSMauro Carvalho ChehabBecause this is a fairly common case which is prone to races, you should 100887bdd932SThomas Gleixneruse timer_delete_sync() (``include/linux/timer.h``) to handle this case. 1009e548cdefSMauro Carvalho Chehab 1010a31323beSSteven Rostedt (Google)Before freeing a timer, timer_shutdown() or timer_shutdown_sync() should be 1011a31323beSSteven Rostedt (Google)called which will keep it from being rearmed. Any subsequent attempt to 1012a31323beSSteven Rostedt (Google)rearm the timer will be silently ignored by the core code. 

Locking Speed
=============

There are three main things to worry about when considering speed of
some code which does locking. First is concurrency: how many things are
going to be waiting while someone else is holding a lock. Second is the
time taken to actually acquire and release an uncontended lock. Third is
using fewer, or smarter locks. I'm assuming that the lock is used fairly
often: otherwise, you wouldn't be concerned about efficiency.

Concurrency depends on how long the lock is usually held: you should
hold the lock for as long as needed, but no longer. In the cache
example, we always create the object without the lock held, and then
grab the lock only when we are ready to insert it in the list.

Acquisition times depend on how much damage the lock operations do to
the pipeline (pipeline stalls) and how likely it is that this CPU was
the last one to grab the lock (ie. is the lock cache-hot for this CPU):
on a machine with more CPUs, this likelihood drops fast.
Consider a
700MHz Intel Pentium III: an instruction takes about 0.7ns, an atomic
increment takes about 58ns, a lock which is cache-hot on this CPU takes
160ns, and a cacheline transfer from another CPU takes an additional 170
to 360ns. (These figures from Paul McKenney's `Linux Journal RCU
article <http://www.linuxjournal.com/article.php?sid=6993>`__).

These two aims conflict: holding a lock for a short time might be done
by splitting locks into parts (such as in our final per-object-lock
example), but this increases the number of lock acquisitions, and the
results are often slower than having a single lock. This is another
reason to advocate locking simplicity.

The third concern is addressed below: there are some methods to reduce
the amount of locking which needs to be done.

Read/Write Lock Variants
------------------------

Both spinlocks and mutexes have read/write variants: ``rwlock_t`` and
:c:type:`struct rw_semaphore <rw_semaphore>`. These divide
users into two classes: the readers and the writers. If you are only
reading the data, you can get a read lock, but to write to the data you
need the write lock.
Many people can hold
a read lock, but a writer must be sole holder.

If your code divides neatly along reader/writer lines (as our cache code
does), and the lock is held by readers for significant lengths of time,
using these locks can help. They are slightly slower than the normal
locks though, so in practice ``rwlock_t`` is not usually worthwhile.

Avoiding Locks: Read Copy Update
--------------------------------

There is a special method of read/write locking called Read Copy Update.
Using RCU, the readers can avoid taking a lock altogether: as we expect
our cache to be read more often than updated (otherwise the cache is a
waste of time), it is a candidate for this optimization.

How do we get rid of read locks? Getting rid of read locks means that
writers may be changing the list underneath the readers. That is
actually quite simple: we can read a linked list while an element is
being added if the writer adds the element very carefully.
For example,
adding ``new`` to a single linked list called ``list``::

        new->next = list->next;
        wmb();
        list->next = new;


The wmb() is a write memory barrier. It ensures that the
first operation (setting the new element's ``next`` pointer) is complete
and will be seen by all CPUs before the second operation (putting
the new element into the list) happens. This is important, since modern
compilers and modern CPUs can both reorder instructions unless told
otherwise: we want a reader to either not see the new element at all, or
see the new element with the ``next`` pointer correctly pointing at the
rest of the list.

Fortunately, there is a function to do this for standard
:c:type:`struct list_head <list_head>` lists:
list_add_rcu() (``include/linux/list.h``).

Removing an element from the list is even simpler: we replace the
pointer to the old element with a pointer to its successor, and readers
will either see it, or skip over it.

::

        list->next = old->next;


There is list_del_rcu() (``include/linux/list.h``) which
does this (the normal version poisons the old object, which we don't
want).

The reader must also be careful: some CPUs can look through the ``next``
pointer to start reading the contents of the next element early, but
don't realize that the pre-fetched contents are wrong when the ``next``
pointer changes underneath them. Once again, there is a
list_for_each_entry_rcu() (``include/linux/list.h``)
to help you. Of course, writers can just use
list_for_each_entry(), since there cannot be two
simultaneous writers.

Our final dilemma is this: when can we actually destroy the removed
element? Remember, a reader might be stepping through this element in
the list right now: if we free this element and the ``next`` pointer
changes, the reader will jump off into garbage and crash. We need to
wait until we know that all the readers who were traversing the list
when we deleted the element are finished. We use
call_rcu() to register a callback which will actually
destroy the object once all pre-existing readers are finished.
Alternatively, synchronize_rcu() may be used to block
until all pre-existing readers are finished.

But how does Read Copy Update know when the readers are finished? The
method is this: firstly, the readers always traverse the list inside
rcu_read_lock()/rcu_read_unlock() pairs:
these simply disable preemption so the reader won't go to sleep while
reading the list.

RCU then waits until every other CPU has slept at least once: since
readers cannot sleep, we know that any readers which were traversing the
list during the deletion are finished, and the callback is triggered.
The real Read Copy Update code is a little more optimized than this, but
this is the fundamental idea.
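The deferral idea — callbacks registered with call_rcu() run only once no
pre-existing reader remains — can be illustrated with a tiny userspace
model. All the ``toy_*`` names below are invented for illustration, and
this single-threaded sketch tracks readers with a plain counter; the real
implementation is per-CPU and far more subtle.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of struct rcu_head: callback plus a pending-list link. */
struct toy_rcu_head {
	void (*func)(struct toy_rcu_head *head);
	struct toy_rcu_head *next;
};

static int toy_readers;			/* readers inside a read-side section */
static struct toy_rcu_head *toy_pending; /* callbacks awaiting "grace period" */
static int destroyed;			/* set by the demo callback below */

static void toy_read_lock(void)
{
	toy_readers++;
}

/* Models call_rcu(): queue the callback, don't run it yet. */
static void toy_call_rcu(struct toy_rcu_head *head,
			 void (*func)(struct toy_rcu_head *head))
{
	head->func = func;
	head->next = toy_pending;
	toy_pending = head;
}

/* A "grace period" ends when the last reader leaves: run the callbacks. */
static void toy_read_unlock(void)
{
	if (--toy_readers == 0) {
		while (toy_pending) {
			struct toy_rcu_head *head = toy_pending;

			toy_pending = head->next;
			head->func(head);
		}
	}
}

/* Demo callback: the "final discard" of an element. */
static void destroy_cb(struct toy_rcu_head *head)
{
	(void)head;
	destroyed = 1;
}
```

With a reader still inside toy_read_lock()/toy_read_unlock(), a queued
callback sits in ``toy_pending``; it fires only after the reader exits,
which is exactly the ordering guarantee the real call_rcu() provides.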

::

    --- cache.c.perobjectlock	2003-12-11 17:15:03.000000000 +1100
    +++ cache.c.rcupdate	2003-12-11 17:55:14.000000000 +1100
    @@ -1,15 +1,18 @@
     #include <linux/list.h>
     #include <linux/slab.h>
     #include <linux/string.h>
    +#include <linux/rcupdate.h>
     #include <linux/mutex.h>
     #include <asm/errno.h>

     struct object
     {
    -        /* These two protected by cache_lock. */
    +        /* This is protected by RCU */
             struct list_head list;
             int popularity;

    +        struct rcu_head rcu;
    +
             atomic_t refcnt;

             /* Doesn't change once created. */
    @@ -40,7 +43,7 @@
     {
             struct object *i;

    -        list_for_each_entry(i, &cache, list) {
    +        list_for_each_entry_rcu(i, &cache, list) {
                     if (i->id == id) {
                             i->popularity++;
                             return i;
    @@ -49,19 +52,25 @@
             return NULL;
     }

    +/* Final discard done once we know no readers are looking. */
    +static void cache_delete_rcu(struct rcu_head *rcu)
    +{
    +        object_put(container_of(rcu, struct object, rcu));
    +}
    +
     /* Must be holding cache_lock */
     static void __cache_delete(struct object *obj)
     {
             BUG_ON(!obj);
    -        list_del(&obj->list);
    -        object_put(obj);
    +        list_del_rcu(&obj->list);
             cache_num--;
    +        call_rcu(&obj->rcu, cache_delete_rcu);
     }

     /* Must be holding cache_lock */
     static void __cache_add(struct object *obj)
     {
    -        list_add(&obj->list, &cache);
    +        list_add_rcu(&obj->list, &cache);
             if (++cache_num > MAX_CACHE_SIZE) {
                     struct object *i, *outcast = NULL;
                     list_for_each_entry(i, &cache, list) {
    @@ -104,12 +114,11 @@
     struct object *cache_find(int id)
     {
             struct object *obj;
    -        unsigned long flags;

    -        spin_lock_irqsave(&cache_lock, flags);
    +        rcu_read_lock();
             obj = __cache_find(id);
             if (obj)
                     object_get(obj);
    -        spin_unlock_irqrestore(&cache_lock, flags);
    +        rcu_read_unlock();
             return obj;
     }

Note that the reader will alter the popularity member in
__cache_find(), and now it doesn't hold a lock. One
solution would be to make it an ``atomic_t``, but for this usage, we
don't really care about races: an approximate result is good enough, so
I didn't change it.

The result is that cache_find() requires no
synchronization with any other functions, so is almost as fast on SMP as
it would be on UP.
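The publication step this all rests on — set the new element's ``next``,
barrier, then link it in — can be modeled in userspace with C11
release/acquire atomics. This is a toy analogue, not the kernel API:
``struct node``, ``publish()`` and ``first()`` are invented names, and the
release store stands in for the wmb()/list_add_rcu() pairing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* A singly linked list node whose link pointer is atomic. */
struct node {
	int id;
	struct node *_Atomic next;
};

/* Writer: fill in the element, then make it reachable. The release
 * store orders step 1 before step 2, like wmb() in the text. */
static void publish(struct node *head, struct node *new)
{
	/* 1. point the new element at the rest of the list ... */
	atomic_store_explicit(&new->next,
			      atomic_load_explicit(&head->next,
						   memory_order_relaxed),
			      memory_order_relaxed);
	/* 2. ... then make it visible to readers. */
	atomic_store_explicit(&head->next, new, memory_order_release);
}

/* Reader: the acquire load pairs with the writer's release store, so a
 * reader that sees the new element also sees its initialized contents. */
static struct node *first(struct node *head)
{
	return atomic_load_explicit(&head->next, memory_order_acquire);
}
```

A reader calling ``first()`` either sees the old list or the fully
initialized new element — never a half-built one, which is the property
the text asks for.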

There is a further optimization possible here: remember our original
cache code, where there were no reference counts and the caller simply
held the lock whenever using the object? This is still possible: if you
hold the lock, no one can delete the object, so you don't need to get
and put the reference count.

Now, because the 'read lock' in RCU is simply disabling preemption, a
caller which always has preemption disabled between calling
cache_find() and object_put() does not
need to actually get and put the reference count: we could expose
__cache_find() by making it non-static, and such
callers could simply call that.

The benefit here is that the reference count is not written to: the
object is not altered in any way, which is much faster on SMP machines
due to caching.

Per-CPU Data
------------

Another technique for avoiding locking which is used fairly widely is to
duplicate information for each CPU. For example, if you wanted to keep a
count of a common condition, you could use a spin lock and a single
counter. Nice and simple.
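That baseline — one lock, one counter — looks like this in userspace C,
with a POSIX mutex standing in for the kernel spinlock (the names
``count_event()`` and ``read_count()`` are invented for illustration):

```c
#include <assert.h>
#include <pthread.h>

/* One lock protecting one shared counter: the simple baseline. */
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long event_count;

static void count_event(void)
{
	pthread_mutex_lock(&count_lock);
	event_count++;
	pthread_mutex_unlock(&count_lock);
}

static unsigned long read_count(void)
{
	unsigned long v;

	pthread_mutex_lock(&count_lock);
	v = event_count;
	pthread_mutex_unlock(&count_lock);
	return v;
}
```

Every increment from every CPU bounces the same lock and cacheline
around, which is exactly the cost the per-CPU approach below avoids.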

If that was too slow (it's usually not, but if you've got a really big
machine to test on and can show that it is), you could instead use a
counter for each CPU, then none of them need an exclusive lock. See
DEFINE_PER_CPU(), get_cpu_var() and
put_cpu_var() (``include/linux/percpu.h``).

Of particular use for simple per-cpu counters is the ``local_t`` type,
and the cpu_local_inc() and related functions, which are
more efficient than simple code on some architectures
(``include/asm/local.h``).

Note that there is no simple, reliable way of getting an exact value of
such a counter, without introducing more locks. This is not a problem
for some uses.

Data Which Is Mostly Used By An IRQ Handler
-------------------------------------------

If data is always accessed from within the same IRQ handler, you don't
need a lock at all: the kernel already guarantees that the irq handler
will not run simultaneously on multiple CPUs.

Manfred Spraul points out that you can still do this, even if the data
is very occasionally accessed in user context or softirqs/tasklets.
The
irq handler doesn't use a lock, and all other accesses are done like so::

        mutex_lock(&lock);
        disable_irq(irq);
        ...
        enable_irq(irq);
        mutex_unlock(&lock);

The disable_irq() prevents the irq handler from running
(and waits for it to finish if it's currently running on other CPUs).
The mutex prevents any other accesses happening at the same time.
Naturally, this is slower than just a spin_lock_irq()
call, so it only makes sense if this type of access happens extremely
rarely.

What Functions Are Safe To Call From Interrupts?
================================================

Many functions in the kernel sleep (ie. call schedule()) directly or
indirectly: you can never call them while holding a spinlock, or with
preemption disabled. This also means you need to be in user context:
calling them from an interrupt is illegal.

Some Functions Which Sleep
--------------------------

The most common ones are listed below, but you usually have to read the
code to find out if other calls are safe.
If everyone else
who calls it can sleep, you probably need to be able to sleep, too. In
particular, registration and deregistration functions usually expect to
be called from user context, and can sleep.

- Accesses to userspace:

  - copy_from_user()

  - copy_to_user()

  - get_user()

  - put_user()

- kmalloc(GFP_KERNEL)

- mutex_lock_interruptible() and
  mutex_lock()

  There is a mutex_trylock() which does not sleep.
  Still, it must not be used inside interrupt context since its
  implementation is not safe for that. mutex_unlock()
  will also never sleep. It cannot be used in interrupt context either
  since a mutex must be released by the same task that acquired it.

Some Functions Which Don't Sleep
--------------------------------

Some functions are safe to call from any context, or holding almost any
lock.

- printk()

- kfree()

- add_timer() and timer_delete()

Mutex API reference
===================

.. kernel-doc:: include/linux/mutex.h
   :internal:

.. kernel-doc:: kernel/locking/mutex.c
   :export:

Futex API reference
===================

.. kernel-doc:: kernel/futex/core.c
   :internal:

.. kernel-doc:: kernel/futex/futex.h
   :internal:

.. kernel-doc:: kernel/futex/pi.c
   :internal:

.. kernel-doc:: kernel/futex/requeue.c
   :internal:

.. kernel-doc:: kernel/futex/waitwake.c
   :internal:

Further reading
===============

- ``Documentation/locking/spinlocks.rst``: Linus Torvalds' spinlocking
  tutorial in the kernel sources.

- Unix Systems for Modern Architectures: Symmetric Multiprocessing and
  Caching for Kernel Programmers:

  Curt Schimmel's very good introduction to kernel level locking (not
  written for Linux, but nearly everything applies). The book is
  expensive, but really worth every penny to understand SMP locking.
  [ISBN: 0201633388]

Thanks
======

Thanks to Telsa Gwynne for DocBooking, neatening and adding style.

Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul Mackerras,
Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim Waugh, Pete Zaitcev,
James Morris, Robert Love, Paul McKenney, John Ashby for proofreading,
correcting, flaming, commenting.

Thanks to the cabal for having no influence on this document.

Glossary
========

preemption
  Prior to 2.5, or when ``CONFIG_PREEMPT`` is unset, processes in user
  context inside the kernel would not preempt each other (ie. you had that
  CPU until you gave it up, except for interrupts).
  With the addition of
  ``CONFIG_PREEMPT`` in 2.5.4, this changed: when in user context, higher
  priority tasks can "cut in": spinlocks were changed to disable
  preemption, even on UP.

bh
  Bottom Half: for historical reasons, functions with '_bh' in them often
  now refer to any software interrupt, e.g. spin_lock_bh()
  blocks any software interrupt on the current CPU. Bottom halves are
  deprecated, and will eventually be replaced by tasklets. Only one bottom
  half will be running at any time.

Hardware Interrupt / Hardware IRQ
  Hardware interrupt request. in_hardirq() returns true in a
  hardware interrupt handler.

Interrupt Context
  Not user context: processing a hardware irq or software irq. Indicated
  by the in_interrupt() macro returning true.

SMP
  Symmetric Multi-Processor: kernels compiled for multiple-CPU machines.
  (``CONFIG_SMP=y``).

Software Interrupt / softirq
  Software interrupt handler. in_hardirq() returns false;
  in_softirq() returns true. Tasklets and softirqs both
  fall into the category of 'software interrupts'.
  Strictly speaking a softirq is one of up to 32 enumerated software
  interrupts which can run on multiple CPUs at once. Sometimes used to
  refer to tasklets as well (ie. all software interrupts).

tasklet
  A dynamically-registrable software interrupt, which is guaranteed to
  only run on one CPU at a time.

timer
  A dynamically-registrable software interrupt, which is run at (or close
  to) a given time. When running, it is just like a tasklet (in fact, they
  are called from the ``TIMER_SOFTIRQ``).

UP
  Uni-Processor: Non-SMP. (``CONFIG_SMP=n``).

User Context
  The kernel executing on behalf of a particular process (ie. a system
  call or trap) or kernel thread. You can tell which process with the
  ``current`` macro. Not to be confused with userspace. Can be
  interrupted by software or hardware interrupts.

Userspace
  A process executing its own code outside the kernel.