.. _memory_hotplug:

==============
Memory hotplug
==============

Memory hotplug event notifier
=============================

Hotplugging events are sent to a notification queue.

There are six types of notification defined in ``include/linux/memory.h``:

MEM_GOING_ONLINE
  Generated before new memory becomes available in order to be able to
  prepare subsystems to handle memory. The page allocator is still unable
  to allocate from the new memory.

MEM_CANCEL_ONLINE
  Generated if MEM_GOING_ONLINE fails.

MEM_ONLINE
  Generated when memory has been successfully brought online. The callback
  may allocate pages from the new memory.

MEM_GOING_OFFLINE
  Generated to begin the process of offlining memory. Allocations are no
  longer possible from the memory but some of the memory to be offlined
  is still in use. The callback can be used to free memory known to a
  subsystem from the indicated memory block.

MEM_CANCEL_OFFLINE
  Generated if MEM_GOING_OFFLINE fails. Memory is available again from
  the memory block that we attempted to offline.

MEM_OFFLINE
  Generated after offlining memory is complete.
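
The way the six events pair up can be illustrated with a small user-space
model. This is only a sketch and not kernel code: ``enum mem_event``, its
``EV_*`` values and ``model_transition()`` are hypothetical names introduced
here, not the definitions from ``include/linux/memory.h``. It assumes, as
described above, that a failing GOING_* step is followed by the matching
CANCEL_* event and a successful one by ONLINE or OFFLINE.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical user-space stand-ins for the six event types. The real
 * MEM_* constants are defined in include/linux/memory.h.
 */
enum mem_event {
	EV_GOING_ONLINE,
	EV_CANCEL_ONLINE,
	EV_ONLINE,
	EV_GOING_OFFLINE,
	EV_CANCEL_OFFLINE,
	EV_OFFLINE,
};

/*
 * Record the events that one online (to_online != 0) or offline
 * (to_online == 0) request generates in this model. going_fails != 0
 * models a failure of the GOING_* step.
 *
 * Writes two events into out[] and returns the number written.
 */
static size_t model_transition(int to_online, int going_fails,
			       enum mem_event out[2])
{
	size_t n = 0;

	assert(out != NULL);

	/* Every request starts with the GOING_* announcement. */
	out[n++] = to_online ? EV_GOING_ONLINE : EV_GOING_OFFLINE;

	/* It ends with either the cancellation or the completion event. */
	if (going_fails)
		out[n++] = to_online ? EV_CANCEL_ONLINE : EV_CANCEL_OFFLINE;
	else
		out[n++] = to_online ? EV_ONLINE : EV_OFFLINE;

	return n;
}
```

For example, ``model_transition(1, 1, ev)`` fills ``ev`` with the
GOING_ONLINE and CANCEL_ONLINE stand-ins and returns 2.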

A callback routine can be registered by calling::

  hotplug_memory_notifier(callback_func, priority)

Callback functions with higher values of priority are called before callback
functions with lower values.

A callback function must have the following prototype::

  int callback_func(
    struct notifier_block *self, unsigned long action, void *arg);

The first argument of the callback function (self) is a pointer to the block
of the notifier chain that points to the callback function itself.
The second argument (action) is one of the event types described above.
The third argument (arg) passes a pointer to struct memory_notify::

	struct memory_notify {
		unsigned long start_pfn;
		unsigned long nr_pages;
		int status_change_nid_normal;
		int status_change_nid;
	};

- start_pfn is the first page frame number of the memory being
  onlined/offlined.
- nr_pages is the number of pages of the memory being onlined/offlined.
- status_change_nid_normal is set to the node id when N_NORMAL_MEMORY of the
  nodemask is (will be) set/cleared. If this is -1, then the nodemask status
  is not changed.
- status_change_nid is set to the node id when N_MEMORY of the nodemask is
  (will be) set/cleared. This means that a new (memoryless) node gets new
  memory by onlining, or that a node loses all memory.
  If this is -1, then the nodemask status is not changed.

  If status_change_nid* >= 0, the callback should create/discard structures
  for the node if necessary.

The callback routine shall return one of the values
NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP
defined in ``include/linux/notifier.h``.

NOTIFY_DONE and NOTIFY_OK have no effect on the further processing.

NOTIFY_BAD is used as response to the MEM_GOING_ONLINE, MEM_GOING_OFFLINE,
MEM_ONLINE, or MEM_OFFLINE action to cancel hotplugging. It stops
further processing of the notification queue.

NOTIFY_STOP stops further processing of the notification queue.

Locking Internals
=================

When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
the device_hotplug_lock should be held to:

- synchronize against online/offline requests (e.g. via sysfs). This way, memory
  block devices can only be accessed (.online/.state attributes) by user
  space once memory has been fully added. And when removing memory, we
  know nobody is in critical sections.
- synchronize against CPU hotplug and similar (e.g.
  relevant for ACPI and PPC)

In particular, there is a possible lock inversion that is avoided using
device_hotplug_lock when adding memory and user space tries to online that
memory faster than expected:

- device_online() will first take the device_lock(), followed by
  mem_hotplug_lock
- add_memory_resource() will first take the mem_hotplug_lock, followed by
  the device_lock() (while creating the devices, during bus_add_device()).

As the device is visible to user space before taking the device_lock(), this
can result in a lock inversion.

Onlining/offlining of memory should be done via device_online()/
device_offline() to make sure it is properly synchronized to actions
via sysfs. Holding device_hotplug_lock is advised (e.g. to protect
online_type).

When adding/removing/onlining/offlining memory or adding/removing
heterogeneous/device memory, we should always hold the mem_hotplug_lock in
write mode to serialise memory hotplug (e.g. access to global/zone
variables).

In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read
mode allows for a quite efficient get_online_mems/put_online_mems
implementation, so code accessing memory can protect from that memory
vanishing.
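
The lock inversion between device_online() and add_memory_resource()
described earlier in this section can be checked with a small user-space
model. This is a sketch under stated assumptions, not kernel code and not how
the kernel detects such problems: ``enum lock_id``, ``lock_pos()``,
``has_outer_lock()`` and ``may_invert()`` are hypothetical names introduced
here. Each code path is modelled as the list of locks it takes, in order, and
is assumed to hold every lock until the path ends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical IDs standing in for the three locks named in the text. */
enum lock_id { DEVICE_HOTPLUG_LOCK, DEVICE_LOCK, MEM_HOTPLUG_LOCK };

/* Index of lock l in path[0..n), or -1 if the path never takes it. */
static int lock_pos(const enum lock_id *path, size_t n, enum lock_id l)
{
	for (size_t i = 0; i < n; i++)
		if (path[i] == l)
			return (int)i;
	return -1;
}

/*
 * True if some lock is taken in path a before index first_a and in path b
 * before index first_b. Such a common outer lock serialises the two paths.
 */
static bool has_outer_lock(const enum lock_id *a, size_t first_a,
			   const enum lock_id *b, size_t nb, size_t first_b)
{
	for (size_t k = 0; k < first_a; k++) {
		int p = lock_pos(b, nb, a[k]);

		if (p >= 0 && (size_t)p < first_b)
			return true;
	}
	return false;
}

/*
 * Report whether two paths take some pair of locks in opposite order
 * without first taking a common outer lock.
 */
static bool may_invert(const enum lock_id *a, size_t na,
		       const enum lock_id *b, size_t nb)
{
	assert(a != NULL && b != NULL);

	for (size_t i = 0; i < na; i++) {
		for (size_t j = i + 1; j < na; j++) {
			int bi = lock_pos(b, nb, a[i]);
			int bj = lock_pos(b, nb, a[j]);

			/* Skip pairs b does not take, or takes in the same order. */
			if (bi < 0 || bj < 0 || bj > bi)
				continue;
			if (!has_outer_lock(a, i, b, nb, (size_t)bj))
				return true;
		}
	}
	return false;
}
```

In this model, the paths ``{ DEVICE_LOCK, MEM_HOTPLUG_LOCK }`` and
``{ MEM_HOTPLUG_LOCK, DEVICE_LOCK }`` are reported as a possible inversion.
Once both paths start with ``DEVICE_HOTPLUG_LOCK``, they are not, because the
outer lock keeps the two paths from running concurrently.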