1c3cbf1a7SSilvio Fricke======================================== 2c3cbf1a7SSilvio FrickeGeneric Associative Array Implementation 3c3cbf1a7SSilvio Fricke======================================== 4c3cbf1a7SSilvio Fricke 5c3cbf1a7SSilvio FrickeOverview 6c3cbf1a7SSilvio Fricke======== 7c3cbf1a7SSilvio Fricke 8c3cbf1a7SSilvio FrickeThis associative array implementation is an object container with the following 9c3cbf1a7SSilvio Frickeproperties: 10c3cbf1a7SSilvio Fricke 11c3cbf1a7SSilvio Fricke1. Objects are opaque pointers. The implementation does not care where they 12c3cbf1a7SSilvio Fricke point (if anywhere) or what they point to (if anything). 13e3bb40c0SMarkus Heiser 14e3bb40c0SMarkus Heiser .. note:: 15e3bb40c0SMarkus Heiser 16e3bb40c0SMarkus Heiser Pointers to objects _must_ be zero in the least significant bit. 17c3cbf1a7SSilvio Fricke 18c3cbf1a7SSilvio Fricke2. Objects do not need to contain linkage blocks for use by the array. This 19c3cbf1a7SSilvio Fricke permits an object to be located in multiple arrays simultaneously. 20c3cbf1a7SSilvio Fricke Rather, the array is made up of metadata blocks that point to objects. 21c3cbf1a7SSilvio Fricke 22c3cbf1a7SSilvio Fricke3. Objects require index keys to locate them within the array. 23c3cbf1a7SSilvio Fricke 24c3cbf1a7SSilvio Fricke4. Index keys must be unique. Inserting an object with the same key as one 25c3cbf1a7SSilvio Fricke already in the array will replace the old object. 26c3cbf1a7SSilvio Fricke 27c3cbf1a7SSilvio Fricke5. Index keys can be of any length and can be of different lengths. 28c3cbf1a7SSilvio Fricke 29c3cbf1a7SSilvio Fricke6. Index keys should encode the length early on, before any variation due to 30c3cbf1a7SSilvio Fricke length is seen. 31c3cbf1a7SSilvio Fricke 32c3cbf1a7SSilvio Fricke7. Index keys can include a hash to scatter objects throughout the array. 33c3cbf1a7SSilvio Fricke 34c3cbf1a7SSilvio Fricke8. The array can iterated over. The objects will not necessarily come out in 35c3cbf1a7SSilvio Fricke key order. 36c3cbf1a7SSilvio Fricke 37*806654a9SWill Deacon9. The array can be iterated over while it is being modified, provided the 38c3cbf1a7SSilvio Fricke RCU readlock is being held by the iterator. Note, however, under these 39c3cbf1a7SSilvio Fricke circumstances, some objects may be seen more than once. If this is a 40c3cbf1a7SSilvio Fricke problem, the iterator should lock against modification. Objects will not 41c3cbf1a7SSilvio Fricke be missed, however, unless deleted. 42c3cbf1a7SSilvio Fricke 43c3cbf1a7SSilvio Fricke10. Objects in the array can be looked up by means of their index key. 44c3cbf1a7SSilvio Fricke 45*806654a9SWill Deacon11. Objects can be looked up while the array is being modified, provided the 46c3cbf1a7SSilvio Fricke RCU readlock is being held by the thread doing the look up. 47c3cbf1a7SSilvio Fricke 48c3cbf1a7SSilvio FrickeThe implementation uses a tree of 16-pointer nodes internally that are indexed 49c3cbf1a7SSilvio Frickeon each level by nibbles from the index key in the same manner as in a radix 50c3cbf1a7SSilvio Fricketree. To improve memory efficiency, shortcuts can be emplaced to skip over 51c3cbf1a7SSilvio Frickewhat would otherwise be a series of single-occupancy nodes. Further, nodes 52c3cbf1a7SSilvio Frickepack leaf object pointers into spare space in the node rather than making an 53c3cbf1a7SSilvio Frickeextra branch until as such time an object needs to be added to a full node. 54c3cbf1a7SSilvio Fricke 55c3cbf1a7SSilvio Fricke 56c3cbf1a7SSilvio FrickeThe Public API 57c3cbf1a7SSilvio Fricke============== 58c3cbf1a7SSilvio Fricke 59c3cbf1a7SSilvio FrickeThe public API can be found in ``<linux/assoc_array.h>``. The associative 60c3cbf1a7SSilvio Frickearray is rooted on the following structure:: 61c3cbf1a7SSilvio Fricke 62c3cbf1a7SSilvio Fricke struct assoc_array { 63c3cbf1a7SSilvio Fricke ... 64c3cbf1a7SSilvio Fricke }; 65c3cbf1a7SSilvio Fricke 66c3cbf1a7SSilvio FrickeThe code is selected by enabling ``CONFIG_ASSOCIATIVE_ARRAY`` with:: 67c3cbf1a7SSilvio Fricke 68c3cbf1a7SSilvio Fricke ./script/config -e ASSOCIATIVE_ARRAY 69c3cbf1a7SSilvio Fricke 70c3cbf1a7SSilvio Fricke 71c3cbf1a7SSilvio FrickeEdit Script 72c3cbf1a7SSilvio Fricke----------- 73c3cbf1a7SSilvio Fricke 74c3cbf1a7SSilvio FrickeThe insertion and deletion functions produce an 'edit script' that can later be 75c3cbf1a7SSilvio Frickeapplied to effect the changes without risking ``ENOMEM``. This retains the 76c3cbf1a7SSilvio Frickepreallocated metadata blocks that will be installed in the internal tree and 77c3cbf1a7SSilvio Frickekeeps track of the metadata blocks that will be removed from the tree when the 78c3cbf1a7SSilvio Frickescript is applied. 79c3cbf1a7SSilvio Fricke 80c3cbf1a7SSilvio FrickeThis is also used to keep track of dead blocks and dead objects after the 81c3cbf1a7SSilvio Frickescript has been applied so that they can be freed later. The freeing is done 82c3cbf1a7SSilvio Frickeafter an RCU grace period has passed - thus allowing access functions to 83c3cbf1a7SSilvio Frickeproceed under the RCU read lock. 84c3cbf1a7SSilvio Fricke 85c3cbf1a7SSilvio FrickeThe script appears as outside of the API as a pointer of the type:: 86c3cbf1a7SSilvio Fricke 87c3cbf1a7SSilvio Fricke struct assoc_array_edit; 88c3cbf1a7SSilvio Fricke 89c3cbf1a7SSilvio FrickeThere are two functions for dealing with the script: 90c3cbf1a7SSilvio Fricke 91c3cbf1a7SSilvio Fricke1. Apply an edit script:: 92c3cbf1a7SSilvio Fricke 93c3cbf1a7SSilvio Fricke void assoc_array_apply_edit(struct assoc_array_edit *edit); 94c3cbf1a7SSilvio Fricke 95c3cbf1a7SSilvio FrickeThis will perform the edit functions, interpolating various write barriers 96c3cbf1a7SSilvio Fricketo permit accesses under the RCU read lock to continue. The edit script 97c3cbf1a7SSilvio Frickewill then be passed to ``call_rcu()`` to free it and any dead stuff it points 98c3cbf1a7SSilvio Fricketo. 99c3cbf1a7SSilvio Fricke 100c3cbf1a7SSilvio Fricke2. Cancel an edit script:: 101c3cbf1a7SSilvio Fricke 102c3cbf1a7SSilvio Fricke void assoc_array_cancel_edit(struct assoc_array_edit *edit); 103c3cbf1a7SSilvio Fricke 104c3cbf1a7SSilvio FrickeThis frees the edit script and all preallocated memory immediately. If 105c3cbf1a7SSilvio Frickethis was for insertion, the new object is _not_ released by this function, 106c3cbf1a7SSilvio Frickebut must rather be released by the caller. 107c3cbf1a7SSilvio Fricke 108c3cbf1a7SSilvio FrickeThese functions are guaranteed not to fail. 109c3cbf1a7SSilvio Fricke 110c3cbf1a7SSilvio Fricke 111c3cbf1a7SSilvio FrickeOperations Table 112c3cbf1a7SSilvio Fricke---------------- 113c3cbf1a7SSilvio Fricke 114c3cbf1a7SSilvio FrickeVarious functions take a table of operations:: 115c3cbf1a7SSilvio Fricke 116c3cbf1a7SSilvio Fricke struct assoc_array_ops { 117c3cbf1a7SSilvio Fricke ... 118c3cbf1a7SSilvio Fricke }; 119c3cbf1a7SSilvio Fricke 120c3cbf1a7SSilvio FrickeThis points to a number of methods, all of which need to be provided: 121c3cbf1a7SSilvio Fricke 122c3cbf1a7SSilvio Fricke1. Get a chunk of index key from caller data:: 123c3cbf1a7SSilvio Fricke 124c3cbf1a7SSilvio Fricke unsigned long (*get_key_chunk)(const void *index_key, int level); 125c3cbf1a7SSilvio Fricke 126c3cbf1a7SSilvio FrickeThis should return a chunk of caller-supplied index key starting at the 127c3cbf1a7SSilvio Fricke*bit* position given by the level argument. The level argument will be a 128c3cbf1a7SSilvio Frickemultiple of ``ASSOC_ARRAY_KEY_CHUNK_SIZE`` and the function should return 129c3cbf1a7SSilvio Fricke``ASSOC_ARRAY_KEY_CHUNK_SIZE bits``. No error is possible. 130c3cbf1a7SSilvio Fricke 131c3cbf1a7SSilvio Fricke 132c3cbf1a7SSilvio Fricke2. Get a chunk of an object's index key:: 133c3cbf1a7SSilvio Fricke 134c3cbf1a7SSilvio Fricke unsigned long (*get_object_key_chunk)(const void *object, int level); 135c3cbf1a7SSilvio Fricke 136c3cbf1a7SSilvio FrickeAs the previous function, but gets its data from an object in the array 137c3cbf1a7SSilvio Frickerather than from a caller-supplied index key. 138c3cbf1a7SSilvio Fricke 139c3cbf1a7SSilvio Fricke 140c3cbf1a7SSilvio Fricke3. See if this is the object we're looking for:: 141c3cbf1a7SSilvio Fricke 142c3cbf1a7SSilvio Fricke bool (*compare_object)(const void *object, const void *index_key); 143c3cbf1a7SSilvio Fricke 144c3cbf1a7SSilvio FrickeCompare the object against an index key and return ``true`` if it matches and 145c3cbf1a7SSilvio Fricke``false`` if it doesn't. 146c3cbf1a7SSilvio Fricke 147c3cbf1a7SSilvio Fricke 148c3cbf1a7SSilvio Fricke4. Diff the index keys of two objects:: 149c3cbf1a7SSilvio Fricke 150c3cbf1a7SSilvio Fricke int (*diff_objects)(const void *object, const void *index_key); 151c3cbf1a7SSilvio Fricke 152c3cbf1a7SSilvio FrickeReturn the bit position at which the index key of the specified object 153c3cbf1a7SSilvio Frickediffers from the given index key or -1 if they are the same. 154c3cbf1a7SSilvio Fricke 155c3cbf1a7SSilvio Fricke 156c3cbf1a7SSilvio Fricke5. Free an object:: 157c3cbf1a7SSilvio Fricke 158c3cbf1a7SSilvio Fricke void (*free_object)(void *object); 159c3cbf1a7SSilvio Fricke 160c3cbf1a7SSilvio FrickeFree the specified object. Note that this may be called an RCU grace period 161c3cbf1a7SSilvio Frickeafter ``assoc_array_apply_edit()`` was called, so ``synchronize_rcu()`` may be 162c3cbf1a7SSilvio Frickenecessary on module unloading. 163c3cbf1a7SSilvio Fricke 164c3cbf1a7SSilvio Fricke 165c3cbf1a7SSilvio FrickeManipulation Functions 166c3cbf1a7SSilvio Fricke---------------------- 167c3cbf1a7SSilvio Fricke 168c3cbf1a7SSilvio FrickeThere are a number of functions for manipulating an associative array: 169c3cbf1a7SSilvio Fricke 170c3cbf1a7SSilvio Fricke1. Initialise an associative array:: 171c3cbf1a7SSilvio Fricke 172c3cbf1a7SSilvio Fricke void assoc_array_init(struct assoc_array *array); 173c3cbf1a7SSilvio Fricke 174c3cbf1a7SSilvio FrickeThis initialises the base structure for an associative array. It can't fail. 175c3cbf1a7SSilvio Fricke 176c3cbf1a7SSilvio Fricke 177c3cbf1a7SSilvio Fricke2. Insert/replace an object in an associative array:: 178c3cbf1a7SSilvio Fricke 179c3cbf1a7SSilvio Fricke struct assoc_array_edit * 180c3cbf1a7SSilvio Fricke assoc_array_insert(struct assoc_array *array, 181c3cbf1a7SSilvio Fricke const struct assoc_array_ops *ops, 182c3cbf1a7SSilvio Fricke const void *index_key, 183c3cbf1a7SSilvio Fricke void *object); 184c3cbf1a7SSilvio Fricke 185c3cbf1a7SSilvio FrickeThis inserts the given object into the array. Note that the least 186c3cbf1a7SSilvio Frickesignificant bit of the pointer must be zero as it's used to type-mark 187c3cbf1a7SSilvio Frickepointers internally. 188c3cbf1a7SSilvio Fricke 189c3cbf1a7SSilvio FrickeIf an object already exists for that key then it will be replaced with the 190c3cbf1a7SSilvio Frickenew object and the old one will be freed automatically. 191c3cbf1a7SSilvio Fricke 192c3cbf1a7SSilvio FrickeThe ``index_key`` argument should hold index key information and is 193c3cbf1a7SSilvio Frickepassed to the methods in the ops table when they are called. 194c3cbf1a7SSilvio Fricke 195c3cbf1a7SSilvio FrickeThis function makes no alteration to the array itself, but rather returns 196c3cbf1a7SSilvio Frickean edit script that must be applied. ``-ENOMEM`` is returned in the case of 197c3cbf1a7SSilvio Frickean out-of-memory error. 198c3cbf1a7SSilvio Fricke 199c3cbf1a7SSilvio FrickeThe caller should lock exclusively against other modifiers of the array. 200c3cbf1a7SSilvio Fricke 201c3cbf1a7SSilvio Fricke 202c3cbf1a7SSilvio Fricke3. Delete an object from an associative array:: 203c3cbf1a7SSilvio Fricke 204c3cbf1a7SSilvio Fricke struct assoc_array_edit * 205c3cbf1a7SSilvio Fricke assoc_array_delete(struct assoc_array *array, 206c3cbf1a7SSilvio Fricke const struct assoc_array_ops *ops, 207c3cbf1a7SSilvio Fricke const void *index_key); 208c3cbf1a7SSilvio Fricke 209c3cbf1a7SSilvio FrickeThis deletes an object that matches the specified data from the array. 210c3cbf1a7SSilvio Fricke 211c3cbf1a7SSilvio FrickeThe ``index_key`` argument should hold index key information and is 212c3cbf1a7SSilvio Frickepassed to the methods in the ops table when they are called. 213c3cbf1a7SSilvio Fricke 214c3cbf1a7SSilvio FrickeThis function makes no alteration to the array itself, but rather returns 215c3cbf1a7SSilvio Frickean edit script that must be applied. ``-ENOMEM`` is returned in the case of 216c3cbf1a7SSilvio Frickean out-of-memory error. ``NULL`` will be returned if the specified object is 217c3cbf1a7SSilvio Frickenot found within the array. 218c3cbf1a7SSilvio Fricke 219c3cbf1a7SSilvio FrickeThe caller should lock exclusively against other modifiers of the array. 220c3cbf1a7SSilvio Fricke 221c3cbf1a7SSilvio Fricke 222c3cbf1a7SSilvio Fricke4. Delete all objects from an associative array:: 223c3cbf1a7SSilvio Fricke 224c3cbf1a7SSilvio Fricke struct assoc_array_edit * 225c3cbf1a7SSilvio Fricke assoc_array_clear(struct assoc_array *array, 226c3cbf1a7SSilvio Fricke const struct assoc_array_ops *ops); 227c3cbf1a7SSilvio Fricke 228c3cbf1a7SSilvio FrickeThis deletes all the objects from an associative array and leaves it 229c3cbf1a7SSilvio Frickecompletely empty. 230c3cbf1a7SSilvio Fricke 231c3cbf1a7SSilvio FrickeThis function makes no alteration to the array itself, but rather returns 232c3cbf1a7SSilvio Frickean edit script that must be applied. ``-ENOMEM`` is returned in the case of 233c3cbf1a7SSilvio Frickean out-of-memory error. 234c3cbf1a7SSilvio Fricke 235c3cbf1a7SSilvio FrickeThe caller should lock exclusively against other modifiers of the array. 236c3cbf1a7SSilvio Fricke 237c3cbf1a7SSilvio Fricke 238c3cbf1a7SSilvio Fricke5. Destroy an associative array, deleting all objects:: 239c3cbf1a7SSilvio Fricke 240c3cbf1a7SSilvio Fricke void assoc_array_destroy(struct assoc_array *array, 241c3cbf1a7SSilvio Fricke const struct assoc_array_ops *ops); 242c3cbf1a7SSilvio Fricke 243c3cbf1a7SSilvio FrickeThis destroys the contents of the associative array and leaves it 244c3cbf1a7SSilvio Frickecompletely empty. It is not permitted for another thread to be traversing 245c3cbf1a7SSilvio Frickethe array under the RCU read lock at the same time as this function is 246c3cbf1a7SSilvio Frickedestroying it as no RCU deferral is performed on memory release - 247c3cbf1a7SSilvio Frickesomething that would require memory to be allocated. 248c3cbf1a7SSilvio Fricke 249c3cbf1a7SSilvio FrickeThe caller should lock exclusively against other modifiers and accessors 250c3cbf1a7SSilvio Frickeof the array. 251c3cbf1a7SSilvio Fricke 252c3cbf1a7SSilvio Fricke 253c3cbf1a7SSilvio Fricke6. Garbage collect an associative array:: 254c3cbf1a7SSilvio Fricke 255c3cbf1a7SSilvio Fricke int assoc_array_gc(struct assoc_array *array, 256c3cbf1a7SSilvio Fricke const struct assoc_array_ops *ops, 257c3cbf1a7SSilvio Fricke bool (*iterator)(void *object, void *iterator_data), 258c3cbf1a7SSilvio Fricke void *iterator_data); 259c3cbf1a7SSilvio Fricke 260c3cbf1a7SSilvio FrickeThis iterates over the objects in an associative array and passes each one to 261c3cbf1a7SSilvio Fricke``iterator()``. If ``iterator()`` returns ``true``, the object is kept. If it 262c3cbf1a7SSilvio Frickereturns ``false``, the object will be freed. If the ``iterator()`` function 263c3cbf1a7SSilvio Frickereturns ``true``, it must perform any appropriate refcount incrementing on the 264c3cbf1a7SSilvio Frickeobject before returning. 265c3cbf1a7SSilvio Fricke 266c3cbf1a7SSilvio FrickeThe internal tree will be packed down if possible as part of the iteration 267c3cbf1a7SSilvio Fricketo reduce the number of nodes in it. 268c3cbf1a7SSilvio Fricke 269c3cbf1a7SSilvio FrickeThe ``iterator_data`` is passed directly to ``iterator()`` and is otherwise 270c3cbf1a7SSilvio Frickeignored by the function. 271c3cbf1a7SSilvio Fricke 272c3cbf1a7SSilvio FrickeThe function will return ``0`` if successful and ``-ENOMEM`` if there wasn't 273c3cbf1a7SSilvio Frickeenough memory. 274c3cbf1a7SSilvio Fricke 275c3cbf1a7SSilvio FrickeIt is possible for other threads to iterate over or search the array under 276*806654a9SWill Deaconthe RCU read lock while this function is in progress. The caller should 277c3cbf1a7SSilvio Frickelock exclusively against other modifiers of the array. 278c3cbf1a7SSilvio Fricke 279c3cbf1a7SSilvio Fricke 280c3cbf1a7SSilvio FrickeAccess Functions 281c3cbf1a7SSilvio Fricke---------------- 282c3cbf1a7SSilvio Fricke 283c3cbf1a7SSilvio FrickeThere are two functions for accessing an associative array: 284c3cbf1a7SSilvio Fricke 285c3cbf1a7SSilvio Fricke1. Iterate over all the objects in an associative array:: 286c3cbf1a7SSilvio Fricke 287c3cbf1a7SSilvio Fricke int assoc_array_iterate(const struct assoc_array *array, 288c3cbf1a7SSilvio Fricke int (*iterator)(const void *object, 289c3cbf1a7SSilvio Fricke void *iterator_data), 290c3cbf1a7SSilvio Fricke void *iterator_data); 291c3cbf1a7SSilvio Fricke 292c3cbf1a7SSilvio FrickeThis passes each object in the array to the iterator callback function. 293c3cbf1a7SSilvio Fricke``iterator_data`` is private data for that function. 294c3cbf1a7SSilvio Fricke 295c3cbf1a7SSilvio FrickeThis may be used on an array at the same time as the array is being 296c3cbf1a7SSilvio Frickemodified, provided the RCU read lock is held. Under such circumstances, 297c3cbf1a7SSilvio Frickeit is possible for the iteration function to see some objects twice. If 298c3cbf1a7SSilvio Frickethis is a problem, then modification should be locked against. The 299c3cbf1a7SSilvio Frickeiteration algorithm should not, however, miss any objects. 300c3cbf1a7SSilvio Fricke 301c3cbf1a7SSilvio FrickeThe function will return ``0`` if no objects were in the array or else it will 302c3cbf1a7SSilvio Frickereturn the result of the last iterator function called. Iteration stops 303c3cbf1a7SSilvio Frickeimmediately if any call to the iteration function results in a non-zero 304c3cbf1a7SSilvio Frickereturn. 305c3cbf1a7SSilvio Fricke 306c3cbf1a7SSilvio Fricke 307c3cbf1a7SSilvio Fricke2. Find an object in an associative array:: 308c3cbf1a7SSilvio Fricke 309c3cbf1a7SSilvio Fricke void *assoc_array_find(const struct assoc_array *array, 310c3cbf1a7SSilvio Fricke const struct assoc_array_ops *ops, 311c3cbf1a7SSilvio Fricke const void *index_key); 312c3cbf1a7SSilvio Fricke 313c3cbf1a7SSilvio FrickeThis walks through the array's internal tree directly to the object 314c3cbf1a7SSilvio Frickespecified by the index key.. 315c3cbf1a7SSilvio Fricke 316c3cbf1a7SSilvio FrickeThis may be used on an array at the same time as the array is being 317c3cbf1a7SSilvio Frickemodified, provided the RCU read lock is held. 318c3cbf1a7SSilvio Fricke 319c3cbf1a7SSilvio FrickeThe function will return the object if found (and set ``*_type`` to the object 320c3cbf1a7SSilvio Fricketype) or will return ``NULL`` if the object was not found. 321c3cbf1a7SSilvio Fricke 322c3cbf1a7SSilvio Fricke 323c3cbf1a7SSilvio FrickeIndex Key Form 324c3cbf1a7SSilvio Fricke-------------- 325c3cbf1a7SSilvio Fricke 326c3cbf1a7SSilvio FrickeThe index key can be of any form, but since the algorithms aren't told how long 327c3cbf1a7SSilvio Frickethe key is, it is strongly recommended that the index key includes its length 328c3cbf1a7SSilvio Frickevery early on before any variation due to the length would have an effect on 329c3cbf1a7SSilvio Frickecomparisons. 330c3cbf1a7SSilvio Fricke 331c3cbf1a7SSilvio FrickeThis will cause leaves with different length keys to scatter away from each 332c3cbf1a7SSilvio Frickeother - and those with the same length keys to cluster together. 333c3cbf1a7SSilvio Fricke 334c3cbf1a7SSilvio FrickeIt is also recommended that the index key begin with a hash of the rest of the 335c3cbf1a7SSilvio Frickekey to maximise scattering throughout keyspace. 336c3cbf1a7SSilvio Fricke 337c3cbf1a7SSilvio FrickeThe better the scattering, the wider and lower the internal tree will be. 338c3cbf1a7SSilvio Fricke 339c3cbf1a7SSilvio FrickePoor scattering isn't too much of a problem as there are shortcuts and nodes 340c3cbf1a7SSilvio Frickecan contain mixtures of leaves and metadata pointers. 341c3cbf1a7SSilvio Fricke 342c3cbf1a7SSilvio FrickeThe index key is read in chunks of machine word. Each chunk is subdivided into 343c3cbf1a7SSilvio Frickeone nibble (4 bits) per level, so on a 32-bit CPU this is good for 8 levels and 344c3cbf1a7SSilvio Frickeon a 64-bit CPU, 16 levels. Unless the scattering is really poor, it is 345c3cbf1a7SSilvio Frickeunlikely that more than one word of any particular index key will have to be 346c3cbf1a7SSilvio Frickeused. 347c3cbf1a7SSilvio Fricke 348c3cbf1a7SSilvio Fricke 349c3cbf1a7SSilvio FrickeInternal Workings 350c3cbf1a7SSilvio Fricke================= 351c3cbf1a7SSilvio Fricke 352c3cbf1a7SSilvio FrickeThe associative array data structure has an internal tree. This tree is 353c3cbf1a7SSilvio Frickeconstructed of two types of metadata blocks: nodes and shortcuts. 354c3cbf1a7SSilvio Fricke 355c3cbf1a7SSilvio FrickeA node is an array of slots. Each slot can contain one of four things: 356c3cbf1a7SSilvio Fricke 357c3cbf1a7SSilvio Fricke* A NULL pointer, indicating that the slot is empty. 358c3cbf1a7SSilvio Fricke* A pointer to an object (a leaf). 359c3cbf1a7SSilvio Fricke* A pointer to a node at the next level. 360c3cbf1a7SSilvio Fricke* A pointer to a shortcut. 361c3cbf1a7SSilvio Fricke 362c3cbf1a7SSilvio Fricke 363c3cbf1a7SSilvio FrickeBasic Internal Tree Layout 364c3cbf1a7SSilvio Fricke-------------------------- 365c3cbf1a7SSilvio Fricke 366c3cbf1a7SSilvio FrickeIgnoring shortcuts for the moment, the nodes form a multilevel tree. The index 367c3cbf1a7SSilvio Frickekey space is strictly subdivided by the nodes in the tree and nodes occur on 368c3cbf1a7SSilvio Frickefixed levels. For example:: 369c3cbf1a7SSilvio Fricke 370c3cbf1a7SSilvio Fricke Level: 0 1 2 3 371c3cbf1a7SSilvio Fricke =============== =============== =============== =============== 372c3cbf1a7SSilvio Fricke NODE D 373c3cbf1a7SSilvio Fricke NODE B NODE C +------>+---+ 374c3cbf1a7SSilvio Fricke +------>+---+ +------>+---+ | | 0 | 375c3cbf1a7SSilvio Fricke NODE A | | 0 | | | 0 | | +---+ 376c3cbf1a7SSilvio Fricke +---+ | +---+ | +---+ | : : 377c3cbf1a7SSilvio Fricke | 0 | | : : | : : | +---+ 378c3cbf1a7SSilvio Fricke +---+ | +---+ | +---+ | | f | 379c3cbf1a7SSilvio Fricke | 1 |---+ | 3 |---+ | 7 |---+ +---+ 380c3cbf1a7SSilvio Fricke +---+ +---+ +---+ 381c3cbf1a7SSilvio Fricke : : : : | 8 |---+ 382c3cbf1a7SSilvio Fricke +---+ +---+ +---+ | NODE E 383c3cbf1a7SSilvio Fricke | e |---+ | f | : : +------>+---+ 384c3cbf1a7SSilvio Fricke +---+ | +---+ +---+ | 0 | 385c3cbf1a7SSilvio Fricke | f | | | f | +---+ 386c3cbf1a7SSilvio Fricke +---+ | +---+ : : 387c3cbf1a7SSilvio Fricke | NODE F +---+ 388c3cbf1a7SSilvio Fricke +------>+---+ | f | 389c3cbf1a7SSilvio Fricke | 0 | NODE G +---+ 390c3cbf1a7SSilvio Fricke +---+ +------>+---+ 391c3cbf1a7SSilvio Fricke : : | | 0 | 392c3cbf1a7SSilvio Fricke +---+ | +---+ 393c3cbf1a7SSilvio Fricke | 6 |---+ : : 394c3cbf1a7SSilvio Fricke +---+ +---+ 395c3cbf1a7SSilvio Fricke : : | f | 396c3cbf1a7SSilvio Fricke +---+ +---+ 397c3cbf1a7SSilvio Fricke | f | 398c3cbf1a7SSilvio Fricke +---+ 399c3cbf1a7SSilvio Fricke 400c3cbf1a7SSilvio FrickeIn the above example, there are 7 nodes (A-G), each with 16 slots (0-f). 401c3cbf1a7SSilvio FrickeAssuming no other meta data nodes in the tree, the key space is divided 402c3cbf1a7SSilvio Frickethusly:: 403c3cbf1a7SSilvio Fricke 404c3cbf1a7SSilvio Fricke KEY PREFIX NODE 405c3cbf1a7SSilvio Fricke ========== ==== 406c3cbf1a7SSilvio Fricke 137* D 407c3cbf1a7SSilvio Fricke 138* E 408c3cbf1a7SSilvio Fricke 13[0-69-f]* C 409c3cbf1a7SSilvio Fricke 1[0-24-f]* B 410c3cbf1a7SSilvio Fricke e6* G 411c3cbf1a7SSilvio Fricke e[0-57-f]* F 412c3cbf1a7SSilvio Fricke [02-df]* A 413c3cbf1a7SSilvio Fricke 414c3cbf1a7SSilvio FrickeSo, for instance, keys with the following example index keys will be found in 415c3cbf1a7SSilvio Frickethe appropriate nodes:: 416c3cbf1a7SSilvio Fricke 417c3cbf1a7SSilvio Fricke INDEX KEY PREFIX NODE 418c3cbf1a7SSilvio Fricke =============== ======= ==== 419c3cbf1a7SSilvio Fricke 13694892892489 13 C 420c3cbf1a7SSilvio Fricke 13795289025897 137 D 421c3cbf1a7SSilvio Fricke 13889dde88793 138 E 422c3cbf1a7SSilvio Fricke 138bbb89003093 138 E 423c3cbf1a7SSilvio Fricke 1394879524789 12 C 424c3cbf1a7SSilvio Fricke 1458952489 1 B 425c3cbf1a7SSilvio Fricke 9431809de993ba - A 426c3cbf1a7SSilvio Fricke b4542910809cd - A 427c3cbf1a7SSilvio Fricke e5284310def98 e F 428c3cbf1a7SSilvio Fricke e68428974237 e6 G 429c3cbf1a7SSilvio Fricke e7fffcbd443 e F 430c3cbf1a7SSilvio Fricke f3842239082 - A 431c3cbf1a7SSilvio Fricke 432c3cbf1a7SSilvio FrickeTo save memory, if a node can hold all the leaves in its portion of keyspace, 433c3cbf1a7SSilvio Frickethen the node will have all those leaves in it and will not have any metadata 434c3cbf1a7SSilvio Frickepointers - even if some of those leaves would like to be in the same slot. 435c3cbf1a7SSilvio Fricke 436c3cbf1a7SSilvio FrickeA node can contain a heterogeneous mix of leaves and metadata pointers. 437c3cbf1a7SSilvio FrickeMetadata pointers must be in the slots that match their subdivisions of key 438c3cbf1a7SSilvio Frickespace. The leaves can be in any slot not occupied by a metadata pointer. It 439c3cbf1a7SSilvio Frickeis guaranteed that none of the leaves in a node will match a slot occupied by a 440c3cbf1a7SSilvio Frickemetadata pointer. If the metadata pointer is there, any leaf whose key matches 441c3cbf1a7SSilvio Frickethe metadata key prefix must be in the subtree that the metadata pointer points 442c3cbf1a7SSilvio Fricketo. 443c3cbf1a7SSilvio Fricke 444c3cbf1a7SSilvio FrickeIn the above example list of index keys, node A will contain:: 445c3cbf1a7SSilvio Fricke 446c3cbf1a7SSilvio Fricke SLOT CONTENT INDEX KEY (PREFIX) 447c3cbf1a7SSilvio Fricke ==== =============== ================== 448c3cbf1a7SSilvio Fricke 1 PTR TO NODE B 1* 449c3cbf1a7SSilvio Fricke any LEAF 9431809de993ba 450c3cbf1a7SSilvio Fricke any LEAF b4542910809cd 451c3cbf1a7SSilvio Fricke e PTR TO NODE F e* 452c3cbf1a7SSilvio Fricke any LEAF f3842239082 453c3cbf1a7SSilvio Fricke 454c3cbf1a7SSilvio Frickeand node B:: 455c3cbf1a7SSilvio Fricke 456c3cbf1a7SSilvio Fricke 3 PTR TO NODE C 13* 457c3cbf1a7SSilvio Fricke any LEAF 1458952489 458c3cbf1a7SSilvio Fricke 459c3cbf1a7SSilvio Fricke 460c3cbf1a7SSilvio FrickeShortcuts 461c3cbf1a7SSilvio Fricke--------- 462c3cbf1a7SSilvio Fricke 463c3cbf1a7SSilvio FrickeShortcuts are metadata records that jump over a piece of keyspace. A shortcut 464c3cbf1a7SSilvio Frickeis a replacement for a series of single-occupancy nodes ascending through the 465c3cbf1a7SSilvio Frickelevels. Shortcuts exist to save memory and to speed up traversal. 466c3cbf1a7SSilvio Fricke 467c3cbf1a7SSilvio FrickeIt is possible for the root of the tree to be a shortcut - say, for example, 468c3cbf1a7SSilvio Frickethe tree contains at least 17 nodes all with key prefix ``1111``. The 469c3cbf1a7SSilvio Frickeinsertion algorithm will insert a shortcut to skip over the ``1111`` keyspace 470c3cbf1a7SSilvio Frickein a single bound and get to the fourth level where these actually become 471c3cbf1a7SSilvio Frickedifferent. 472c3cbf1a7SSilvio Fricke 473c3cbf1a7SSilvio Fricke 474c3cbf1a7SSilvio FrickeSplitting And Collapsing Nodes 475c3cbf1a7SSilvio Fricke------------------------------ 476c3cbf1a7SSilvio Fricke 477c3cbf1a7SSilvio FrickeEach node has a maximum capacity of 16 leaves and metadata pointers. If the 478c3cbf1a7SSilvio Frickeinsertion algorithm finds that it is trying to insert a 17th object into a 479c3cbf1a7SSilvio Frickenode, that node will be split such that at least two leaves that have a common 480c3cbf1a7SSilvio Frickekey segment at that level end up in a separate node rooted on that slot for 481c3cbf1a7SSilvio Frickethat common key segment. 482c3cbf1a7SSilvio Fricke 483c3cbf1a7SSilvio FrickeIf the leaves in a full node and the leaf that is being inserted are 484c3cbf1a7SSilvio Frickesufficiently similar, then a shortcut will be inserted into the tree. 485c3cbf1a7SSilvio Fricke 486c3cbf1a7SSilvio FrickeWhen the number of objects in the subtree rooted at a node falls to 16 or 487c3cbf1a7SSilvio Frickefewer, then the subtree will be collapsed down to a single node - and this will 488c3cbf1a7SSilvio Frickeripple towards the root if possible. 489c3cbf1a7SSilvio Fricke 490c3cbf1a7SSilvio Fricke 491c3cbf1a7SSilvio FrickeNon-Recursive Iteration 492c3cbf1a7SSilvio Fricke----------------------- 493c3cbf1a7SSilvio Fricke 494c3cbf1a7SSilvio FrickeEach node and shortcut contains a back pointer to its parent and the number of 495c3cbf1a7SSilvio Frickeslot in that parent that points to it. None-recursive iteration uses these to 496c3cbf1a7SSilvio Frickeproceed rootwards through the tree, going to the parent node, slot N + 1 to 497c3cbf1a7SSilvio Frickemake sure progress is made without the need for a stack. 498c3cbf1a7SSilvio Fricke 499c3cbf1a7SSilvio FrickeThe backpointers, however, make simultaneous alteration and iteration tricky. 500c3cbf1a7SSilvio Fricke 501c3cbf1a7SSilvio Fricke 502c3cbf1a7SSilvio FrickeSimultaneous Alteration And Iteration 503c3cbf1a7SSilvio Fricke------------------------------------- 504c3cbf1a7SSilvio Fricke 505c3cbf1a7SSilvio FrickeThere are a number of cases to consider: 506c3cbf1a7SSilvio Fricke 507c3cbf1a7SSilvio Fricke1. Simple insert/replace. This involves simply replacing a NULL or old 508c3cbf1a7SSilvio Fricke matching leaf pointer with the pointer to the new leaf after a barrier. 509c3cbf1a7SSilvio Fricke The metadata blocks don't change otherwise. An old leaf won't be freed 510c3cbf1a7SSilvio Fricke until after the RCU grace period. 511c3cbf1a7SSilvio Fricke 512c3cbf1a7SSilvio Fricke2. Simple delete. This involves just clearing an old matching leaf. The 513c3cbf1a7SSilvio Fricke metadata blocks don't change otherwise. The old leaf won't be freed until 514c3cbf1a7SSilvio Fricke after the RCU grace period. 515c3cbf1a7SSilvio Fricke 516c3cbf1a7SSilvio Fricke3. Insertion replacing part of a subtree that we haven't yet entered. This 517c3cbf1a7SSilvio Fricke may involve replacement of part of that subtree - but that won't affect 518c3cbf1a7SSilvio Fricke the iteration as we won't have reached the pointer to it yet and the 519c3cbf1a7SSilvio Fricke ancestry blocks are not replaced (the layout of those does not change). 520c3cbf1a7SSilvio Fricke 521c3cbf1a7SSilvio Fricke4. Insertion replacing nodes that we're actively processing. This isn't a 522c3cbf1a7SSilvio Fricke problem as we've passed the anchoring pointer and won't switch onto the 523c3cbf1a7SSilvio Fricke new layout until we follow the back pointers - at which point we've 524c3cbf1a7SSilvio Fricke already examined the leaves in the replaced node (we iterate over all the 525c3cbf1a7SSilvio Fricke leaves in a node before following any of its metadata pointers). 526c3cbf1a7SSilvio Fricke 527c3cbf1a7SSilvio Fricke We might, however, re-see some leaves that have been split out into a new 528c3cbf1a7SSilvio Fricke branch that's in a slot further along than we were at. 529c3cbf1a7SSilvio Fricke 530c3cbf1a7SSilvio Fricke5. Insertion replacing nodes that we're processing a dependent branch of. 531c3cbf1a7SSilvio Fricke This won't affect us until we follow the back pointers. Similar to (4). 532c3cbf1a7SSilvio Fricke 533c3cbf1a7SSilvio Fricke6. Deletion collapsing a branch under us. This doesn't affect us because the 534c3cbf1a7SSilvio Fricke back pointers will get us back to the parent of the new node before we 535c3cbf1a7SSilvio Fricke could see the new node. The entire collapsed subtree is thrown away 536c3cbf1a7SSilvio Fricke unchanged - and will still be rooted on the same slot, so we shouldn't 537c3cbf1a7SSilvio Fricke process it a second time as we'll go back to slot + 1. 538c3cbf1a7SSilvio Fricke 539c3cbf1a7SSilvio Fricke.. note:: 540c3cbf1a7SSilvio Fricke 541c3cbf1a7SSilvio Fricke Under some circumstances, we need to simultaneously change the parent 542c3cbf1a7SSilvio Fricke pointer and the parent slot pointer on a node (say, for example, we 543c3cbf1a7SSilvio Fricke inserted another node before it and moved it up a level). We cannot do 544c3cbf1a7SSilvio Fricke this without locking against a read - so we have to replace that node too. 545c3cbf1a7SSilvio Fricke 546c3cbf1a7SSilvio Fricke However, when we're changing a shortcut into a node this isn't a problem 547c3cbf1a7SSilvio Fricke as shortcuts only have one slot and so the parent slot number isn't used 548c3cbf1a7SSilvio Fricke when traversing backwards over one. This means that it's okay to change 549c3cbf1a7SSilvio Fricke the slot number first - provided suitable barriers are used to make sure 550c3cbf1a7SSilvio Fricke the parent slot number is read after the back pointer. 551c3cbf1a7SSilvio Fricke 552c3cbf1a7SSilvio FrickeObsolete blocks and leaves are freed up after an RCU grace period has passed, 553c3cbf1a7SSilvio Frickeso as long as anyone doing walking or iteration holds the RCU read lock, the 554c3cbf1a7SSilvio Frickeold superstructure should not go away on them. 555