1c3cbf1a7SSilvio Fricke========================================
2c3cbf1a7SSilvio FrickeGeneric Associative Array Implementation
3c3cbf1a7SSilvio Fricke========================================
4c3cbf1a7SSilvio Fricke
5c3cbf1a7SSilvio FrickeOverview
6c3cbf1a7SSilvio Fricke========
7c3cbf1a7SSilvio Fricke
8c3cbf1a7SSilvio FrickeThis associative array implementation is an object container with the following
9c3cbf1a7SSilvio Frickeproperties:
10c3cbf1a7SSilvio Fricke
11c3cbf1a7SSilvio Fricke1. Objects are opaque pointers.  The implementation does not care where they
12c3cbf1a7SSilvio Fricke   point (if anywhere) or what they point to (if anything).
13e3bb40c0SMarkus Heiser
14e3bb40c0SMarkus Heiser   .. note::
15e3bb40c0SMarkus Heiser
16e3bb40c0SMarkus Heiser      Pointers to objects _must_ be zero in the least significant bit.
17c3cbf1a7SSilvio Fricke
18c3cbf1a7SSilvio Fricke2. Objects do not need to contain linkage blocks for use by the array.  This
19c3cbf1a7SSilvio Fricke   permits an object to be located in multiple arrays simultaneously.
20c3cbf1a7SSilvio Fricke   Rather, the array is made up of metadata blocks that point to objects.
21c3cbf1a7SSilvio Fricke
22c3cbf1a7SSilvio Fricke3. Objects require index keys to locate them within the array.
23c3cbf1a7SSilvio Fricke
24c3cbf1a7SSilvio Fricke4. Index keys must be unique.  Inserting an object with the same key as one
25c3cbf1a7SSilvio Fricke   already in the array will replace the old object.
26c3cbf1a7SSilvio Fricke
27c3cbf1a7SSilvio Fricke5. Index keys can be of any length and can be of different lengths.
28c3cbf1a7SSilvio Fricke
29c3cbf1a7SSilvio Fricke6. Index keys should encode the length early on, before any variation due to
30c3cbf1a7SSilvio Fricke   length is seen.
31c3cbf1a7SSilvio Fricke
32c3cbf1a7SSilvio Fricke7. Index keys can include a hash to scatter objects throughout the array.
33c3cbf1a7SSilvio Fricke
34c3cbf1a7SSilvio Fricke8. The array can iterated over.  The objects will not necessarily come out in
35c3cbf1a7SSilvio Fricke   key order.
36c3cbf1a7SSilvio Fricke
37*806654a9SWill Deacon9. The array can be iterated over while it is being modified, provided the
38c3cbf1a7SSilvio Fricke   RCU readlock is being held by the iterator.  Note, however, under these
39c3cbf1a7SSilvio Fricke   circumstances, some objects may be seen more than once.  If this is a
40c3cbf1a7SSilvio Fricke   problem, the iterator should lock against modification.  Objects will not
41c3cbf1a7SSilvio Fricke   be missed, however, unless deleted.
42c3cbf1a7SSilvio Fricke
43c3cbf1a7SSilvio Fricke10. Objects in the array can be looked up by means of their index key.
44c3cbf1a7SSilvio Fricke
45*806654a9SWill Deacon11. Objects can be looked up while the array is being modified, provided the
46c3cbf1a7SSilvio Fricke    RCU readlock is being held by the thread doing the look up.
47c3cbf1a7SSilvio Fricke
48c3cbf1a7SSilvio FrickeThe implementation uses a tree of 16-pointer nodes internally that are indexed
49c3cbf1a7SSilvio Frickeon each level by nibbles from the index key in the same manner as in a radix
50c3cbf1a7SSilvio Fricketree.  To improve memory efficiency, shortcuts can be emplaced to skip over
51c3cbf1a7SSilvio Frickewhat would otherwise be a series of single-occupancy nodes.  Further, nodes
52c3cbf1a7SSilvio Frickepack leaf object pointers into spare space in the node rather than making an
53c3cbf1a7SSilvio Frickeextra branch until as such time an object needs to be added to a full node.
54c3cbf1a7SSilvio Fricke
55c3cbf1a7SSilvio Fricke
56c3cbf1a7SSilvio FrickeThe Public API
57c3cbf1a7SSilvio Fricke==============
58c3cbf1a7SSilvio Fricke
59c3cbf1a7SSilvio FrickeThe public API can be found in ``<linux/assoc_array.h>``.  The associative
60c3cbf1a7SSilvio Frickearray is rooted on the following structure::
61c3cbf1a7SSilvio Fricke
62c3cbf1a7SSilvio Fricke    struct assoc_array {
63c3cbf1a7SSilvio Fricke            ...
64c3cbf1a7SSilvio Fricke    };
65c3cbf1a7SSilvio Fricke
66c3cbf1a7SSilvio FrickeThe code is selected by enabling ``CONFIG_ASSOCIATIVE_ARRAY`` with::
67c3cbf1a7SSilvio Fricke
68c3cbf1a7SSilvio Fricke    ./script/config -e ASSOCIATIVE_ARRAY
69c3cbf1a7SSilvio Fricke
70c3cbf1a7SSilvio Fricke
71c3cbf1a7SSilvio FrickeEdit Script
72c3cbf1a7SSilvio Fricke-----------
73c3cbf1a7SSilvio Fricke
74c3cbf1a7SSilvio FrickeThe insertion and deletion functions produce an 'edit script' that can later be
75c3cbf1a7SSilvio Frickeapplied to effect the changes without risking ``ENOMEM``. This retains the
76c3cbf1a7SSilvio Frickepreallocated metadata blocks that will be installed in the internal tree and
77c3cbf1a7SSilvio Frickekeeps track of the metadata blocks that will be removed from the tree when the
78c3cbf1a7SSilvio Frickescript is applied.
79c3cbf1a7SSilvio Fricke
80c3cbf1a7SSilvio FrickeThis is also used to keep track of dead blocks and dead objects after the
81c3cbf1a7SSilvio Frickescript has been applied so that they can be freed later.  The freeing is done
82c3cbf1a7SSilvio Frickeafter an RCU grace period has passed - thus allowing access functions to
83c3cbf1a7SSilvio Frickeproceed under the RCU read lock.
84c3cbf1a7SSilvio Fricke
85c3cbf1a7SSilvio FrickeThe script appears as outside of the API as a pointer of the type::
86c3cbf1a7SSilvio Fricke
87c3cbf1a7SSilvio Fricke    struct assoc_array_edit;
88c3cbf1a7SSilvio Fricke
89c3cbf1a7SSilvio FrickeThere are two functions for dealing with the script:
90c3cbf1a7SSilvio Fricke
91c3cbf1a7SSilvio Fricke1. Apply an edit script::
92c3cbf1a7SSilvio Fricke
93c3cbf1a7SSilvio Fricke    void assoc_array_apply_edit(struct assoc_array_edit *edit);
94c3cbf1a7SSilvio Fricke
95c3cbf1a7SSilvio FrickeThis will perform the edit functions, interpolating various write barriers
96c3cbf1a7SSilvio Fricketo permit accesses under the RCU read lock to continue.  The edit script
97c3cbf1a7SSilvio Frickewill then be passed to ``call_rcu()`` to free it and any dead stuff it points
98c3cbf1a7SSilvio Fricketo.
99c3cbf1a7SSilvio Fricke
100c3cbf1a7SSilvio Fricke2. Cancel an edit script::
101c3cbf1a7SSilvio Fricke
102c3cbf1a7SSilvio Fricke    void assoc_array_cancel_edit(struct assoc_array_edit *edit);
103c3cbf1a7SSilvio Fricke
104c3cbf1a7SSilvio FrickeThis frees the edit script and all preallocated memory immediately. If
105c3cbf1a7SSilvio Frickethis was for insertion, the new object is _not_ released by this function,
106c3cbf1a7SSilvio Frickebut must rather be released by the caller.
107c3cbf1a7SSilvio Fricke
108c3cbf1a7SSilvio FrickeThese functions are guaranteed not to fail.
109c3cbf1a7SSilvio Fricke
110c3cbf1a7SSilvio Fricke
111c3cbf1a7SSilvio FrickeOperations Table
112c3cbf1a7SSilvio Fricke----------------
113c3cbf1a7SSilvio Fricke
114c3cbf1a7SSilvio FrickeVarious functions take a table of operations::
115c3cbf1a7SSilvio Fricke
116c3cbf1a7SSilvio Fricke    struct assoc_array_ops {
117c3cbf1a7SSilvio Fricke            ...
118c3cbf1a7SSilvio Fricke    };
119c3cbf1a7SSilvio Fricke
120c3cbf1a7SSilvio FrickeThis points to a number of methods, all of which need to be provided:
121c3cbf1a7SSilvio Fricke
122c3cbf1a7SSilvio Fricke1. Get a chunk of index key from caller data::
123c3cbf1a7SSilvio Fricke
124c3cbf1a7SSilvio Fricke    unsigned long (*get_key_chunk)(const void *index_key, int level);
125c3cbf1a7SSilvio Fricke
126c3cbf1a7SSilvio FrickeThis should return a chunk of caller-supplied index key starting at the
127c3cbf1a7SSilvio Fricke*bit* position given by the level argument.  The level argument will be a
128c3cbf1a7SSilvio Frickemultiple of ``ASSOC_ARRAY_KEY_CHUNK_SIZE`` and the function should return
129c3cbf1a7SSilvio Fricke``ASSOC_ARRAY_KEY_CHUNK_SIZE bits``.  No error is possible.
130c3cbf1a7SSilvio Fricke
131c3cbf1a7SSilvio Fricke
132c3cbf1a7SSilvio Fricke2. Get a chunk of an object's index key::
133c3cbf1a7SSilvio Fricke
134c3cbf1a7SSilvio Fricke    unsigned long (*get_object_key_chunk)(const void *object, int level);
135c3cbf1a7SSilvio Fricke
136c3cbf1a7SSilvio FrickeAs the previous function, but gets its data from an object in the array
137c3cbf1a7SSilvio Frickerather than from a caller-supplied index key.
138c3cbf1a7SSilvio Fricke
139c3cbf1a7SSilvio Fricke
140c3cbf1a7SSilvio Fricke3. See if this is the object we're looking for::
141c3cbf1a7SSilvio Fricke
142c3cbf1a7SSilvio Fricke    bool (*compare_object)(const void *object, const void *index_key);
143c3cbf1a7SSilvio Fricke
144c3cbf1a7SSilvio FrickeCompare the object against an index key and return ``true`` if it matches and
145c3cbf1a7SSilvio Fricke``false`` if it doesn't.
146c3cbf1a7SSilvio Fricke
147c3cbf1a7SSilvio Fricke
148c3cbf1a7SSilvio Fricke4. Diff the index keys of two objects::
149c3cbf1a7SSilvio Fricke
150c3cbf1a7SSilvio Fricke    int (*diff_objects)(const void *object, const void *index_key);
151c3cbf1a7SSilvio Fricke
152c3cbf1a7SSilvio FrickeReturn the bit position at which the index key of the specified object
153c3cbf1a7SSilvio Frickediffers from the given index key or -1 if they are the same.
154c3cbf1a7SSilvio Fricke
155c3cbf1a7SSilvio Fricke
156c3cbf1a7SSilvio Fricke5. Free an object::
157c3cbf1a7SSilvio Fricke
158c3cbf1a7SSilvio Fricke    void (*free_object)(void *object);
159c3cbf1a7SSilvio Fricke
160c3cbf1a7SSilvio FrickeFree the specified object.  Note that this may be called an RCU grace period
161c3cbf1a7SSilvio Frickeafter ``assoc_array_apply_edit()`` was called, so ``synchronize_rcu()`` may be
162c3cbf1a7SSilvio Frickenecessary on module unloading.
163c3cbf1a7SSilvio Fricke
164c3cbf1a7SSilvio Fricke
165c3cbf1a7SSilvio FrickeManipulation Functions
166c3cbf1a7SSilvio Fricke----------------------
167c3cbf1a7SSilvio Fricke
168c3cbf1a7SSilvio FrickeThere are a number of functions for manipulating an associative array:
169c3cbf1a7SSilvio Fricke
170c3cbf1a7SSilvio Fricke1. Initialise an associative array::
171c3cbf1a7SSilvio Fricke
172c3cbf1a7SSilvio Fricke    void assoc_array_init(struct assoc_array *array);
173c3cbf1a7SSilvio Fricke
174c3cbf1a7SSilvio FrickeThis initialises the base structure for an associative array.  It can't fail.
175c3cbf1a7SSilvio Fricke
176c3cbf1a7SSilvio Fricke
177c3cbf1a7SSilvio Fricke2. Insert/replace an object in an associative array::
178c3cbf1a7SSilvio Fricke
179c3cbf1a7SSilvio Fricke    struct assoc_array_edit *
180c3cbf1a7SSilvio Fricke    assoc_array_insert(struct assoc_array *array,
181c3cbf1a7SSilvio Fricke                       const struct assoc_array_ops *ops,
182c3cbf1a7SSilvio Fricke                       const void *index_key,
183c3cbf1a7SSilvio Fricke                       void *object);
184c3cbf1a7SSilvio Fricke
185c3cbf1a7SSilvio FrickeThis inserts the given object into the array.  Note that the least
186c3cbf1a7SSilvio Frickesignificant bit of the pointer must be zero as it's used to type-mark
187c3cbf1a7SSilvio Frickepointers internally.
188c3cbf1a7SSilvio Fricke
189c3cbf1a7SSilvio FrickeIf an object already exists for that key then it will be replaced with the
190c3cbf1a7SSilvio Frickenew object and the old one will be freed automatically.
191c3cbf1a7SSilvio Fricke
192c3cbf1a7SSilvio FrickeThe ``index_key`` argument should hold index key information and is
193c3cbf1a7SSilvio Frickepassed to the methods in the ops table when they are called.
194c3cbf1a7SSilvio Fricke
195c3cbf1a7SSilvio FrickeThis function makes no alteration to the array itself, but rather returns
196c3cbf1a7SSilvio Frickean edit script that must be applied.  ``-ENOMEM`` is returned in the case of
197c3cbf1a7SSilvio Frickean out-of-memory error.
198c3cbf1a7SSilvio Fricke
199c3cbf1a7SSilvio FrickeThe caller should lock exclusively against other modifiers of the array.
200c3cbf1a7SSilvio Fricke
201c3cbf1a7SSilvio Fricke
202c3cbf1a7SSilvio Fricke3. Delete an object from an associative array::
203c3cbf1a7SSilvio Fricke
204c3cbf1a7SSilvio Fricke    struct assoc_array_edit *
205c3cbf1a7SSilvio Fricke    assoc_array_delete(struct assoc_array *array,
206c3cbf1a7SSilvio Fricke                       const struct assoc_array_ops *ops,
207c3cbf1a7SSilvio Fricke                       const void *index_key);
208c3cbf1a7SSilvio Fricke
209c3cbf1a7SSilvio FrickeThis deletes an object that matches the specified data from the array.
210c3cbf1a7SSilvio Fricke
211c3cbf1a7SSilvio FrickeThe ``index_key`` argument should hold index key information and is
212c3cbf1a7SSilvio Frickepassed to the methods in the ops table when they are called.
213c3cbf1a7SSilvio Fricke
214c3cbf1a7SSilvio FrickeThis function makes no alteration to the array itself, but rather returns
215c3cbf1a7SSilvio Frickean edit script that must be applied.  ``-ENOMEM`` is returned in the case of
216c3cbf1a7SSilvio Frickean out-of-memory error.  ``NULL`` will be returned if the specified object is
217c3cbf1a7SSilvio Frickenot found within the array.
218c3cbf1a7SSilvio Fricke
219c3cbf1a7SSilvio FrickeThe caller should lock exclusively against other modifiers of the array.
220c3cbf1a7SSilvio Fricke
221c3cbf1a7SSilvio Fricke
222c3cbf1a7SSilvio Fricke4. Delete all objects from an associative array::
223c3cbf1a7SSilvio Fricke
224c3cbf1a7SSilvio Fricke    struct assoc_array_edit *
225c3cbf1a7SSilvio Fricke    assoc_array_clear(struct assoc_array *array,
226c3cbf1a7SSilvio Fricke                      const struct assoc_array_ops *ops);
227c3cbf1a7SSilvio Fricke
228c3cbf1a7SSilvio FrickeThis deletes all the objects from an associative array and leaves it
229c3cbf1a7SSilvio Frickecompletely empty.
230c3cbf1a7SSilvio Fricke
231c3cbf1a7SSilvio FrickeThis function makes no alteration to the array itself, but rather returns
232c3cbf1a7SSilvio Frickean edit script that must be applied.  ``-ENOMEM`` is returned in the case of
233c3cbf1a7SSilvio Frickean out-of-memory error.
234c3cbf1a7SSilvio Fricke
235c3cbf1a7SSilvio FrickeThe caller should lock exclusively against other modifiers of the array.
236c3cbf1a7SSilvio Fricke
237c3cbf1a7SSilvio Fricke
238c3cbf1a7SSilvio Fricke5. Destroy an associative array, deleting all objects::
239c3cbf1a7SSilvio Fricke
240c3cbf1a7SSilvio Fricke    void assoc_array_destroy(struct assoc_array *array,
241c3cbf1a7SSilvio Fricke                             const struct assoc_array_ops *ops);
242c3cbf1a7SSilvio Fricke
243c3cbf1a7SSilvio FrickeThis destroys the contents of the associative array and leaves it
244c3cbf1a7SSilvio Frickecompletely empty.  It is not permitted for another thread to be traversing
245c3cbf1a7SSilvio Frickethe array under the RCU read lock at the same time as this function is
246c3cbf1a7SSilvio Frickedestroying it as no RCU deferral is performed on memory release -
247c3cbf1a7SSilvio Frickesomething that would require memory to be allocated.
248c3cbf1a7SSilvio Fricke
249c3cbf1a7SSilvio FrickeThe caller should lock exclusively against other modifiers and accessors
250c3cbf1a7SSilvio Frickeof the array.
251c3cbf1a7SSilvio Fricke
252c3cbf1a7SSilvio Fricke
253c3cbf1a7SSilvio Fricke6. Garbage collect an associative array::
254c3cbf1a7SSilvio Fricke
255c3cbf1a7SSilvio Fricke    int assoc_array_gc(struct assoc_array *array,
256c3cbf1a7SSilvio Fricke                       const struct assoc_array_ops *ops,
257c3cbf1a7SSilvio Fricke                       bool (*iterator)(void *object, void *iterator_data),
258c3cbf1a7SSilvio Fricke                       void *iterator_data);
259c3cbf1a7SSilvio Fricke
260c3cbf1a7SSilvio FrickeThis iterates over the objects in an associative array and passes each one to
261c3cbf1a7SSilvio Fricke``iterator()``.  If ``iterator()`` returns ``true``, the object is kept.  If it
262c3cbf1a7SSilvio Frickereturns ``false``, the object will be freed.  If the ``iterator()`` function
263c3cbf1a7SSilvio Frickereturns ``true``, it must perform any appropriate refcount incrementing on the
264c3cbf1a7SSilvio Frickeobject before returning.
265c3cbf1a7SSilvio Fricke
266c3cbf1a7SSilvio FrickeThe internal tree will be packed down if possible as part of the iteration
267c3cbf1a7SSilvio Fricketo reduce the number of nodes in it.
268c3cbf1a7SSilvio Fricke
269c3cbf1a7SSilvio FrickeThe ``iterator_data`` is passed directly to ``iterator()`` and is otherwise
270c3cbf1a7SSilvio Frickeignored by the function.
271c3cbf1a7SSilvio Fricke
272c3cbf1a7SSilvio FrickeThe function will return ``0`` if successful and ``-ENOMEM`` if there wasn't
273c3cbf1a7SSilvio Frickeenough memory.
274c3cbf1a7SSilvio Fricke
275c3cbf1a7SSilvio FrickeIt is possible for other threads to iterate over or search the array under
276*806654a9SWill Deaconthe RCU read lock while this function is in progress.  The caller should
277c3cbf1a7SSilvio Frickelock exclusively against other modifiers of the array.
278c3cbf1a7SSilvio Fricke
279c3cbf1a7SSilvio Fricke
280c3cbf1a7SSilvio FrickeAccess Functions
281c3cbf1a7SSilvio Fricke----------------
282c3cbf1a7SSilvio Fricke
283c3cbf1a7SSilvio FrickeThere are two functions for accessing an associative array:
284c3cbf1a7SSilvio Fricke
285c3cbf1a7SSilvio Fricke1. Iterate over all the objects in an associative array::
286c3cbf1a7SSilvio Fricke
287c3cbf1a7SSilvio Fricke    int assoc_array_iterate(const struct assoc_array *array,
288c3cbf1a7SSilvio Fricke                            int (*iterator)(const void *object,
289c3cbf1a7SSilvio Fricke                                            void *iterator_data),
290c3cbf1a7SSilvio Fricke                            void *iterator_data);
291c3cbf1a7SSilvio Fricke
292c3cbf1a7SSilvio FrickeThis passes each object in the array to the iterator callback function.
293c3cbf1a7SSilvio Fricke``iterator_data`` is private data for that function.
294c3cbf1a7SSilvio Fricke
295c3cbf1a7SSilvio FrickeThis may be used on an array at the same time as the array is being
296c3cbf1a7SSilvio Frickemodified, provided the RCU read lock is held.  Under such circumstances,
297c3cbf1a7SSilvio Frickeit is possible for the iteration function to see some objects twice.  If
298c3cbf1a7SSilvio Frickethis is a problem, then modification should be locked against.  The
299c3cbf1a7SSilvio Frickeiteration algorithm should not, however, miss any objects.
300c3cbf1a7SSilvio Fricke
301c3cbf1a7SSilvio FrickeThe function will return ``0`` if no objects were in the array or else it will
302c3cbf1a7SSilvio Frickereturn the result of the last iterator function called.  Iteration stops
303c3cbf1a7SSilvio Frickeimmediately if any call to the iteration function results in a non-zero
304c3cbf1a7SSilvio Frickereturn.
305c3cbf1a7SSilvio Fricke
306c3cbf1a7SSilvio Fricke
307c3cbf1a7SSilvio Fricke2. Find an object in an associative array::
308c3cbf1a7SSilvio Fricke
309c3cbf1a7SSilvio Fricke    void *assoc_array_find(const struct assoc_array *array,
310c3cbf1a7SSilvio Fricke                           const struct assoc_array_ops *ops,
311c3cbf1a7SSilvio Fricke                           const void *index_key);
312c3cbf1a7SSilvio Fricke
313c3cbf1a7SSilvio FrickeThis walks through the array's internal tree directly to the object
314c3cbf1a7SSilvio Frickespecified by the index key..
315c3cbf1a7SSilvio Fricke
316c3cbf1a7SSilvio FrickeThis may be used on an array at the same time as the array is being
317c3cbf1a7SSilvio Frickemodified, provided the RCU read lock is held.
318c3cbf1a7SSilvio Fricke
319c3cbf1a7SSilvio FrickeThe function will return the object if found (and set ``*_type`` to the object
320c3cbf1a7SSilvio Fricketype) or will return ``NULL`` if the object was not found.
321c3cbf1a7SSilvio Fricke
322c3cbf1a7SSilvio Fricke
323c3cbf1a7SSilvio FrickeIndex Key Form
324c3cbf1a7SSilvio Fricke--------------
325c3cbf1a7SSilvio Fricke
326c3cbf1a7SSilvio FrickeThe index key can be of any form, but since the algorithms aren't told how long
327c3cbf1a7SSilvio Frickethe key is, it is strongly recommended that the index key includes its length
328c3cbf1a7SSilvio Frickevery early on before any variation due to the length would have an effect on
329c3cbf1a7SSilvio Frickecomparisons.
330c3cbf1a7SSilvio Fricke
331c3cbf1a7SSilvio FrickeThis will cause leaves with different length keys to scatter away from each
332c3cbf1a7SSilvio Frickeother - and those with the same length keys to cluster together.
333c3cbf1a7SSilvio Fricke
334c3cbf1a7SSilvio FrickeIt is also recommended that the index key begin with a hash of the rest of the
335c3cbf1a7SSilvio Frickekey to maximise scattering throughout keyspace.
336c3cbf1a7SSilvio Fricke
337c3cbf1a7SSilvio FrickeThe better the scattering, the wider and lower the internal tree will be.
338c3cbf1a7SSilvio Fricke
339c3cbf1a7SSilvio FrickePoor scattering isn't too much of a problem as there are shortcuts and nodes
340c3cbf1a7SSilvio Frickecan contain mixtures of leaves and metadata pointers.
341c3cbf1a7SSilvio Fricke
342c3cbf1a7SSilvio FrickeThe index key is read in chunks of machine word.  Each chunk is subdivided into
343c3cbf1a7SSilvio Frickeone nibble (4 bits) per level, so on a 32-bit CPU this is good for 8 levels and
344c3cbf1a7SSilvio Frickeon a 64-bit CPU, 16 levels.  Unless the scattering is really poor, it is
345c3cbf1a7SSilvio Frickeunlikely that more than one word of any particular index key will have to be
346c3cbf1a7SSilvio Frickeused.
347c3cbf1a7SSilvio Fricke
348c3cbf1a7SSilvio Fricke
349c3cbf1a7SSilvio FrickeInternal Workings
350c3cbf1a7SSilvio Fricke=================
351c3cbf1a7SSilvio Fricke
352c3cbf1a7SSilvio FrickeThe associative array data structure has an internal tree.  This tree is
353c3cbf1a7SSilvio Frickeconstructed of two types of metadata blocks: nodes and shortcuts.
354c3cbf1a7SSilvio Fricke
355c3cbf1a7SSilvio FrickeA node is an array of slots.  Each slot can contain one of four things:
356c3cbf1a7SSilvio Fricke
357c3cbf1a7SSilvio Fricke* A NULL pointer, indicating that the slot is empty.
358c3cbf1a7SSilvio Fricke* A pointer to an object (a leaf).
359c3cbf1a7SSilvio Fricke* A pointer to a node at the next level.
360c3cbf1a7SSilvio Fricke* A pointer to a shortcut.
361c3cbf1a7SSilvio Fricke
362c3cbf1a7SSilvio Fricke
363c3cbf1a7SSilvio FrickeBasic Internal Tree Layout
364c3cbf1a7SSilvio Fricke--------------------------
365c3cbf1a7SSilvio Fricke
366c3cbf1a7SSilvio FrickeIgnoring shortcuts for the moment, the nodes form a multilevel tree.  The index
367c3cbf1a7SSilvio Frickekey space is strictly subdivided by the nodes in the tree and nodes occur on
368c3cbf1a7SSilvio Frickefixed levels.  For example::
369c3cbf1a7SSilvio Fricke
370c3cbf1a7SSilvio Fricke Level: 0               1               2               3
371c3cbf1a7SSilvio Fricke        =============== =============== =============== ===============
372c3cbf1a7SSilvio Fricke                                                        NODE D
373c3cbf1a7SSilvio Fricke                        NODE B          NODE C  +------>+---+
374c3cbf1a7SSilvio Fricke                +------>+---+   +------>+---+   |       | 0 |
375c3cbf1a7SSilvio Fricke        NODE A  |       | 0 |   |       | 0 |   |       +---+
376c3cbf1a7SSilvio Fricke        +---+   |       +---+   |       +---+   |       :   :
377c3cbf1a7SSilvio Fricke        | 0 |   |       :   :   |       :   :   |       +---+
378c3cbf1a7SSilvio Fricke        +---+   |       +---+   |       +---+   |       | f |
379c3cbf1a7SSilvio Fricke        | 1 |---+       | 3 |---+       | 7 |---+       +---+
380c3cbf1a7SSilvio Fricke        +---+           +---+           +---+
381c3cbf1a7SSilvio Fricke        :   :           :   :           | 8 |---+
382c3cbf1a7SSilvio Fricke        +---+           +---+           +---+   |       NODE E
383c3cbf1a7SSilvio Fricke        | e |---+       | f |           :   :   +------>+---+
384c3cbf1a7SSilvio Fricke        +---+   |       +---+           +---+           | 0 |
385c3cbf1a7SSilvio Fricke        | f |   |                       | f |           +---+
386c3cbf1a7SSilvio Fricke        +---+   |                       +---+           :   :
387c3cbf1a7SSilvio Fricke                |       NODE F                          +---+
388c3cbf1a7SSilvio Fricke                +------>+---+                           | f |
389c3cbf1a7SSilvio Fricke                        | 0 |           NODE G          +---+
390c3cbf1a7SSilvio Fricke                        +---+   +------>+---+
391c3cbf1a7SSilvio Fricke                        :   :   |       | 0 |
392c3cbf1a7SSilvio Fricke                        +---+   |       +---+
393c3cbf1a7SSilvio Fricke                        | 6 |---+       :   :
394c3cbf1a7SSilvio Fricke                        +---+           +---+
395c3cbf1a7SSilvio Fricke                        :   :           | f |
396c3cbf1a7SSilvio Fricke                        +---+           +---+
397c3cbf1a7SSilvio Fricke                        | f |
398c3cbf1a7SSilvio Fricke                        +---+
399c3cbf1a7SSilvio Fricke
400c3cbf1a7SSilvio FrickeIn the above example, there are 7 nodes (A-G), each with 16 slots (0-f).
401c3cbf1a7SSilvio FrickeAssuming no other meta data nodes in the tree, the key space is divided
402c3cbf1a7SSilvio Frickethusly::
403c3cbf1a7SSilvio Fricke
404c3cbf1a7SSilvio Fricke    KEY PREFIX      NODE
405c3cbf1a7SSilvio Fricke    ==========      ====
406c3cbf1a7SSilvio Fricke    137*            D
407c3cbf1a7SSilvio Fricke    138*            E
408c3cbf1a7SSilvio Fricke    13[0-69-f]*     C
409c3cbf1a7SSilvio Fricke    1[0-24-f]*      B
410c3cbf1a7SSilvio Fricke    e6*             G
411c3cbf1a7SSilvio Fricke    e[0-57-f]*      F
412c3cbf1a7SSilvio Fricke    [02-df]*        A
413c3cbf1a7SSilvio Fricke
414c3cbf1a7SSilvio FrickeSo, for instance, keys with the following example index keys will be found in
415c3cbf1a7SSilvio Frickethe appropriate nodes::
416c3cbf1a7SSilvio Fricke
417c3cbf1a7SSilvio Fricke    INDEX KEY       PREFIX  NODE
418c3cbf1a7SSilvio Fricke    =============== ======= ====
419c3cbf1a7SSilvio Fricke    13694892892489  13      C
420c3cbf1a7SSilvio Fricke    13795289025897  137     D
421c3cbf1a7SSilvio Fricke    13889dde88793   138     E
422c3cbf1a7SSilvio Fricke    138bbb89003093  138     E
423c3cbf1a7SSilvio Fricke    1394879524789   12      C
424c3cbf1a7SSilvio Fricke    1458952489      1       B
425c3cbf1a7SSilvio Fricke    9431809de993ba  -       A
426c3cbf1a7SSilvio Fricke    b4542910809cd   -       A
427c3cbf1a7SSilvio Fricke    e5284310def98   e       F
428c3cbf1a7SSilvio Fricke    e68428974237    e6      G
429c3cbf1a7SSilvio Fricke    e7fffcbd443     e       F
430c3cbf1a7SSilvio Fricke    f3842239082     -       A
431c3cbf1a7SSilvio Fricke
432c3cbf1a7SSilvio FrickeTo save memory, if a node can hold all the leaves in its portion of keyspace,
433c3cbf1a7SSilvio Frickethen the node will have all those leaves in it and will not have any metadata
434c3cbf1a7SSilvio Frickepointers - even if some of those leaves would like to be in the same slot.
435c3cbf1a7SSilvio Fricke
436c3cbf1a7SSilvio FrickeA node can contain a heterogeneous mix of leaves and metadata pointers.
437c3cbf1a7SSilvio FrickeMetadata pointers must be in the slots that match their subdivisions of key
438c3cbf1a7SSilvio Frickespace.  The leaves can be in any slot not occupied by a metadata pointer.  It
439c3cbf1a7SSilvio Frickeis guaranteed that none of the leaves in a node will match a slot occupied by a
440c3cbf1a7SSilvio Frickemetadata pointer.  If the metadata pointer is there, any leaf whose key matches
441c3cbf1a7SSilvio Frickethe metadata key prefix must be in the subtree that the metadata pointer points
442c3cbf1a7SSilvio Fricketo.
443c3cbf1a7SSilvio Fricke
444c3cbf1a7SSilvio FrickeIn the above example list of index keys, node A will contain::
445c3cbf1a7SSilvio Fricke
446c3cbf1a7SSilvio Fricke    SLOT    CONTENT         INDEX KEY (PREFIX)
447c3cbf1a7SSilvio Fricke    ====    =============== ==================
448c3cbf1a7SSilvio Fricke    1       PTR TO NODE B   1*
449c3cbf1a7SSilvio Fricke    any     LEAF            9431809de993ba
450c3cbf1a7SSilvio Fricke    any     LEAF            b4542910809cd
451c3cbf1a7SSilvio Fricke    e       PTR TO NODE F   e*
452c3cbf1a7SSilvio Fricke    any     LEAF            f3842239082
453c3cbf1a7SSilvio Fricke
454c3cbf1a7SSilvio Frickeand node B::
455c3cbf1a7SSilvio Fricke
456c3cbf1a7SSilvio Fricke    3	PTR TO NODE C	13*
457c3cbf1a7SSilvio Fricke    any	LEAF		1458952489
458c3cbf1a7SSilvio Fricke
459c3cbf1a7SSilvio Fricke
460c3cbf1a7SSilvio FrickeShortcuts
461c3cbf1a7SSilvio Fricke---------
462c3cbf1a7SSilvio Fricke
463c3cbf1a7SSilvio FrickeShortcuts are metadata records that jump over a piece of keyspace.  A shortcut
464c3cbf1a7SSilvio Frickeis a replacement for a series of single-occupancy nodes ascending through the
465c3cbf1a7SSilvio Frickelevels.  Shortcuts exist to save memory and to speed up traversal.
466c3cbf1a7SSilvio Fricke
467c3cbf1a7SSilvio FrickeIt is possible for the root of the tree to be a shortcut - say, for example,
468c3cbf1a7SSilvio Frickethe tree contains at least 17 nodes all with key prefix ``1111``.  The
469c3cbf1a7SSilvio Frickeinsertion algorithm will insert a shortcut to skip over the ``1111`` keyspace
470c3cbf1a7SSilvio Frickein a single bound and get to the fourth level where these actually become
471c3cbf1a7SSilvio Frickedifferent.
472c3cbf1a7SSilvio Fricke
473c3cbf1a7SSilvio Fricke
474c3cbf1a7SSilvio FrickeSplitting And Collapsing Nodes
475c3cbf1a7SSilvio Fricke------------------------------
476c3cbf1a7SSilvio Fricke
477c3cbf1a7SSilvio FrickeEach node has a maximum capacity of 16 leaves and metadata pointers.  If the
478c3cbf1a7SSilvio Frickeinsertion algorithm finds that it is trying to insert a 17th object into a
479c3cbf1a7SSilvio Frickenode, that node will be split such that at least two leaves that have a common
480c3cbf1a7SSilvio Frickekey segment at that level end up in a separate node rooted on that slot for
481c3cbf1a7SSilvio Frickethat common key segment.
482c3cbf1a7SSilvio Fricke
483c3cbf1a7SSilvio FrickeIf the leaves in a full node and the leaf that is being inserted are
484c3cbf1a7SSilvio Frickesufficiently similar, then a shortcut will be inserted into the tree.
485c3cbf1a7SSilvio Fricke
486c3cbf1a7SSilvio FrickeWhen the number of objects in the subtree rooted at a node falls to 16 or
487c3cbf1a7SSilvio Frickefewer, then the subtree will be collapsed down to a single node - and this will
488c3cbf1a7SSilvio Frickeripple towards the root if possible.
489c3cbf1a7SSilvio Fricke
490c3cbf1a7SSilvio Fricke
491c3cbf1a7SSilvio FrickeNon-Recursive Iteration
492c3cbf1a7SSilvio Fricke-----------------------
493c3cbf1a7SSilvio Fricke
494c3cbf1a7SSilvio FrickeEach node and shortcut contains a back pointer to its parent and the number of
495c3cbf1a7SSilvio Frickeslot in that parent that points to it.  None-recursive iteration uses these to
496c3cbf1a7SSilvio Frickeproceed rootwards through the tree, going to the parent node, slot N + 1 to
497c3cbf1a7SSilvio Frickemake sure progress is made without the need for a stack.
498c3cbf1a7SSilvio Fricke
499c3cbf1a7SSilvio FrickeThe backpointers, however, make simultaneous alteration and iteration tricky.
500c3cbf1a7SSilvio Fricke
501c3cbf1a7SSilvio Fricke
502c3cbf1a7SSilvio FrickeSimultaneous Alteration And Iteration
503c3cbf1a7SSilvio Fricke-------------------------------------
504c3cbf1a7SSilvio Fricke
505c3cbf1a7SSilvio FrickeThere are a number of cases to consider:
506c3cbf1a7SSilvio Fricke
507c3cbf1a7SSilvio Fricke1. Simple insert/replace.  This involves simply replacing a NULL or old
508c3cbf1a7SSilvio Fricke   matching leaf pointer with the pointer to the new leaf after a barrier.
509c3cbf1a7SSilvio Fricke   The metadata blocks don't change otherwise.  An old leaf won't be freed
510c3cbf1a7SSilvio Fricke   until after the RCU grace period.
511c3cbf1a7SSilvio Fricke
512c3cbf1a7SSilvio Fricke2. Simple delete.  This involves just clearing an old matching leaf.  The
513c3cbf1a7SSilvio Fricke   metadata blocks don't change otherwise.  The old leaf won't be freed until
514c3cbf1a7SSilvio Fricke   after the RCU grace period.
515c3cbf1a7SSilvio Fricke
516c3cbf1a7SSilvio Fricke3. Insertion replacing part of a subtree that we haven't yet entered.  This
517c3cbf1a7SSilvio Fricke   may involve replacement of part of that subtree - but that won't affect
518c3cbf1a7SSilvio Fricke   the iteration as we won't have reached the pointer to it yet and the
519c3cbf1a7SSilvio Fricke   ancestry blocks are not replaced (the layout of those does not change).
520c3cbf1a7SSilvio Fricke
521c3cbf1a7SSilvio Fricke4. Insertion replacing nodes that we're actively processing.  This isn't a
522c3cbf1a7SSilvio Fricke   problem as we've passed the anchoring pointer and won't switch onto the
523c3cbf1a7SSilvio Fricke   new layout until we follow the back pointers - at which point we've
524c3cbf1a7SSilvio Fricke   already examined the leaves in the replaced node (we iterate over all the
525c3cbf1a7SSilvio Fricke   leaves in a node before following any of its metadata pointers).
526c3cbf1a7SSilvio Fricke
527c3cbf1a7SSilvio Fricke   We might, however, re-see some leaves that have been split out into a new
528c3cbf1a7SSilvio Fricke   branch that's in a slot further along than we were at.
529c3cbf1a7SSilvio Fricke
530c3cbf1a7SSilvio Fricke5. Insertion replacing nodes that we're processing a dependent branch of.
531c3cbf1a7SSilvio Fricke   This won't affect us until we follow the back pointers.  Similar to (4).
532c3cbf1a7SSilvio Fricke
533c3cbf1a7SSilvio Fricke6. Deletion collapsing a branch under us.  This doesn't affect us because the
534c3cbf1a7SSilvio Fricke   back pointers will get us back to the parent of the new node before we
535c3cbf1a7SSilvio Fricke   could see the new node.  The entire collapsed subtree is thrown away
536c3cbf1a7SSilvio Fricke   unchanged - and will still be rooted on the same slot, so we shouldn't
537c3cbf1a7SSilvio Fricke   process it a second time as we'll go back to slot + 1.
538c3cbf1a7SSilvio Fricke
539c3cbf1a7SSilvio Fricke.. note::
540c3cbf1a7SSilvio Fricke
541c3cbf1a7SSilvio Fricke   Under some circumstances, we need to simultaneously change the parent
542c3cbf1a7SSilvio Fricke   pointer and the parent slot pointer on a node (say, for example, we
543c3cbf1a7SSilvio Fricke   inserted another node before it and moved it up a level).  We cannot do
544c3cbf1a7SSilvio Fricke   this without locking against a read - so we have to replace that node too.
545c3cbf1a7SSilvio Fricke
546c3cbf1a7SSilvio Fricke   However, when we're changing a shortcut into a node this isn't a problem
547c3cbf1a7SSilvio Fricke   as shortcuts only have one slot and so the parent slot number isn't used
548c3cbf1a7SSilvio Fricke   when traversing backwards over one.  This means that it's okay to change
549c3cbf1a7SSilvio Fricke   the slot number first - provided suitable barriers are used to make sure
550c3cbf1a7SSilvio Fricke   the parent slot number is read after the back pointer.
551c3cbf1a7SSilvio Fricke
552c3cbf1a7SSilvio FrickeObsolete blocks and leaves are freed up after an RCU grace period has passed,
553c3cbf1a7SSilvio Frickeso as long as anyone doing walking or iteration holds the RCU read lock, the
554c3cbf1a7SSilvio Frickeold superstructure should not go away on them.
555