1539f2fefSdrhIN-Operator Implementation Notes 2539f2fefSdrh================================ 3539f2fefSdrh 4539f2fefSdrh## Definitions: 5539f2fefSdrh 6539f2fefSdrhAn IN operator has one of the following formats: 7539f2fefSdrh 8539f2fefSdrh> 9*e347d3e8Sdrh x IN (y1,y2,y3,...,yN) 10539f2fefSdrh x IN (subquery) 11539f2fefSdrh 12539f2fefSdrhThe "x" is referred to as the LHS (left-hand side). The list or subquery 13539f2fefSdrhon the right is called the RHS (right-hand side). If the RHS is a list 14539f2fefSdrhit must be a non-empty list. But if the RHS is a subquery, it can be an 15539f2fefSdrhempty set. 16539f2fefSdrh 17*e347d3e8SdrhThe LHS can be a scalar (a single quantity) or a vector (a list of 18*e347d3e8Sdrhtwo or or more values) or a subquery that returns one or more columns. 19*e347d3e8SdrhWe use the term "vector" to mean an actually list of values or a 20*e347d3e8Sdrhsubquery that returns two or more columns. An isolated value or 21*e347d3e8Sdrha subquery that returns a single columns is called a scalar. 22*e347d3e8Sdrh 23*e347d3e8SdrhThe RHS can be a subquery that returns a single column, a subquery 24*e347d3e8Sdrhthat returns two or more columns, or a list of scalars. It is not 25*e347d3e8Sdrhcurrently support for the RHS to be a list of vectors. 26*e347d3e8Sdrh 27*e347d3e8SdrhThe number of columns for LHS must match the number of columns for 28*e347d3e8Sdrhthe RHS. If the RHS is a list of values, then the LHS must be a 29*e347d3e8Sdrhscalar. If the RHS is a subquery returning N columns, then the LHS 30*e347d3e8Sdrhmust be a vector of size N. 31539f2fefSdrh 32539f2fefSdrhNULL values can occur in either or both of the LHS and RHS. 33539f2fefSdrhIf the LHS contains only 34539f2fefSdrhNULL values then we say that it is a "total-NULL". If the LHS contains 35539f2fefSdrhsome NULL values and some non-NULL values, then it is a "partial-NULL". 36539f2fefSdrhFor a scalar, there is no difference between a partial-NULL and a total-NULL. 37539f2fefSdrhThe RHS is a partial-NULL if any row contains a NULL value. The RHS is 38539f2fefSdrha total-NULL if it contains one or more rows that contain only NULL values. 39539f2fefSdrhThe LHS is called "non-NULL" if it contains no NULL values. The RHS is 40539f2fefSdrhcalled "non-NULL" if it contains no NULL values in any row. 41539f2fefSdrh 42539f2fefSdrhThe result of an IN operator is one of TRUE, FALSE, or NULL. A NULL result 43539f2fefSdrhmeans that it cannot be determined if the LHS is contained in the RHS due 44539f2fefSdrhto the presence of NULL values. In some contexts (for example, when the IN 45539f2fefSdrhoperator occurs in a WHERE clause) 46539f2fefSdrhthe system only needs a binary result: TRUE or NOT-TRUE. One can also 47539f2fefSdrhto define a binary result of FALSE and NOT-FALSE, but 48539f2fefSdrhit turns out that no extra optimizations are possible in that case, so if 49539f2fefSdrhthe FALSE/NOT-FALSE binary is needed, we have to compute the three-state 50539f2fefSdrhTRUE/FALSE/NULL result and then combine the TRUE and NULL values into 51539f2fefSdrhNOT-FALSE. 52539f2fefSdrh 53539f2fefSdrhA "NOT IN" operator is computed by first computing the equivalent IN 54539f2fefSdrhoperator, then interchanging the TRUE and FALSE results. 55539f2fefSdrh 56539f2fefSdrh## Simple Full-Scan Algorithm 57539f2fefSdrh 58539f2fefSdrhThe following algorithm always compute the correct answer. However, this 59539f2fefSdrhalgorithm is suboptimal, especially if there are many rows on the RHS. 60539f2fefSdrh 61539f2fefSdrh 1. Set the null-flag to false 62539f2fefSdrh 2. For each row in the RHS: 63539f2fefSdrh <ol type='a'> 64539f2fefSdrh <li> Compare the LHS against the RHS 65539f2fefSdrh <li> If the LHS exactly matches the RHS, immediately return TRUE 66539f2fefSdrh <li> If the comparison result is NULL, set the null-flag to true 67539f2fefSdrh </ol> 68539f2fefSdrh 3. If the null-flag is true, return NULL. 69539f2fefSdrh 4. Return FALSE 70539f2fefSdrh 71539f2fefSdrh## Optimized Algorithm 72539f2fefSdrh 73539f2fefSdrhThe following procedure computes the same answer as the simple full-scan 74539f2fefSdrhalgorithm, though it does so with less work in the common case. This 751373c3a8Sdrhis the algorithm that is implemented in SQLite. 76539f2fefSdrh 77539f2fefSdrh 1. If the RHS is a constant list of length 1 or 2, then rewrite the 78539f2fefSdrh IN operator as a simple expression. Implement 79539f2fefSdrh 80539f2fefSdrh x IN (y1,y2) 81539f2fefSdrh 82539f2fefSdrh as if it were 83539f2fefSdrh 84539f2fefSdrh x=y1 OR x=y2 85539f2fefSdrh 86539f2fefSdrh This is the INDEX_NOOP optimization and is only undertaken if the 87539f2fefSdrh IN operator is used for membership testing. If the IN operator is 88539f2fefSdrh driving a loop, then skip this step entirely. 89539f2fefSdrh 901373c3a8Sdrh 2. Check the LHS to see if it is a partial-NULL and if it is, jump 910a1082aeSdrh ahead to step 5. 92539f2fefSdrh 930a1082aeSdrh 3. Do a binary search of the RHS using the LHS as a probe. If 941373c3a8Sdrh an exact match is found, return TRUE. 95539f2fefSdrh 960a1082aeSdrh 4. If the RHS is non-NULL then return FALSE. 97539f2fefSdrh 98*e347d3e8Sdrh 5. If we do not need to distinguish between FALSE and NULL, 990a1082aeSdrh then return FALSE. 100539f2fefSdrh 101539f2fefSdrh 6. For each row in the RHS, compare that row against the LHS and 1021373c3a8Sdrh if the result is NULL, immediately return NULL. In the case 1031373c3a8Sdrh of a scalar IN operator, we only need to look at the very first 1041373c3a8Sdrh row the RHS because for a scalar RHS, all NULLs will always come 1051373c3a8Sdrh first. If the RHS is empty, this step is a no-op. 106539f2fefSdrh 107539f2fefSdrh 7. Return FALSE. 108