xref: /sqlite-3.40.0/src/in-operator.md (revision e347d3e8)
1539f2fefSdrhIN-Operator Implementation Notes
2539f2fefSdrh================================
3539f2fefSdrh
4539f2fefSdrh## Definitions:
5539f2fefSdrh
6539f2fefSdrhAn IN operator has one of the following formats:
7539f2fefSdrh
8539f2fefSdrh>
9*e347d3e8Sdrh     x IN (y1,y2,y3,...,yN)
10539f2fefSdrh     x IN (subquery)
11539f2fefSdrh
12539f2fefSdrhThe "x" is referred to as the LHS (left-hand side).  The list or subquery
13539f2fefSdrhon the right is called the RHS (right-hand side).  If the RHS is a list
14539f2fefSdrhit must be a non-empty list.  But if the RHS is a subquery, it can be an
15539f2fefSdrhempty set.
16539f2fefSdrh
17*e347d3e8SdrhThe LHS can be a scalar (a single quantity) or a vector (a list of
18*e347d3e8Sdrhtwo or or more values) or a subquery that returns one or more columns.
19*e347d3e8SdrhWe use the term "vector" to mean an actually list of values or a
20*e347d3e8Sdrhsubquery that returns two or more columns.  An isolated value or
21*e347d3e8Sdrha subquery that returns a single columns is called a scalar.
22*e347d3e8Sdrh
23*e347d3e8SdrhThe RHS can be a subquery that returns a single column, a subquery
24*e347d3e8Sdrhthat returns two or more columns, or a list of scalars.  It is not
25*e347d3e8Sdrhcurrently support for the RHS to be a list of vectors.
26*e347d3e8Sdrh
27*e347d3e8SdrhThe number of columns for LHS must match the number of columns for
28*e347d3e8Sdrhthe RHS.  If the RHS is a list of values, then the LHS must be a
29*e347d3e8Sdrhscalar.  If the RHS is a subquery returning N columns, then the LHS
30*e347d3e8Sdrhmust be a vector of size N.
31539f2fefSdrh
32539f2fefSdrhNULL values can occur in either or both of the LHS and RHS.
33539f2fefSdrhIf the LHS contains only
34539f2fefSdrhNULL values then we say that it is a "total-NULL".  If the LHS contains
35539f2fefSdrhsome NULL values and some non-NULL values, then it is a "partial-NULL".
36539f2fefSdrhFor a scalar, there is no difference between a partial-NULL and a total-NULL.
37539f2fefSdrhThe RHS is a partial-NULL if any row contains a NULL value.  The RHS is
38539f2fefSdrha total-NULL if it contains one or more rows that contain only NULL values.
39539f2fefSdrhThe LHS is called "non-NULL" if it contains no NULL values.  The RHS is
40539f2fefSdrhcalled "non-NULL" if it contains no NULL values in any row.
41539f2fefSdrh
42539f2fefSdrhThe result of an IN operator is one of TRUE, FALSE, or NULL.  A NULL result
43539f2fefSdrhmeans that it cannot be determined if the LHS is contained in the RHS due
44539f2fefSdrhto the presence of NULL values.  In some contexts (for example, when the IN
45539f2fefSdrhoperator occurs in a WHERE clause)
46539f2fefSdrhthe system only needs a binary result: TRUE or NOT-TRUE.  One can also
47539f2fefSdrhto define a binary result of FALSE and NOT-FALSE, but
48539f2fefSdrhit turns out that no extra optimizations are possible in that case, so if
49539f2fefSdrhthe FALSE/NOT-FALSE binary is needed, we have to compute the three-state
50539f2fefSdrhTRUE/FALSE/NULL result and then combine the TRUE and NULL values into
51539f2fefSdrhNOT-FALSE.
52539f2fefSdrh
53539f2fefSdrhA "NOT IN" operator is computed by first computing the equivalent IN
54539f2fefSdrhoperator, then interchanging the TRUE and FALSE results.
55539f2fefSdrh
56539f2fefSdrh## Simple Full-Scan Algorithm
57539f2fefSdrh
58539f2fefSdrhThe following algorithm always compute the correct answer.  However, this
59539f2fefSdrhalgorithm is suboptimal, especially if there are many rows on the RHS.
60539f2fefSdrh
61539f2fefSdrh  1.  Set the null-flag to false
62539f2fefSdrh  2.  For each row in the RHS:
63539f2fefSdrh      <ol type='a'>
64539f2fefSdrh      <li>  Compare the LHS against the RHS
65539f2fefSdrh      <li>  If the LHS exactly matches the RHS, immediately return TRUE
66539f2fefSdrh      <li>  If the comparison result is NULL, set the null-flag to true
67539f2fefSdrh      </ol>
68539f2fefSdrh  3.  If the null-flag is true, return NULL.
69539f2fefSdrh  4.  Return FALSE
70539f2fefSdrh
71539f2fefSdrh## Optimized Algorithm
72539f2fefSdrh
73539f2fefSdrhThe following procedure computes the same answer as the simple full-scan
74539f2fefSdrhalgorithm, though it does so with less work in the common case.  This
751373c3a8Sdrhis the algorithm that is implemented in SQLite.
76539f2fefSdrh
77539f2fefSdrh  1.  If the RHS is a constant list of length 1 or 2, then rewrite the
78539f2fefSdrh      IN operator as a simple expression.  Implement
79539f2fefSdrh
80539f2fefSdrh            x IN (y1,y2)
81539f2fefSdrh
82539f2fefSdrh      as if it were
83539f2fefSdrh
84539f2fefSdrh            x=y1 OR x=y2
85539f2fefSdrh
86539f2fefSdrh      This is the INDEX_NOOP optimization and is only undertaken if the
87539f2fefSdrh      IN operator is used for membership testing.  If the IN operator is
88539f2fefSdrh      driving a loop, then skip this step entirely.
89539f2fefSdrh
901373c3a8Sdrh  2.  Check the LHS to see if it is a partial-NULL and if it is, jump
910a1082aeSdrh      ahead to step 5.
92539f2fefSdrh
930a1082aeSdrh  3.  Do a binary search of the RHS using the LHS as a probe.  If
941373c3a8Sdrh      an exact match is found, return TRUE.
95539f2fefSdrh
960a1082aeSdrh  4.  If the RHS is non-NULL then return FALSE.
97539f2fefSdrh
98*e347d3e8Sdrh  5.  If we do not need to distinguish between FALSE and NULL,
990a1082aeSdrh      then return FALSE.
100539f2fefSdrh
101539f2fefSdrh  6.  For each row in the RHS, compare that row against the LHS and
1021373c3a8Sdrh      if the result is NULL, immediately return NULL.  In the case
1031373c3a8Sdrh      of a scalar IN operator, we only need to look at the very first
1041373c3a8Sdrh      row the RHS because for a scalar RHS, all NULLs will always come
1051373c3a8Sdrh      first.  If the RHS is empty, this step is a no-op.
106539f2fefSdrh
107539f2fefSdrh  7.  Return FALSE.
108