xref: /sqlite-3.40.0/ext/fts2/README.tokenizers (revision 85b623f2)

1. FTS2 Tokenizers

  When creating a new full-text table, FTS2 allows the user to select
  the text tokenizer implementation to be used when indexing text
  by specifying a "tokenizer" clause as part of the CREATE VIRTUAL TABLE
  statement:

    CREATE VIRTUAL TABLE <table-name> USING fts2(
      <columns ...> [, tokenizer <tokenizer-name> [<tokenizer-args>]]
    );

  The built-in tokenizers (valid values to pass as <tokenizer-name>) are
  "simple" and "porter".

  <tokenizer-args> should consist of zero or more white-space separated
  arguments to pass to the selected tokenizer implementation. The
  interpretation of the arguments, if any, depends on the individual
  tokenizer.
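
  As an illustration of the clause syntax above, the following sketch
  formats a CREATE VIRTUAL TABLE statement with an optional tokenizer
  clause. The helper name buildFts2CreateSql() and its parameters are
  hypothetical and not part of FTS2; the names are copied verbatim with
  no quoting, so this is only suitable for trusted input.

```c
#include <stdio.h>

/*
** Hypothetical helper: format a CREATE VIRTUAL TABLE statement for an
** fts2 table into zBuf. zTokenizer may be NULL (no tokenizer clause);
** zArgs may be NULL or "" (no tokenizer arguments). Names are copied
** verbatim with no quoting or escaping.
** Returns 0 on success, or -1 if the statement does not fit in nBuf bytes.
*/
int buildFts2CreateSql(
  char *zBuf, size_t nBuf,
  const char *zTable,
  const char *zColumns,
  const char *zTokenizer,
  const char *zArgs
){
  int n;
  if( zTokenizer==0 ){
    n = snprintf(zBuf, nBuf,
        "CREATE VIRTUAL TABLE %s USING fts2(%s);", zTable, zColumns);
  }else if( zArgs==0 || zArgs[0]==0 ){
    n = snprintf(zBuf, nBuf,
        "CREATE VIRTUAL TABLE %s USING fts2(%s, tokenizer %s);",
        zTable, zColumns, zTokenizer);
  }else{
    n = snprintf(zBuf, nBuf,
        "CREATE VIRTUAL TABLE %s USING fts2(%s, tokenizer %s %s);",
        zTable, zColumns, zTokenizer, zArgs);
  }
  return (n<0 || (size_t)n>=nBuf) ? -1 : 0;
}
```

  The resulting string could then be passed to sqlite3_exec() on a
  connection where the fts2 extension is available.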

2. Custom Tokenizers

  FTS2 allows users to provide custom tokenizer implementations. The
  interface used to create a new tokenizer is defined and described in
  the fts2_tokenizer.h source file.

  Registering a new FTS2 tokenizer is similar to registering a new
  virtual table module with SQLite. The user passes a pointer to a
  structure containing pointers to various callback functions that
  make up the implementation of the new tokenizer type. For tokenizers,
  the structure (defined in fts2_tokenizer.h) is called
  "sqlite3_tokenizer_module".

  FTS2 does not expose a C function that users call to register new
  tokenizer types with a database handle. Instead, the pointer must
  be encoded as an SQL blob value and passed to FTS2 through the SQL
  engine by evaluating a special scalar function, "fts2_tokenizer()".
  The fts2_tokenizer() function may be called with one or two arguments,
  as follows:

    SELECT fts2_tokenizer(<tokenizer-name>);
    SELECT fts2_tokenizer(<tokenizer-name>, <sqlite3_tokenizer_module ptr>);

  Where <tokenizer-name> is a string identifying the tokenizer and
  <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
  structure encoded as an SQL blob. If the second argument is present,
  it is registered as tokenizer <tokenizer-name> and a copy of it is
  returned. If only one argument is passed, a pointer to the tokenizer
  implementation currently registered as <tokenizer-name> is returned,
  encoded as a blob. If no such tokenizer exists, an SQL error is
  raised.
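
  To make "a pointer encoded as an SQL blob" concrete: the blob holds the
  bytes of the pointer value itself, not a copy of the structure, so it
  is sizeof(pointer) bytes long and only meaningful inside the process
  that produced it. The sketch below shows the round trip in isolation.
  DemoModule is a hypothetical stand-in for sqlite3_tokenizer_module, and
  encodeModulePtr() and decodeModulePtr() are illustrative names, not
  FTS2 functions.

```c
#include <string.h>

/* Hypothetical stand-in for sqlite3_tokenizer_module, so that this
** sketch compiles without the FTS2 headers. */
typedef struct DemoModule { int iVersion; } DemoModule;

/* Copy the pointer value (not the structure it points to) into aBlob.
** aBlob must have room for sizeof(p) bytes. Returns the blob size. */
size_t encodeModulePtr(const DemoModule *p, unsigned char *aBlob){
  memcpy(aBlob, &p, sizeof(p));
  return sizeof(p);
}

/* Recover the pointer from an nBlob-byte blob. Returns 0 if the blob
** is not exactly the size of a pointer. */
const DemoModule *decodeModulePtr(const unsigned char *aBlob, size_t nBlob){
  const DemoModule *p = 0;
  if( nBlob==sizeof(p) ){
    memcpy(&p, aBlob, sizeof(p));
  }
  return p;
}
```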

  SECURITY: If the fts2 extension is used in an environment where potentially
    malicious users may execute arbitrary SQL (e.g. Gears), they should be
    prevented from invoking the fts2_tokenizer() function, possibly using the
    authorizer callback (see sqlite3_set_authorizer()). The function treats
    its blob argument as a raw pointer to a table of callbacks, so a crafted
    blob could direct FTS2 to an arbitrary address.

  See "Sample code" below for an example of calling the fts2_tokenizer()
  function from C code.

3. ICU Library Tokenizers

  If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor
  symbol defined, then there exists a built-in tokenizer named "icu"
  implemented using the ICU library. The first argument passed to the
  xCreate() method (see fts2_tokenizer.h) of this tokenizer may be
  an ICU locale identifier, such as "tr_TR" for Turkish as used
  in Turkey, or "en_AU" for English as used in Australia. For example:

    "CREATE VIRTUAL TABLE thai_text USING fts2(text, tokenizer icu th_TH)"

  The ICU tokenizer implementation is very simple. It splits the input
  text according to the ICU rules for finding word boundaries and discards
  any tokens that consist entirely of white-space. This may be suitable
  for some applications in some locales, but not all. If more complex
  processing is required, for example to implement stemming or
  discard punctuation, this can be done by creating a tokenizer
  implementation that uses the ICU tokenizer as part of its implementation.

  When using the ICU tokenizer this way, it is safe to overwrite the
  contents of the strings returned by the xNext() method (see
  fts2_tokenizer.h).
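
  As a rough sketch of the kind of post-processing such a wrapping
  tokenizer might do, the function below rewrites one token in place and
  reports whether it should be kept. It is an assumption-laden example:
  filterToken() is a hypothetical name, only ASCII characters are folded
  or classified, and the wrapper is assumed to have cast away the const
  qualifier on the buffer returned by the ICU tokenizer's xNext(). The
  token is described by a pointer and a byte count because xNext() does
  not return nul-terminated strings.

```c
#include <ctype.h>

/*
** Hypothetical filter step for a wrapping tokenizer. Folds ASCII
** upper-case letters to lower-case in place. Returns 1 if the token
** should be kept, or 0 if it contains no ASCII letters or digits and
** no non-ASCII bytes (i.e. only ASCII punctuation or white-space).
*/
int filterToken(char *zToken, int nToken){
  int i;
  int bKeep = 0;
  for(i=0; i<nToken; i++){
    unsigned char c = (unsigned char)zToken[i];
    if( c>=0x80 ){
      /* Part of a multi-byte UTF-8 character: leave untouched, keep. */
      bKeep = 1;
    }else if( isalnum(c) ){
      zToken[i] = (char)tolower(c);
      bKeep = 1;
    }
  }
  return bKeep;
}
```

  A wrapper's xNext() could call the ICU tokenizer's xNext() in a loop,
  skipping tokens for which this function returns 0.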

4. Sample code

  The following two code samples illustrate the way C code should invoke
  the fts2_tokenizer() scalar function:

      #include <string.h>           /* memcpy() */
      #include "sqlite3.h"
      #include "fts2_tokenizer.h"   /* sqlite3_tokenizer_module */

      /* Register module p as tokenizer zName on connection db. */
      int registerTokenizer(
        sqlite3 *db,
        const char *zName,
        const sqlite3_tokenizer_module *p
      ){
        int rc;
        sqlite3_stmt *pStmt;
        const char zSql[] = "SELECT fts2_tokenizer(?, ?)";

        rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
        if( rc!=SQLITE_OK ){
          return rc;
        }

        /* The blob holds the pointer value itself, not the structure. */
        sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
        sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
        sqlite3_step(pStmt);

        /* sqlite3_finalize() reports the error from a failed step. */
        return sqlite3_finalize(pStmt);
      }

      /* Look up tokenizer zName on connection db. On return *pp is the
      ** registered module, or 0 if the lookup failed. */
      int queryTokenizer(
        sqlite3 *db,
        const char *zName,
        const sqlite3_tokenizer_module **pp
      ){
        int rc;
        sqlite3_stmt *pStmt;
        const char zSql[] = "SELECT fts2_tokenizer(?)";

        *pp = 0;
        rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
        if( rc!=SQLITE_OK ){
          return rc;
        }

        sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
        if( SQLITE_ROW==sqlite3_step(pStmt) ){
          if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB ){
            /* Copy the pointer-sized blob into *pp. */
            memcpy(pp, sqlite3_column_blob(pStmt, 0), sizeof(*pp));
          }
        }

        return sqlite3_finalize(pStmt);
      }