1. FTS3 Tokenizers

  When creating a new full-text table, FTS3 allows the user to select
  the text tokenizer implementation to be used when indexing text
  by specifying a "tokenize" clause as part of the CREATE VIRTUAL TABLE
  statement:

    CREATE VIRTUAL TABLE <table-name> USING fts3(
      <columns ...> [, tokenize <tokenizer-name> [<tokenizer-args>]]
    );

  The built-in tokenizers (valid values to pass as <tokenizer-name>) are
  "simple", "porter" and "unicode61".

  <tokenizer-args> should consist of zero or more white-space separated
  arguments to pass to the selected tokenizer implementation. The
  interpretation of the arguments, if any, depends on the individual
  tokenizer.

2. Custom Tokenizers

  FTS3 allows users to provide custom tokenizer implementations. The
  interface used to create a new tokenizer is defined and described in
  the fts3_tokenizer.h source file.

  Registering a new FTS3 tokenizer is similar to registering a new
  virtual table module with SQLite. The user passes a pointer to a
  structure containing pointers to various callback functions that
  make up the implementation of the new tokenizer type. For tokenizers,
  the structure (defined in fts3_tokenizer.h) is called
  "sqlite3_tokenizer_module".

  FTS3 does not expose a C function that users call to register new
  tokenizer types with a database handle. Instead, the pointer must
  be encoded as an SQL blob value and passed to FTS3 through the SQL
  engine by evaluating a special scalar function, "fts3_tokenizer()".
  The fts3_tokenizer() function may be called with one or two arguments,
  as follows:

    SELECT fts3_tokenizer(<tokenizer-name>);
    SELECT fts3_tokenizer(<tokenizer-name>, <sqlite3_tokenizer_module ptr>);

  Where <tokenizer-name> is a string identifying the tokenizer and
  <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
  structure encoded as an SQL blob. If the second argument is present,
  it is registered as tokenizer <tokenizer-name> and a copy of it is
  returned. If only one argument is passed, a pointer to the tokenizer
  implementation currently registered as <tokenizer-name> is returned,
  encoded as a blob. Or, if no such tokenizer exists, an SQL exception
  (error) is raised.
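
  As a rough illustration of what "a pointer encoded as an SQL blob"
  means, the sketch below copies the bytes of a pointer value into a
  buffer and reads them back. It does not call SQLite. The names
  demo_module, encodeModulePtr() and decodeModulePtr() are hypothetical
  and used only here, with demo_module standing in for
  sqlite3_tokenizer_module.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for sqlite3_tokenizer_module. The real structure
** is declared in fts3_tokenizer.h. */
typedef struct demo_module demo_module;
struct demo_module {
  int iVersion;
};

/* Copy the pointer value itself (not the structure it points to) into
** aBlob, which must have room for sizeof(p) bytes. Returns the number of
** bytes written. */
static size_t encodeModulePtr(const demo_module *p, unsigned char *aBlob){
  assert( aBlob!=0 );
  memcpy(aBlob, &p, sizeof(p));
  return sizeof(p);
}

/* Recover a pointer from a blob. Returns 0 if the blob is missing or is
** not exactly the size of a pointer. */
static const demo_module *decodeModulePtr(
  const unsigned char *aBlob,
  size_t nBlob
){
  const demo_module *p = 0;
  if( aBlob!=0 && nBlob==sizeof(p) ){
    memcpy(&p, aBlob, sizeof(p));
  }
  return p;
}
```

  The blob carries only the address, not a copy of the structure, so the
  structure it points to presumably needs to remain valid for as long as
  the tokenizer is registered.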

  SECURITY: If the fts3 extension is used in an environment where potentially
    malicious users may execute arbitrary SQL (e.g. Gears), they should be
    prevented from invoking the fts3_tokenizer() function. The
    fts3_tokenizer() function is disabled by default. It is only enabled
    by the SQLITE_DBCONFIG_ENABLE_FTS3_TOKENIZER option to
    sqlite3_db_config(). Do not enable it in security-sensitive
    environments.

  See "Sample code" below for an example of calling the fts3_tokenizer()
  function from C code.

3. ICU Library Tokenizers

  If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor
  symbol defined, then there exists a built-in tokenizer named "icu"
  implemented using the ICU library. The first argument passed to the
  xCreate() method (see fts3_tokenizer.h) of this tokenizer may be
  an ICU locale identifier, such as "tr_TR" for Turkish as used
  in Turkey, or "en_AU" for English as used in Australia. For example:

    CREATE VIRTUAL TABLE thai_text USING fts3(text, tokenize icu th_TH);

  The ICU tokenizer implementation is very simple. It splits the input
  text according to the ICU rules for finding word boundaries and discards
  any tokens that consist entirely of white-space. This may be suitable
  for some applications in some locales, but not all. If more complex
  processing is required, for example to implement stemming or
  discard punctuation, this can be done by creating a tokenizer
  implementation that uses the ICU tokenizer as part of its implementation.

  When using the ICU tokenizer this way, it is safe to overwrite the
  contents of the strings returned by the xNext() method (see
  fts3_tokenizer.h).
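
  A wrapper tokenizer of the kind described above might post-process
  each token returned by the ICU tokenizer's xNext() with helpers like
  the following. This is a minimal, ASCII-only sketch: tokenIsNoise()
  and foldTokenAscii() are hypothetical names, and a real wrapper would
  more likely use ICU's own character classification so that non-ASCII
  punctuation and case folding are handled.

```c
#include <assert.h>
#include <ctype.h>

/* Return 1 if the n-byte token z contains only ASCII white-space or
** punctuation (or is empty), else 0. In the default "C" locale, bytes
** >= 0x80 (UTF-8 multi-byte sequences) are neither, so such tokens are
** treated as word content. */
static int tokenIsNoise(const char *z, int n){
  int i;
  assert( z!=0 || n==0 );
  for(i=0; i<n; i++){
    unsigned char c = (unsigned char)z[i];
    if( !isspace(c) && !ispunct(c) ) return 0;
  }
  return 1;
}

/* Fold ASCII upper-case letters to lower-case, overwriting the token
** in place. Other bytes are left unchanged in the "C" locale. */
static void foldTokenAscii(char *z, int n){
  int i;
  assert( z!=0 || n==0 );
  for(i=0; i<n; i++){
    z[i] = (char)tolower((unsigned char)z[i]);
  }
}
```

  The wrapper's own xNext() could skip any token for which tokenIsNoise()
  returns 1 and fold the remaining tokens before returning them.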

4. Sample code

  The following two code samples illustrate the way C code should invoke
  the fts3_tokenizer() scalar function:

      #include <string.h>           /* memcpy() */
      #include "sqlite3.h"
      #include "fts3_tokenizer.h"   /* sqlite3_tokenizer_module */

      /* Register module p as the tokenizer named zName. */
      int registerTokenizer(
        sqlite3 *db,
        const char *zName,
        const sqlite3_tokenizer_module *p
      ){
        int rc;
        sqlite3_stmt *pStmt;
        const char zSql[] = "SELECT fts3_tokenizer(?, ?)";

        rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
        if( rc!=SQLITE_OK ){
          return rc;
        }

        /* Bind the pointer value itself (sizeof(p) bytes) as the blob. */
        sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
        sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
        sqlite3_step(pStmt);

        /* sqlite3_finalize() reports any error raised by sqlite3_step(). */
        return sqlite3_finalize(pStmt);
      }

      /* Look up the tokenizer named zName. *pp is set to 0 if no module
      ** pointer is obtained. */
      int queryTokenizer(
        sqlite3 *db,
        const char *zName,
        const sqlite3_tokenizer_module **pp
      ){
        int rc;
        sqlite3_stmt *pStmt;
        const char zSql[] = "SELECT fts3_tokenizer(?)";

        *pp = 0;
        rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
        if( rc!=SQLITE_OK ){
          return rc;
        }

        sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
        if( SQLITE_ROW==sqlite3_step(pStmt) ){
          /* Accept the result only if it is a blob of pointer size. */
          if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB
           && sqlite3_column_bytes(pStmt, 0)==(int)sizeof(*pp)
          ){
            memcpy(pp, sqlite3_column_blob(pStmt, 0), sizeof(*pp));
          }
        }

        return sqlite3_finalize(pStmt);
      }