1. FTS2 Tokenizers

  When creating a new full-text table, FTS2 allows the user to select
  the text tokenizer implementation to be used when indexing text
  by specifying a "tokenizer" clause as part of the CREATE VIRTUAL TABLE
  statement:

    CREATE VIRTUAL TABLE <table-name> USING fts2(
      <columns ...> [, tokenizer <tokenizer-name> [<tokenizer-args>]]
    );

  The built-in tokenizers (valid values to pass as <tokenizer-name>) are
  "simple" and "porter".

  <tokenizer-args> should consist of zero or more white-space separated
  arguments to pass to the selected tokenizer implementation. The
  interpretation of the arguments, if any, depends on the individual
  tokenizer.

2. Custom Tokenizers

  FTS2 allows users to provide custom tokenizer implementations. The
  interface used to create a new tokenizer is defined and described in
  the fts2_tokenizer.h source file.

  Registering a new FTS2 tokenizer is similar to registering a new
  virtual table module with SQLite. The user passes a pointer to a
  structure containing pointers to various callback functions that
  make up the implementation of the new tokenizer type. For tokenizers,
  the structure (defined in fts2_tokenizer.h) is called
  "sqlite3_tokenizer_module".

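  The "structure of callback pointers" pattern described above can be
  sketched roughly as follows. This is only an illustration under stated
  assumptions: demo_tokenizer_module, demo_cursor, demoOpen, demoNext and
  demoCountTokens are hypothetical names, and the callback set is
  deliberately reduced. The real methods and their signatures are the ones
  declared for sqlite3_tokenizer_module in fts2_tokenizer.h.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical cursor: a position within a nul-terminated input string. */
typedef struct demo_cursor demo_cursor;
struct demo_cursor {
  const char *zInput;   /* Text being tokenized */
  int iOffset;          /* Current byte offset into zInput */
};

/* Hypothetical, reduced table of callbacks. It is NOT the real
** sqlite3_tokenizer_module, which is declared in fts2_tokenizer.h. */
typedef struct demo_tokenizer_module demo_tokenizer_module;
struct demo_tokenizer_module {
  int iVersion;
  void (*xOpen)(demo_cursor *pCsr, const char *zInput);
  int (*xNext)(demo_cursor *pCsr, const char **pzToken, int *pnToken);
};

/* Point the cursor at the start of a new input string. */
static void demoOpen(demo_cursor *pCsr, const char *zInput){
  pCsr->zInput = zInput;
  pCsr->iOffset = 0;
}

/* Find the next run of alphanumeric bytes. Returns 1 and sets the two
** output parameters if a token was found, or 0 at end of input. */
static int demoNext(demo_cursor *pCsr, const char **pzToken, int *pnToken){
  const char *z = pCsr->zInput;
  int i = pCsr->iOffset;
  int iStart;
  while( z[i] && !isalnum((unsigned char)z[i]) ) i++;   /* skip separators */
  iStart = i;
  while( z[i] && isalnum((unsigned char)z[i]) ) i++;    /* consume token */
  pCsr->iOffset = i;
  if( i==iStart ) return 0;
  *pzToken = &z[iStart];
  *pnToken = i - iStart;
  return 1;
}

/* The module object: callers only ever see this table of pointers. */
const demo_tokenizer_module demoModule = { 0, demoOpen, demoNext };

/* Count the tokens in zText using nothing but the callback table. */
int demoCountTokens(const demo_tokenizer_module *pMod, const char *zText){
  demo_cursor csr;
  const char *zToken;
  int nToken;
  int n = 0;
  pMod->xOpen(&csr, zText);
  while( pMod->xNext(&csr, &zToken, &nToken) ) n++;
  return n;
}
```

  The caller in this sketch never names demoOpen or demoNext directly. It
  works only through the pointer to the table, which is why a single
  pointer is enough to describe a whole tokenizer implementation.
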
  FTS2 does not expose a C function that users call to register new
  tokenizer types with a database handle. Instead, the pointer must
  be encoded as an SQL blob value and passed to FTS2 through the SQL
  engine by evaluating a special scalar function, "fts2_tokenizer()".
  The fts2_tokenizer() function may be called with one or two arguments,
  as follows:

    SELECT fts2_tokenizer(<tokenizer-name>);
    SELECT fts2_tokenizer(<tokenizer-name>, <sqlite3_tokenizer_module ptr>);

  Where <tokenizer-name> is a string identifying the tokenizer and
  <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
  structure encoded as an SQL blob. If the second argument is present,
  it is registered as tokenizer <tokenizer-name> and a copy of it is
  returned. If only one argument is passed, a pointer to the tokenizer
  implementation currently registered as <tokenizer-name> is returned,
  encoded as a blob. If no such tokenizer exists, an SQL exception
  (error) is raised.

  SECURITY: If the fts2 extension is used in an environment where potentially
  malicious users may execute arbitrary SQL (e.g. Gears), those users should
  be prevented from invoking the fts2_tokenizer() function, for example by
  using the authorizer callback (see sqlite3_set_authorizer()).

  See "Sample code" below for an example of calling the fts2_tokenizer()
  function from C code.

3. ICU Library Tokenizers

  If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor
  symbol defined, then there exists a built-in tokenizer named "icu"
  implemented using the ICU library. The first argument passed to the
  xCreate() method (see fts2_tokenizer.h) of this tokenizer may be
  an ICU locale identifier, such as "tr_TR" for Turkish as used
  in Turkey, or "en_AU" for English as used in Australia. For example:

    "CREATE VIRTUAL TABLE thai_text USING fts2(text, tokenizer icu th_TH)"

  The ICU tokenizer implementation is very simple. It splits the input
  text according to the ICU rules for finding word boundaries and discards
  any tokens that consist entirely of white-space. This may be suitable
  for some applications in some locales, but not all. If more complex
  processing is required, for example to implement stemming or
  discard punctuation, this can be done by creating a tokenizer
  implementation that uses the ICU tokenizer as part of its implementation.

  When using the ICU tokenizer this way, it is safe to overwrite the
  contents of the strings returned by the xNext() method (see
  fts2_tokenizer.h).

4. Sample code

  The following two code samples illustrate the way C code should invoke
  the fts2_tokenizer() scalar function:

    #include <string.h>
    #include "sqlite3.h"
    #include "fts2_tokenizer.h"

    /* Register module p under the name zName on database handle db. */
    int registerTokenizer(
      sqlite3 *db,
      const char *zName,
      const sqlite3_tokenizer_module *p
    ){
      int rc;
      sqlite3_stmt *pStmt;
      const char zSql[] = "SELECT fts2_tokenizer(?, ?)";

      rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
      if( rc!=SQLITE_OK ){
        return rc;
      }

      /* The blob holds the pointer value itself, not the structure. */
      sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
      sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
      sqlite3_step(pStmt);

      /* sqlite3_finalize() reports any error raised by the step. */
      return sqlite3_finalize(pStmt);
    }

    /* Look up the module registered as zName. *pp is left set to 0 if
    ** the lookup fails or the result is not a pointer-sized blob. */
    int queryTokenizer(
      sqlite3 *db,
      const char *zName,
      const sqlite3_tokenizer_module **pp
    ){
      int rc;
      sqlite3_stmt *pStmt;
      const char zSql[] = "SELECT fts2_tokenizer(?)";

      *pp = 0;
      rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
      if( rc!=SQLITE_OK ){
        return rc;
      }

      sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
      if( SQLITE_ROW==sqlite3_step(pStmt) ){
        if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB ){
          const void *pBlob = sqlite3_column_blob(pStmt, 0);
          if( pBlob && sqlite3_column_bytes(pStmt, 0)==(int)sizeof(*pp) ){
            memcpy(pp, pBlob, sizeof(*pp));
          }
        }
      }

      return sqlite3_finalize(pStmt);
    }
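
  Both samples above pass the module pointer as a blob that holds the
  pointer value itself, sizeof(pointer) bytes long, rather than a copy of
  the structure. The sketch below isolates that encoding step without
  using SQLite at all. The names demo_module, encodePointerBlob,
  decodePointerBlob and pointerBlobRoundTrip are hypothetical and exist
  only for this illustration; demo_module stands in for
  sqlite3_tokenizer_module.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for sqlite3_tokenizer_module. The real structure
** is defined in fts2_tokenizer.h and is not reproduced here. */
typedef struct demo_module demo_module;
struct demo_module {
  int iVersion;
};

/* Copy the pointer value (not the structure it points to) into aBuf.
** This mirrors the bytes that sqlite3_bind_blob(pStmt, 2, &p, sizeof(p),
** ...) hands to SQLite. Returns the number of bytes written, or -1 if
** the buffer is too small. */
int encodePointerBlob(const demo_module *p, unsigned char *aBuf, int nBuf){
  if( nBuf<(int)sizeof(p) ) return -1;
  memcpy(aBuf, &p, sizeof(p));
  return (int)sizeof(p);
}

/* Recover a pointer from a blob. Returns 0 on success, or -1 if the blob
** is missing or is not exactly the size of a pointer. */
int decodePointerBlob(const void *pBlob, int nBlob, const demo_module **pp){
  *pp = 0;
  if( pBlob==0 || nBlob!=(int)sizeof(*pp) ) return -1;
  memcpy(pp, pBlob, sizeof(*pp));
  return 0;
}

/* Round trip: returns 1 if the decoded pointer is the original address. */
int pointerBlobRoundTrip(void){
  static const demo_module sModule = { 0 };
  unsigned char aBuf[sizeof(void*)];
  const demo_module *pOut = 0;
  int n = encodePointerBlob(&sModule, aBuf, (int)sizeof(aBuf));
  assert( n==(int)sizeof(void*) );
  if( decodePointerBlob(aBuf, n, &pOut)!=0 ) return 0;
  return pOut==&sModule;
}
```

  Because only the address travels through the SQL engine, the structure
  it points to has to remain valid for as long as the tokenizer may be
  used, and a blob of any other length cannot be a valid pointer. This is
  also why untrusted SQL must not be allowed to call fts2_tokenizer().
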