1. FTS3 Tokenizers

  When creating a new full-text table, FTS3 allows the user to select
  the text tokenizer implementation to be used when indexing text
  by specifying a "tokenize" clause as part of the CREATE VIRTUAL TABLE
  statement:

    CREATE VIRTUAL TABLE <table-name> USING fts3(
      <columns ...> [, tokenize <tokenizer-name> [<tokenizer-args>]]
    );

  The built-in tokenizers (valid values to pass as <tokenizer-name>) are
  "simple", "porter" and "unicode61".

  <tokenizer-args> should consist of zero or more white-space separated
  arguments to pass to the selected tokenizer implementation. The
  interpretation of the arguments, if any, depends on the individual
  tokenizer.

2. Custom Tokenizers

  FTS3 allows users to provide custom tokenizer implementations. The
  interface used to create a new tokenizer is defined and described in
  the fts3_tokenizer.h source file.

  Registering a new FTS3 tokenizer is similar to registering a new
  virtual table module with SQLite. The user passes a pointer to a
  structure containing pointers to various callback functions that
  make up the implementation of the new tokenizer type. For tokenizers,
  the structure (defined in fts3_tokenizer.h) is called
  "sqlite3_tokenizer_module".

  FTS3 does not expose a C-function that users call to register new
  tokenizer types with a database handle.
  Instead, the pointer must be encoded as an SQL blob value and passed
  to FTS3 through the SQL engine by evaluating a special scalar
  function, "fts3_tokenizer()". The fts3_tokenizer() function may be
  called with one or two arguments, as follows:

    SELECT fts3_tokenizer(<tokenizer-name>);
    SELECT fts3_tokenizer(<tokenizer-name>, <sqlite3_tokenizer_module ptr>);

  Where <tokenizer-name> is a string identifying the tokenizer and
  <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
  structure encoded as an SQL blob. If the second argument is present,
  it is registered as tokenizer <tokenizer-name> and a copy of it is
  returned. If only one argument is passed, a pointer to the tokenizer
  implementation currently registered as <tokenizer-name> is returned,
  encoded as a blob. If no such tokenizer exists, an SQL exception
  (error) is raised.

  SECURITY: If the fts3 extension is used in an environment where potentially
  malicious users may execute arbitrary SQL (e.g. Gears), they should be
  prevented from invoking the fts3_tokenizer() function. The
  fts3_tokenizer() function is disabled by default. It can be enabled
  only through the SQLITE_DBCONFIG_ENABLE_FTS3_TOKENIZER option of
  sqlite3_db_config(). Do not enable it in security sensitive
  environments.

  See "Sample Code" below for an example of calling the fts3_tokenizer()
  function from C code.

3. ICU Library Tokenizers

  If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor
  symbol defined, then there exists a built-in tokenizer named "icu"
  implemented using the ICU library. The first argument passed to the
  xCreate() method (see fts3_tokenizer.h) of this tokenizer may be
  an ICU locale identifier, such as "tr_TR" for Turkish as used
  in Turkey, or "en_AU" for English as used in Australia. For example:

    "CREATE VIRTUAL TABLE thai_text USING fts3(text, tokenize icu th_TH)"

  The ICU tokenizer implementation is very simple. It splits the input
  text according to the ICU rules for finding word boundaries and discards
  any tokens that consist entirely of white-space. This may be suitable
  for some applications in some locales, but not all. If more complex
  processing is required, for example to implement stemming or to
  discard punctuation, this can be done by creating a tokenizer
  implementation that uses the ICU tokenizer as part of its implementation.

  When using the ICU tokenizer this way, it is safe to overwrite the
  contents of the strings returned by the xNext() method (see
  fts3_tokenizer.h).

4. Sample Code

  The following two code samples illustrate the way C code should invoke
  the fts3_tokenizer() scalar function:

      #include <string.h>
      #include "sqlite3.h"
      #include "fts3_tokenizer.h"

      /* Register module p as tokenizer zName using database handle db. */
      int registerTokenizer(
        sqlite3 *db,
        const char *zName,
        const sqlite3_tokenizer_module *p
      ){
        int rc;
        sqlite3_stmt *pStmt;
        const char zSql[] = "SELECT fts3_tokenizer(?, ?)";

        rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
        if( rc!=SQLITE_OK ){
          return rc;
        }

        /* The blob holds the pointer value itself, not the structure. */
        sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
        sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
        sqlite3_step(pStmt);

        return sqlite3_finalize(pStmt);
      }

      /* Set *pp to the module registered as zName, or to 0 if none. */
      int queryTokenizer(
        sqlite3 *db,
        const char *zName,
        const sqlite3_tokenizer_module **pp
      ){
        int rc;
        sqlite3_stmt *pStmt;
        const char zSql[] = "SELECT fts3_tokenizer(?)";

        *pp = 0;
        rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
        if( rc!=SQLITE_OK ){
          return rc;
        }

        sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
        if( SQLITE_ROW==sqlite3_step(pStmt) ){
          /* Copy the pointer only if the blob is exactly pointer-sized. */
          if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB
           && sqlite3_column_bytes(pStmt, 0)==(int)sizeof(*pp)
          ){
            memcpy(pp, sqlite3_column_blob(pStmt, 0), sizeof(*pp));
          }
        }

        return sqlite3_finalize(pStmt);
      }
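
  The blob passed to and returned by fts3_tokenizer() holds the bytes of
  the pointer itself, not a copy of the structure it points to. The
  following is a minimal sketch of that round trip, written without
  SQLite so that it can be compiled on its own. DemoModule,
  encodePointer(), decodePointer() and roundTripDemo() are hypothetical
  names used only for illustration; they are not part of FTS3.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for sqlite3_tokenizer_module. The real structure
** is declared in fts3_tokenizer.h and is not reproduced here. */
typedef struct DemoModule DemoModule;
struct DemoModule {
  int iVersion;
};

/* Copy the pointer value p (not the structure it points to) into aBlob[].
** Return the number of bytes written, or 0 if aBlob[] is too small. */
static size_t encodePointer(
  const DemoModule *p,
  unsigned char *aBlob,
  size_t nBlob
){
  if( nBlob<sizeof(p) ) return 0;
  memcpy(aBlob, &p, sizeof(p));
  return sizeof(p);
}

/* Recover a pointer from aBlob[]. Return 0 unless the blob is exactly
** the size of a pointer. */
static const DemoModule *decodePointer(
  const unsigned char *aBlob,
  size_t nBlob
){
  const DemoModule *p = 0;
  if( nBlob==sizeof(p) ){
    memcpy(&p, aBlob, sizeof(p));
  }
  return p;
}

/* Return 1 if a pointer survives the encode/decode round trip. */
int roundTripDemo(void){
  static const DemoModule module = { 0 };
  unsigned char aBlob[sizeof(const DemoModule *)];
  size_t n = encodePointer(&module, aBlob, sizeof(aBlob));
  assert( n==sizeof(aBlob) );
  return decodePointer(aBlob, n)==&module;
}
```

  roundTripDemo() returns 1 when the decoded pointer equals the original
  address. A blob produced this way is only meaningful inside the process
  that created it, because it contains a raw memory address.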