Product Documentation

Full-Text Search

Full-Text Search Custom Tokenizer

In addition to the default tokenizers provided with FairCom DB FTS, support is available to call a DLL containing a custom tokenizer. FairCom Full-Text Search allows programmers to create their own full-text tokenizer and set it as the tokenizer to be used in a full-text index.

To use a custom tokenizer, the programmer must do the following:

  1. Provide a DLL that the server can load.
  2. When creating the full-text index, use the ctdbSetFTIOption(pFTI, CTDB_FTI_OPTION_CUSTLIB...) function to specify the library name (without the .DLL extension on Windows and without the lib prefix on Unix).
  3. Use the ctdbSetFTIOption(pFTI, CTDB_FTI_OPTION_CUSTPARAM...) function to pass custom configuration parameters to the tokenizer.

The c-tree source code in the sdk\Xtras\ctree.samples\special\tokenizer directory contains an example of a custom tokenizer, easytok.c, and a stub, tokenizer.c, that programmers can fill in to implement their own tokenizer. Neither file depends on any c-tree code, so each can be compiled as a DLL (for instance, cl /LD easytok.c on Windows) or as a shared library on Unix and then copied to a location from which the server can load it.

See the complete FairCom Full-Text Search documentation for a list of the functions that must be implemented.

Tokenizer_init

Initialize the tokenizer. This function is called:

  1. at index creation, to verify that the shared library/DLL can be opened, that its functions can be resolved, and that they are callable.
  2. before tokenizing the text to be indexed.
  3. when parsing a search query, to tokenize the request.

Parameters:

  • texttype [IN] - the type of the text passed in: CTDB_FTI_MODE_REG, CTDB_FTI_MODE_UTF8, CTDB_FTI_MODE_UTF16
  • text [IN] - text to be tokenized. The memory it points to is guaranteed to be valid until the Tokenizer_reset or the Tokenizer_end call.
    Note that the text is passed only at init time, so it is the implementer's responsibility to keep track of it and of the "current position".
  • textsize [IN] - size (in bytes) of the text passed in
  • maxtokensize [IN] - maximum token length (in bytes) that the index has been set to
  • param [IN] - tokenizer parameter string passed by the application.
  • errcode [OUT] - error code. (Guaranteed to be != NULL)

Returns:

Tokenizer context handle that will be passed to the other functions.

NULL in case of error.

Usage:

DLLexport void* Tokenizer_init (unsigned long texttype, char* text, size_t textsize, long maxtokensize, char* param, int* errcode)
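
The following is a minimal sketch of what an implementation of this function might look like; it is not the code shipped in easytok.c. The SketchCtx structure, the SKETCH_* error values, and the empty DLLexport fallback are hypothetical names introduced only for illustration. The sketch stores the text pointer without copying it (the caller guarantees its lifetime), records a "current position" of 0, and allocates a buffer for returned tokens. The source does not say what text is passed during the index-creation check, so a NULL text is treated here as empty.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumption: DLLexport is normally supplied by the build; it is defined
   empty here only so that the sketch compiles on its own. */
#ifndef DLLexport
#define DLLexport
#endif

/* Hypothetical error values; a real tokenizer would return c-tree codes. */
#define SKETCH_OK      0
#define SKETCH_ERR_ARG 1
#define SKETCH_ERR_MEM 2

/* Hypothetical context; the server only ever sees it as a void* handle. */
typedef struct {
    unsigned long texttype;  /* CTDB_FTI_MODE_* value passed in */
    char *text;              /* borrowed from the caller, never copied */
    size_t textsize;         /* size of text in bytes */
    size_t pos;              /* "current position": next unread byte */
    long maxtokensize;       /* maximum token length in bytes */
    char *token;             /* buffer for the most recently returned token */
} SketchCtx;

DLLexport void *Tokenizer_init(unsigned long texttype, char *text,
                               size_t textsize, long maxtokensize,
                               char *param, int *errcode)
{
    SketchCtx *ctx;

    (void)param;  /* this sketch ignores the parameter string */

    if (maxtokensize <= 0) {
        *errcode = SKETCH_ERR_ARG;
        return NULL;
    }
    ctx = malloc(sizeof *ctx);
    if (ctx == NULL) {
        *errcode = SKETCH_ERR_MEM;
        return NULL;
    }
    /* One extra byte for the terminating '\0'. */
    ctx->token = malloc((size_t)maxtokensize + 1);
    if (ctx->token == NULL) {
        free(ctx);
        *errcode = SKETCH_ERR_MEM;
        return NULL;
    }
    ctx->texttype = texttype;
    ctx->text = text;
    ctx->textsize = (text != NULL) ? textsize : 0;
    ctx->pos = 0;
    ctx->maxtokensize = maxtokensize;
    *errcode = SKETCH_OK;
    return ctx;
}
```

Keeping all state in the returned handle, rather than in globals, lets several tokenizations be in progress at once, which seems prudent for a library loaded by a server.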

Tokenizer_reset

Resets the text and its size for an already initialized tokenizer.

This function is used mainly during searches to tokenize the various search items of a search query.

Parameters:

  • handle [IN] - tokenizer context handle that needs the text to be refreshed.
  • text [IN] - text to be tokenized. The memory it points to is guaranteed to be valid until the next Tokenizer_reset or the Tokenizer_end call.
  • textsize [IN] - size (in bytes) of the text passed in.

Returns:

CTDBRET_OK if successful, or the c-tree error code on failure.

Usage:

DLLexport int Tokenizer_reset (void *handle, char* text, size_t textsize)
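
As an illustration only, a reset could simply swap the borrowed text pointer and rewind the current position, leaving everything else in the context untouched. The SketchCtx structure (reduced here to the fields that reset touches) and the SKETCH_* values are hypothetical; SKETCH_OK stands in for CTDBRET_OK and is assumed to be 0, so take the real value from the c-tree headers.

```c
#include <stddef.h>

/* Assumption: DLLexport is normally supplied by the build; it is defined
   empty here only so that the sketch compiles on its own. */
#ifndef DLLexport
#define DLLexport
#endif

#define SKETCH_OK      0  /* assumption: stands in for CTDBRET_OK */
#define SKETCH_ERR_ARG 1  /* hypothetical error value */

/* Hypothetical context, reduced to the fields that reset touches. */
typedef struct {
    char *text;       /* borrowed from the caller, never copied */
    size_t textsize;  /* size of text in bytes */
    size_t pos;       /* "current position": next unread byte */
} SketchCtx;

DLLexport int Tokenizer_reset(void *handle, char *text, size_t textsize)
{
    SketchCtx *ctx = handle;

    if (ctx == NULL)
        return SKETCH_ERR_ARG;

    /* The previous text is not freed: it was never owned by the tokenizer. */
    ctx->text = text;
    ctx->textsize = (text != NULL) ? textsize : 0;
    ctx->pos = 0;  /* start again from the beginning of the new text */
    return SKETCH_OK;
}
```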

Tokenizer_next

Determines and returns the next token in the text.

Parameters:

  • handle [IN] - tokenizer context handle
  • size [OUT] - length in bytes of the returned token. (Guaranteed to be != NULL)
    On error, it must be set to a nonzero value.
    At end-of-text (no more tokens), it must be set to 0.

Returns:

A '\0'-terminated string containing the next token. The returned pointer must refer to memory that stays valid until the next tokenizer function call.

NULL in case of error (size != 0) or end-of-text (size == 0).

Usage:

DLLexport char *Tokenizer_next (void* handle, int *size)
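
To make the return convention concrete, here is a sketch of a whitespace-splitting implementation for single-byte text; it is one possible tokenizer, not the algorithm used by easytok.c. The SketchCtx structure is hypothetical, as is the choice to truncate over-long words to maxtokensize bytes (the source does not say how such tokens should be handled). The token is copied into a buffer owned by the context, so the returned pointer stays valid until the next tokenizer call.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumption: DLLexport is normally supplied by the build; it is defined
   empty here only so that the sketch compiles on its own. */
#ifndef DLLexport
#define DLLexport
#endif

/* Hypothetical context, assumed to have been filled in at init time. */
typedef struct {
    char *text;         /* borrowed from the caller, never copied */
    size_t textsize;    /* size of text in bytes */
    size_t pos;         /* "current position": next unread byte */
    long maxtokensize;  /* maximum token length in bytes */
    char *token;        /* buffer of at least maxtokensize + 1 bytes */
} SketchCtx;

DLLexport char *Tokenizer_next(void *handle, int *size)
{
    SketchCtx *ctx = handle;
    size_t start, len;

    if (ctx == NULL || ctx->token == NULL || ctx->maxtokensize <= 0) {
        *size = -1;  /* error: NULL return with size != 0 */
        return NULL;
    }

    /* Skip the separators in front of the next token. */
    while (ctx->pos < ctx->textsize &&
           isspace((unsigned char)ctx->text[ctx->pos]))
        ctx->pos++;

    if (ctx->pos >= ctx->textsize) {
        *size = 0;   /* end-of-text: NULL return with size == 0 */
        return NULL;
    }

    /* Consume the whole word, even if only part of it is returned. */
    start = ctx->pos;
    while (ctx->pos < ctx->textsize &&
           !isspace((unsigned char)ctx->text[ctx->pos]))
        ctx->pos++;

    len = ctx->pos - start;
    if (len > (size_t)ctx->maxtokensize)
        len = (size_t)ctx->maxtokensize;  /* sketch choice: truncate */

    memcpy(ctx->token, ctx->text + start, len);
    ctx->token[len] = '\0';
    *size = (int)len;
    return ctx->token;
}
```

A caller would typically loop until NULL is returned and then inspect size to tell end-of-text apart from an error.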

Tokenizer_end

Terminates the use of the tokenizer. It is the implementer's responsibility to release any resource it might have allocated.

Parameters:

  • handle [IN] - tokenizer context handle.

Usage:

DLLexport void Tokenizer_end (void* handle)
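
A minimal sketch of the cleanup, assuming a hypothetical SketchCtx context in which the tokenizer allocated both the context itself and a token buffer. The text pointer is deliberately not freed, because that memory belongs to the caller.

```c
#include <stdlib.h>

/* Assumption: DLLexport is normally supplied by the build; it is defined
   empty here only so that the sketch compiles on its own. */
#ifndef DLLexport
#define DLLexport
#endif

/* Hypothetical context, reduced to the fields relevant to cleanup. */
typedef struct {
    char *text;   /* borrowed from the caller: must NOT be freed here */
    char *token;  /* allocated by the tokenizer: must be freed here */
} SketchCtx;

DLLexport void Tokenizer_end(void *handle)
{
    SketchCtx *ctx = handle;

    if (ctx == NULL)
        return;        /* tolerate a handle that was never created */

    free(ctx->token);  /* free(NULL) is a no-op, so no extra check needed */
    free(ctx);
}
```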
