SIMPLE tokenizer
Example SQL full-text searches using SIMPLE tokenizer
Abstract
example SQL full-text searches using SIMPLE tokenizer
The SIMPLE tokenizer identifies each word using whitespace and/or punctuation. Then regardless of case, the tokenizer finds all matching words in the search. tokenizer is case insensitive.
A search string identifies each word using whitespace and/or punctuation and may use double quotes to identify a word or a phrase. Phrases contain two or more words within double quotes and may include whitespace between words. Punctuation, such as .!>, is ignored in the search string when it is not a full-text search operator. Reserved search operators include - " ^ * ( ) NEAR NEAR/n AND OR NOT; However, * ( ) OR NOT are not supported and will return an error when used.
/* ===================================================================== * Full-text search demo using the SIMPLE tokenizer. * ===================================================================== */ /* Create the testsearch table. */ CREATE TABLE testsearch ( name VARCHAR(30) , description VARCHAR(500) ) STORAGE_ATTRIBUTES 'hotAlter; huge; rowid_fld; noRecbyt_idx;' ; /* Create full-text index on the description field of the testsearch table. Use the PORTER English word stem tokenizer to do case insensitive matching of common word forms */ CREATE FULLTEXT INDEX testsearch_description_simple_fts ON testsearch( description ) UPDATE_MODE ASYNC ; /* Insert a record into the testsearch table. */ INSERT INTO testsearch VALUES('Mike','Americans --love-- their __BASEball__ and <FOOTball>!') ; /* Look at the record inserted into the testsearch table. */ SELECT description FROM testsearch; /* ===================================================================== * The following queries find the inserted record * by searching for words and phrases in the description field: * 'Americans --love-- their __BASEball__ and <FOOTball>!'. * * The SIMPLE tokenizer reduces the text to * 'americans love their baseball and football'. * * The tokenizer identifies each word using whitespace and/or punctuation. * The tokenizer is case insensitive. * A search string identifies each word using whitespace and/or punctuation. * A search string may use double quotes to identify a word or a phrase. * A phrase contains two or more words within double quotes. * A search string may include any amount of whitespace between words. * A search string may include * Punctuation, such as .!>, is ignored in the search string * when it is not a full-text search operator. * Reserved search operators include - " ^ * ( ) NEAR NEAR/n AND OR NOT * The following search operators are unsupported: * ( ) OR NOT * The server returns an error when you use an unsupported search operator. * ===================================================================== */ /* WORD SEARCH: Find records in the testsearch table where the description column contains the word 'baseball' anywhere in the description field. */ SELECT description FROM testsearch WHERE description MATCH 'baseball'; /* CASE INSENSITIVE SEARCH: Find records in the testsearch table where the description column contains case-insensitive forms of the word 'americans', such as 'Americans' and 'AMERIcans' anywhere in the description field. */ SELECT description FROM testsearch WHERE description MATCH 'americans'; /* FIRST WORD SEARCH: Find records in the testsearch table where the description column contains the word 'americans' as the first word in the description field. */ SELECT description FROM testsearch WHERE description MATCH '^americans'; /* FIRST PHRASE SEARCH: Find records in the testsearch table where the description column contains the phrase "americans love" at the begining of the text in the description field. */ SELECT description FROM testsearch WHERE description MATCH ' ^"americans love" '; /* MULTI-WORD SEARCH: Find records in the testsearch table where the description column contains the words 'baseball' and 'love' in any order anywhere in the description field. */ SELECT description FROM testsearch WHERE description MATCH 'baseball love'; /* MULTI-WORD SEARCH ACROSS WHITESPACE: Find records in the testsearch table where the description column contains the words 'baseball' and 'love' in any order anywhere in the description field. NOTE: Any amount of whitespace may occur between these words in the search text. Punctuation, such as .!>, is also ignored in the search string when it is not a full-text search operator. */ SELECT description FROM testsearch WHERE description MATCH ' <Baseball> >>> .LOVE. !!! '; /* PHRASE SEARCH: Find records in the testsearch table where the description column contains the case-insensitive phrase "love their" anywhere in the description field. The words 'love" and 'their" must follow each other with no intervening words in the description field. */ SELECT description FROM testsearch WHERE description MATCH ' "love their" '; /* PHRASE SEARCH ACROSS WHITESPACE: Find records in the testsearch table where the description column contains the case-insensitive phrase "LOVE their" anywhere in the description field. The case-insensitive words 'LOVE" and 'theIR" must follow each other with no intervening words, but any amount of whitespace and unreserved punctuation may occur between these words in the description field. */ SELECT description FROM testsearch WHERE description MATCH ' "LOVE! theIR" '; /* NEAR SEARCH: Find records in the testsearch table where the description column contains the words 'Americans' and 'baseball' with no more than ten words between each other anywhere in the description field. */ SELECT description FROM testsearch WHERE description MATCH 'Americans NEAR football '; /* NEAR PHRASE SEARCH: Find records in the testsearch table where the description column contains the words 'their' and 'baseballs' with no words between each other. This is the same as using a phrase search. */ SELECT description FROM testsearch WHERE description MATCH 'their NEAR/0 baseball'; /* NEAR N SEARCH: Find records in the testsearch table where the description column contains the words 'love' and 'baseball' with no more than 1 word between each other. This pattern may occur anywhere in the description field with any amount of whitespace and punctuation between the words. */ SELECT description FROM testsearch WHERE description MATCH ' baseball NEAR/1 football '; /* NOT WORD SEARCH: Find records in the testsearch table where the description column contains the words 'love' and 'baseball' but not 'karate'. NOTE: The minus - operator is the NOT operator. It must be adjacent to the word with no intervening whitespace. You must also include at least one term in the search phrase before the - operator. */ SELECT description FROM testsearch WHERE description MATCH 'love baseball -karate'; /* NOT PHRASE SEARCH: Find records in the testsearch table where the description column contains the word 'baseball' but not the phrase "hate karate" and not the word 'soccer'. NOTE: The minus operator is the NOT operator. It must be adjacent to the word with no intervening whitespace. You must also include a term in the search phrase before the NOT operator. */ SELECT description FROM testsearch WHERE description MATCH 'baseball -"hate karate" -soccer'; /* AND SEARCH: Find records in the testsearch table where the description column contains the word 'baseball' and 'loved'. NOTE: The AND operator is optional. The server automatically adds it between each term in the search expression. The following two queries are functionally identical. */ SELECT description FROM testsearch WHERE description MATCH ' love AND "their baseball" '; SELECT description FROM testsearch WHERE description MATCH ' love "their baseball" '; /* SEARCH for reserved words "and", "or", and "near": Find records in the testsearch table where the description column contains the words "baseball" and "and" but not "or" and "near". NOTE: enclose a reserved word in double quotes to treat it as a word instead of an operator. */ SELECT description FROM testsearch WHERE description MATCH ' love "and" -"or" -"near" '; /* ===================================================================== * The following queries do NOT find the inserted record * because they do not match the text: * 'Americans --love-- their __BASEball__ and <FOOTball>!' * ===================================================================== */ /* VERB STEM SEARCH: This query returns no results because the SIMPLE tokenizer does not match common English conjugations of verbs, such as 'loving', 'love', 'loved', and 'loves'. The PORTER tokenizer does. */ SELECT description FROM testsearch WHERE description MATCH 'loving'; /* NOUN STEM SEARCH: This query returns no results because the SIMPLE tokenizer does not match common English singular and plural forms of nouns, such as 'american' and 'americans'. The PORTER tokenizer does. */ SELECT description FROM testsearch WHERE description MATCH 'american'; /* FIRST WORD SEARCH: This query returns no results because 'love' is not the first word in the description. */ SELECT description FROM testsearch WHERE description MATCH '^love'; /* STEMMED MULTI-WORD SEARCH: This query returns no results because the SIMPLE tokenizer does not match word stems like the PORTER tokenizer does. */ SELECT description FROM testsearch WHERE description MATCH 'Baseballs american'; /* STEMMED NEAR SEARCH: This query returns no results because the SIMPLE tokenizer does not match word stems like the PORTER tokenizer does. */ SELECT description FROM testsearch WHERE description MATCH 'American NEAR baseballs '; /* STEMMED PHRASE SEARCH: This query returns no results because the SIMPLE tokenizer does not match word stems like the PORTER tokenizer does. */ SELECT description FROM testsearch WHERE description MATCH 'their NEAR/0 baseballs '; /* STEMMED NEAR N SEARCH: This query returns no results because the SIMPLE tokenizer does not match word stems like the PORTER tokenizer does. */ SELECT description FROM testsearch WHERE description MATCH 'loves NEAR/1 baseballs '; /* Remove the testsearch table. */ DROP TABLE testsearch;