7.8. Tokenizers

7.8.1. Summary

Groonga has a tokenizer module that tokenizes text. It is used in the following cases:

  • Indexing text

    ../_images/used-when-indexing.png

A tokenizer is used when indexing text.

  • Searching by query

    ../_images/used-when-searching.png

A tokenizer is used when searching by query.

The tokenizer is an important module for full-text search. You can change the trade-off between precision and recall by changing the tokenizer.

Normally, TokenBigram is a suitable tokenizer. If you don’t know much about tokenizers, it’s recommended that you choose TokenBigram.

You can try a tokenizer with the tokenize and table_tokenize commands. Here is an example that tries the TokenBigram tokenizer with tokenize:

Execution example:

tokenize TokenBigram "Hello World"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "He"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "el"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "ll"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "lo"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "o "
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": " W"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "Wo"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "or"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "rl"
#     },
#     {
#       "position": 9,
#       "force_prefix": false,
#       "value": "ld"
#     },
#     {
#       "position": 10,
#       "force_prefix": false,
#       "value": "d"
#     }
#   ]
# ]
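
You can also try a tokenizer through an existing lexicon table with table_tokenize. Here is a minimal sketch, assuming a lexicon table named Words that is created only for this illustration:

table_create Words TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram --normalizer NormalizerAuto
table_tokenize Words "Hello World"

Because the lexicon specifies NormalizerAuto, table_tokenize is expected to return the normalized tokens (hello and world) rather than the raw bigram tokens shown above.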

7.8.2. What is “tokenize”?

“Tokenize” is the process that extracts zero or more tokens from a text. There are several tokenize methods.

For example, Hello World is tokenized to the following tokens by the bigram tokenize method:

  • He
  • el
  • ll
  • lo
  • o_ (_ means a white-space)
  • _W (_ means a white-space)
  • Wo
  • or
  • rl
  • ld

In the above example, 10 tokens are extracted from one text Hello World.

For example, Hello World is tokenized to the following tokens by the white-space-separate tokenize method:

  • Hello
  • World

In the above example, 2 tokens are extracted from one text Hello World.

Tokens are used as search keys. You can find indexed documents only by tokens that are extracted by the tokenize method in use. For example, you can find Hello World by ll with the bigram tokenize method, but you can’t find Hello World by ll with the white-space-separate tokenize method, because the white-space-separate tokenize method doesn’t extract an ll token. It just extracts the Hello and World tokens.
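
To see how tokens work as search keys, here is a minimal sketch of a TokenBigram-based index (the table, column and index names are only for illustration and are not part of the original example set):

table_create Memos TABLE_NO_KEY
column_create Memos content COLUMN_SCALAR ShortText
table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram --normalizer NormalizerAuto
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
load --table Memos
[
{"content": "Hello World"}
]
select Memos --match_columns content --query "ll"

Because the lexicon uses TokenBigram, an ll token is indexed and the ll query is expected to match the Hello World record. If the lexicon used a white-space-separate tokenizer such as TokenDelimit instead, the same query would not match because no ll token would be indexed.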

In general, a tokenize method that generates small tokens increases recall but decreases precision. A tokenize method that generates large tokens increases precision but decreases recall.

For example, we can find both Hello World and A or B by or with the bigram tokenize method. Hello World is noise for people who want to search for the “logical or”. It means that precision is decreased, but recall is increased.

We can find only A or B by or with the white-space-separate tokenize method, because World is tokenized to the single token World with the white-space-separate tokenize method. It means that precision is increased for people who want to search for the “logical or”, but recall is decreased because Hello World, which contains or, isn’t found.

7.8.3. Built-in tokenizers

Here is a list of built-in tokenizers:

  • TokenBigram
  • TokenBigramSplitSymbol
  • TokenBigramSplitSymbolAlpha
  • TokenBigramSplitSymbolAlphaDigit
  • TokenBigramIgnoreBlank
  • TokenBigramIgnoreBlankSplitSymbol
  • TokenBigramIgnoreBlankSplitSymbolAlpha
  • TokenBigramIgnoreBlankSplitSymbolAlphaDigit
  • TokenUnigram
  • TokenTrigram
  • TokenDelimit
  • TokenDelimitNull
  • TokenMecab
  • TokenRegexp

7.8.3.1. TokenBigram

TokenBigram is a bigram-based tokenizer. It’s recommended to use this tokenizer for most cases.

The bigram tokenize method tokenizes a text into tokens of two adjacent characters. For example, Hello is tokenized to the following tokens:

  • He
  • el
  • ll
  • lo

The bigram tokenize method is good for recall because you can find all texts by a query that consists of two or more characters.

In general, you can’t find all texts by a query that consists of one character because one-character tokens don’t exist. But in Groonga you can find all texts even by a one-character query, because Groonga finds tokens that start with the query by predictive search. For example, Groonga can find the ll and lo tokens by the query l.
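
Reusing the Memos sketch above (again, only for illustration), a one-character query looks like this:

select Memos --match_columns content --query "l"

Because the lexicon is a patricia trie (TABLE_PAT_KEY), Groonga can find the indexed tokens ll and lo that start with l by predictive search, so the Hello World record is still expected to match.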

The bigram tokenize method isn’t good for precision because you can find texts that include the query inside a word. For example, you can find world by or. This is more noticeable for ASCII-only languages than for non-ASCII languages. TokenBigram has a solution for this problem, described below.

TokenBigram behaves differently when it is used with a normalizer.

If no normalizer is used, TokenBigram uses the pure bigram tokenize method (all tokens except the last token have two characters):

Execution example:

tokenize TokenBigram "Hello World"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "He"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "el"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "ll"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "lo"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "o "
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": " W"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "Wo"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "or"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "rl"
#     },
#     {
#       "position": 9,
#       "force_prefix": false,
#       "value": "ld"
#     },
#     {
#       "position": 10,
#       "force_prefix": false,
#       "value": "d"
#     }
#   ]
# ]

If a normalizer is used, TokenBigram uses a white-space-separate-like tokenize method for ASCII characters and the bigram tokenize method for non-ASCII characters.

You may be confused by this combined behavior, but it’s reasonable for most use cases such as English text (only ASCII characters) and Japanese text (a mixture of ASCII and non-ASCII characters).

Most languages that consist of only ASCII characters use white-space as the word separator. The white-space-separate tokenize method is suitable for this case.

Languages that consist of non-ASCII characters don’t use white-space as the word separator. The bigram tokenize method is suitable for this case.

The mixed tokenize method is suitable for the mixed-language case.

If you want to use the bigram tokenize method for ASCII characters, see the TokenBigramSplitXXX type tokenizers such as TokenBigramSplitSymbolAlpha.

Let’s confirm TokenBigram behavior by example.

TokenBigram uses one or more white-spaces as a token delimiter for ASCII characters:

Execution example:

tokenize TokenBigram "Hello World" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "hello"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "world"
#     }
#   ]
# ]

TokenBigram uses a character type change as a token delimiter for ASCII characters. The character types are the following:

  • Alphabet
  • Digit
  • Symbol (such as (, ) and !)
  • Hiragana
  • Katakana
  • Kanji
  • Others

The following example shows two token delimiters:

  • between 100 (digits) and cents (alphabets)
  • between cents (alphabets) and !!! (symbols)

Execution example:

tokenize TokenBigram "100cents!!!" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "100"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "cents"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "!!!"
#     }
#   ]
# ]

Here is an example showing that TokenBigram uses the bigram tokenize method for non-ASCII characters.

Execution example:

tokenize TokenBigram "日本語の勉強" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "日本"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "本語"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "語の"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "の勉"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "勉強"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "強"
#     }
#   ]
# ]

7.8.3.2. TokenBigramSplitSymbol

TokenBigramSplitSymbol is similar to TokenBigram. The difference between them is symbol handling. TokenBigramSplitSymbol tokenizes symbols by the bigram tokenize method:

Execution example:

tokenize TokenBigramSplitSymbol "100cents!!!" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "100"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "cents"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

7.8.3.3. TokenBigramSplitSymbolAlpha

TokenBigramSplitSymbolAlpha is similar to TokenBigram. The difference between them is symbol and alphabet handling. TokenBigramSplitSymbolAlpha tokenizes symbols and alphabets by the bigram tokenize method:

Execution example:

tokenize TokenBigramSplitSymbolAlpha "100cents!!!" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "100"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "ce"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "en"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "nt"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "ts"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "s!"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

7.8.3.4. TokenBigramSplitSymbolAlphaDigit

TokenBigramSplitSymbolAlphaDigit is similar to TokenBigram. The difference between them is symbol, alphabet and digit handling. TokenBigramSplitSymbolAlphaDigit tokenizes symbols, alphabets and digits by the bigram tokenize method. It means that all characters are tokenized by the bigram tokenize method:

Execution example:

tokenize TokenBigramSplitSymbolAlphaDigit "100cents!!!" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "10"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "00"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "0c"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "ce"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "en"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "nt"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "ts"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "s!"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 9,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 10,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

7.8.3.5. TokenBigramIgnoreBlank

TokenBigramIgnoreBlank is similar to TokenBigram. The difference between them is blank handling. TokenBigramIgnoreBlank ignores white-spaces in continuous symbols and non-ASCII characters.

You can see the difference between them with the 日 本 語 ! ! ! text because it has symbols and non-ASCII characters.

Here is a result by TokenBigram:

Execution example:

tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "日"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "本"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "語"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

Here is a result by TokenBigramIgnoreBlank:

Execution example:

tokenize TokenBigramIgnoreBlank "日 本 語 ! ! !" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "日本"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "本語"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "語"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "!!!"
#     }
#   ]
# ]

7.8.3.6. TokenBigramIgnoreBlankSplitSymbol

TokenBigramIgnoreBlankSplitSymbol is similar to TokenBigram. The differences between them are the following:

  • Blank handling
  • Symbol handling

TokenBigramIgnoreBlankSplitSymbol ignores white-spaces in continuous symbols and non-ASCII characters.

TokenBigramIgnoreBlankSplitSymbol tokenizes symbols by the bigram tokenize method.

You can see the difference between them with the 日 本 語 ! ! ! text because it has symbols and non-ASCII characters.

Here is a result by TokenBigram:

Execution example:

tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "日"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "本"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "語"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

Here is a result by TokenBigramIgnoreBlankSplitSymbol:

Execution example:

tokenize TokenBigramIgnoreBlankSplitSymbol "日 本 語 ! ! !" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "日本"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "本語"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "語!"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

7.8.3.7. TokenBigramIgnoreBlankSplitSymbolAlpha

TokenBigramIgnoreBlankSplitSymbolAlpha is similar to TokenBigram. The differences between them are the following:

  • Blank handling
  • Symbol and alphabet handling

TokenBigramIgnoreBlankSplitSymbolAlpha ignores white-spaces in continuous symbols and non-ASCII characters.

TokenBigramIgnoreBlankSplitSymbolAlpha tokenizes symbols and alphabets by the bigram tokenize method.

You can see the difference between them with the Hello 日 本 語 ! ! ! text because it has alphabets, symbols and non-ASCII characters with white-spaces.

Here is a result by TokenBigram:

Execution example:

tokenize TokenBigram "Hello 日 本 語 ! ! !" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "hello"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "日"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "本"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "語"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

Here is a result by TokenBigramIgnoreBlankSplitSymbolAlpha:

Execution example:

tokenize TokenBigramIgnoreBlankSplitSymbolAlpha "Hello 日 本 語 ! ! !" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "he"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "el"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "ll"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "lo"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "o日"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "日本"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "本語"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "語!"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 9,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 10,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

7.8.3.8. TokenBigramIgnoreBlankSplitSymbolAlphaDigit

TokenBigramIgnoreBlankSplitSymbolAlphaDigit is similar to TokenBigram. The differences between them are the following:

  • Blank handling
  • Symbol, alphabet and digit handling

TokenBigramIgnoreBlankSplitSymbolAlphaDigit ignores white-spaces in continuous symbols and non-ASCII characters.

TokenBigramIgnoreBlankSplitSymbolAlphaDigit tokenizes symbols, alphabets and digits by the bigram tokenize method. It means that all characters are tokenized by the bigram tokenize method.

You can see the difference between them with the Hello 日 本 語 ! ! ! 777 text because it has alphabets, digits, symbols and non-ASCII characters with white-spaces.

Here is a result by TokenBigram:

Execution example:

tokenize TokenBigram "Hello 日 本 語 ! ! ! 777" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "hello"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "日"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "本"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "語"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "777"
#     }
#   ]
# ]

Here is a result by TokenBigramIgnoreBlankSplitSymbolAlphaDigit:

Execution example:

tokenize TokenBigramIgnoreBlankSplitSymbolAlphaDigit "Hello 日 本 語 ! ! ! 777" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "he"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "el"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "ll"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "lo"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "o日"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "日本"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "本語"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "語!"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 9,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 10,
#       "force_prefix": false,
#       "value": "!7"
#     },
#     {
#       "position": 11,
#       "force_prefix": false,
#       "value": "77"
#     },
#     {
#       "position": 12,
#       "force_prefix": false,
#       "value": "77"
#     },
#     {
#       "position": 13,
#       "force_prefix": false,
#       "value": "7"
#     }
#   ]
# ]

7.8.3.9. TokenUnigram

TokenUnigram is similar to TokenBigram. The difference between them is the token unit. TokenBigram uses 2 characters per token; TokenUnigram uses 1 character per token.

Execution example:

tokenize TokenUnigram "100cents!!!" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "100"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "cents"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "!!!"
#     }
#   ]
# ]

7.8.3.10. TokenTrigram

TokenTrigram is similar to TokenBigram. The difference between them is the token unit. TokenBigram uses 2 characters per token; TokenTrigram uses 3 characters per token.

Execution example:

tokenize TokenTrigram "10000cents!!!!!" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "10000"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "cents"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "!!!!!"
#     }
#   ]
# ]

7.8.3.11. TokenDelimit

TokenDelimit extracts tokens by splitting the text on one or more space characters (U+0020). For example, Hello World is tokenized to Hello and World.

TokenDelimit is suitable for tag text. You can extract groonga, full-text-search and http as tags from groonga full-text-search http.

Here is an example of TokenDelimit:

Execution example:

tokenize TokenDelimit "Groonga full-text-search HTTP" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "groonga"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "full-text-search"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "http"
#     }
#   ]
# ]
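
For example, a tag column could be indexed with TokenDelimit as in the following sketch (the table and column names are only for illustration):

table_create Bookmarks TABLE_NO_KEY
column_create Bookmarks tags COLUMN_SCALAR ShortText
table_create Tags TABLE_PAT_KEY ShortText --default_tokenizer TokenDelimit
column_create Tags bookmarks_tags COLUMN_INDEX Bookmarks tags
load --table Bookmarks
[
{"tags": "groonga full-text-search http"}
]
select Bookmarks --match_columns tags --query "groonga"

Each white-space-separated tag becomes one token, so searching for the exact tag groonga is expected to match this record.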

TokenDelimit also accepts options. TokenDelimit has the delimiter option and the pattern option.

The delimiter option splits tokens with the specified characters.

For example, Hello,World is tokenized to Hello and World with the delimiter option as below.

Execution example:

tokenize 'TokenDelimit("delimiter", ",")' "Hello,World"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "Hello",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "World",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

The delimiter option can also specify multiple delimiters.

For example, Hello, World is tokenized to Hello and World. , and a space character are the delimiters in the example below.

Execution example:

tokenize 'TokenDelimit("delimiter", ",", "delimiter", " ")' "Hello, World"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "Hello",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "World",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

The pattern option splits tokens with a regular expression. You can exclude needless spaces with the pattern option.

For example, This is a pen. This is an apple. is tokenized to This is a pen. and This is an apple. with the pattern option as below.

Normally, when This is a pen. This is an apple. is split by ., a needless space is included at the beginning of “This is an apple.”.

You can exclude the needless space with the pattern option as in the example below.

Execution example:

tokenize 'TokenDelimit("pattern", "\\.\\s*")' "This is a pen. This is an apple."
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "This is a pen.",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "This is an apple.",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

You can extract tokens under complex conditions with the pattern option.

For example, これはペンですか!?リンゴですか?「リンゴです。」 is tokenized to これはペンですか, リンゴですか and 「リンゴです。」 with the pattern option as below.

Execution example:

tokenize 'TokenDelimit("pattern", "([。!?]+(?![)」])|[\\r\\n]+)\\s*")' "これはペンですか!?リンゴですか?「リンゴです。」"
# [
#   [
#     0,
#     1545179416.22277,
#     0.0002887248992919922
#   ],
#   [
#     {
#       "value": "これはペンですか",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "リンゴですか",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "「リンゴです。」",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

\\s* at the end of the above regular expression matches 0 or more spaces after a delimiter.

[。!?]+ matches 1 or more 。, ! or ?. For example, [。!?]+ matches !? of これはペンですか!?.

(?![)」]) is a negative lookahead. (?![)」]) matches if the next character is not ) or 」. A negative lookahead is interpreted in combination with the regular expression just before it.

Therefore it is interpreted as [。!?]+(?![)」]).

[。!?]+(?![)」]) matches if there is no ) or 」 after 。, ! or ?.

In other words, [。!?]+(?![)」]) matches 。 of これはペンですか。. But [。!?]+(?![)」]) doesn’t match 。 of 「リンゴです。」, because there is 」 after 。.

[\\r\\n]+ matches 1 or more newline characters.

In conclusion, ([。!?]+(?![)」])|[\\r\\n]+)\\s* uses 。, !, ? and newline characters as delimiters. However, 。, ! and ? are not delimiters if there is ) or 」 after them.

7.8.3.12. TokenDelimitNull

TokenDelimitNull is similar to TokenDelimit. The difference between them is the separator character. TokenDelimit uses a space character (U+0020) but TokenDelimitNull uses a NUL character (U+0000).

TokenDelimitNull is also suitable for tag text.

Here is an example of TokenDelimitNull:

Execution example:

tokenize TokenDelimitNull "Groonga\u0000full-text-search\u0000HTTP" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "groongau0000full-text-searchu0000http"
#     }
#   ]
# ]

7.8.3.13. TokenMecab

TokenMecab is a tokenizer based on the MeCab part-of-speech and morphological analyzer.

MeCab doesn’t depend on Japanese. You can use MeCab for other languages by creating a dictionary for those languages. You can use the NAIST Japanese Dictionary for Japanese.

You need to install an additional package to use TokenMecab. For details on how to install the additional package, see the installation documentation for each OS.

TokenMecab is good for precision rather than recall. You can find both the 東京都 and 京都 texts by the 京都 query with TokenBigram, but 東京都 isn’t expected. You can find only the 京都 text by the 京都 query with TokenMecab.
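
To see why TokenBigram matches 東京都 by the 京都 query, you can tokenize 東京都 with TokenBigram: the bigram tokenize method produces the tokens 東京, 京都 and 都, so a 京都 token is indexed.

tokenize TokenBigram "東京都" NormalizerAuto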

If you want to support neologisms, you need to keep updating your MeCab dictionary, which has a maintenance cost. (TokenBigram doesn’t require dictionary maintenance because TokenBigram doesn’t use a dictionary.) mecab-ipadic-NEologd : Neologism dictionary for MeCab may help you.

Here is an example of TokenMecab. 東京都 is tokenized to 東京 and 都. They don’t include 京都:

Execution example:

tokenize TokenMecab "東京都"
# [
#   [
#     0,
#     1545812631.661493,
#     0.0002415180206298828
#   ],
#   [
#     {
#       "value": "東京",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "都",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

TokenMecab also accepts options. TokenMecab has the target_class option, include_class option, include_reading option, include_form option and use_reading option.

The target_class option extracts only tokens of the specified part-of-speech. For example, you can extract only nouns as below.

Execution example:

tokenize 'TokenMecab("target_class", "名詞")' '彼の名前は山田さんのはずです。'
# [
#   [
#     0,
#     1545810238.195525,
#     0.0003066062927246094
#   ],
#   [
#     {
#       "value": "彼",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "名前",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "山田",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "さん",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "はず",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

The target_class option can also specify subclasses and exclude or add a specific part-of-speech with the + or - prefix. So you can also extract nouns while excluding non-independent nouns (名詞/非自立) and person-name suffixes (名詞/接尾/人名) as below.

In this way you can exclude noise tokens from your search.

Execution example:

tokenize 'TokenMecab("target_class", "-名詞/非自立", "target_class", "-名詞/接尾/人名", "target_class", "名詞")' '彼の名前は山田さんのはずです。'
# [
#   [
#     0,
#     1545810363.771334,
#     0.0003197193145751953
#   ],
#   [
#     {
#       "value": "彼",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "名前",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "山田",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

7.8.3.14. TokenRegexp

New in version 5.0.1.

Caution

This tokenizer is experimental. Specification may be changed.

Caution

This tokenizer can be used only with UTF-8. You can’t use this tokenizer with EUC-JP, Shift_JIS and so on.

TokenRegexp is a tokenizer for supporting regular expression search by index.

In general, regular expression search is evaluated as sequential search. But the following cases can be evaluated as index search:

  • Literal only case such as hello
  • The beginning of text and literal case such as \A/home/alice
  • The end of text and literal case such as \.txt\z

In most cases, index search is faster than sequential search.
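
Here is a minimal sketch of an index that uses TokenRegexp together with a regular expression search via the @~ operator (the table and column names are only for illustration):

table_create Paths TABLE_NO_KEY
column_create Paths path COLUMN_SCALAR ShortText
table_create RegexpLexicon TABLE_PAT_KEY ShortText --default_tokenizer TokenRegexp --normalizer NormalizerAuto
column_create RegexpLexicon paths_path COLUMN_INDEX|WITH_POSITION Paths path
load --table Paths
[
{"path": "/home/alice/test.txt"}
]
select Paths --filter 'path @~ "alice"'

Because alice is a literal-only pattern, this search is expected to be evaluated against the index instead of as a sequential scan.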

TokenRegexp is based on the bigram tokenize method. TokenRegexp adds a beginning-of-text mark (U+FFEF) at the beginning of the text and an end-of-text mark (U+FFF0) at the end of the text when you index text:

Execution example:

tokenize TokenRegexp "/home/alice/test.txt" NormalizerAuto --mode ADD
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "￯"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "/h"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "ho"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "om"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "me"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "e/"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "/a"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "al"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "li"
#     },
#     {
#       "position": 9,
#       "force_prefix": false,
#       "value": "ic"
#     },
#     {
#       "position": 10,
#       "force_prefix": false,
#       "value": "ce"
#     },
#     {
#       "position": 11,
#       "force_prefix": false,
#       "value": "e/"
#     },
#     {
#       "position": 12,
#       "force_prefix": false,
#       "value": "/t"
#     },
#     {
#       "position": 13,
#       "force_prefix": false,
#       "value": "te"
#     },
#     {
#       "position": 14,
#       "force_prefix": false,
#       "value": "es"
#     },
#     {
#       "position": 15,
#       "force_prefix": false,
#       "value": "st"
#     },
#     {
#       "position": 16,
#       "force_prefix": false,
#       "value": "t."
#     },
#     {
#       "position": 17,
#       "force_prefix": false,
#       "value": ".t"
#     },
#     {
#       "position": 18,
#       "force_prefix": false,
#       "value": "tx"
#     },
#     {
#       "position": 19,
#       "force_prefix": false,
#       "value": "xt"
#     },
#     {
#       "position": 20,
#       "force_prefix": false,
#       "value": "t"
#     },
#     {
#       "position": 21,
#       "force_prefix": false,
#       "value": "￰"
#     }
#   ]
# ]