TokenMecab¶
Summary¶
TokenMecab is a tokenizer based on the MeCab morphological analyzer.
MeCab does not depend on Japanese. You can use MeCab for other languages by preparing a dictionary for that language. For Japanese, you can use the NAIST Japanese Dictionary.
To use TokenMecab, you need to install an additional package. For details on how to install it, see the installation instructions for each OS.
TokenMecab favors precision over recall. With TokenBigram, the query 京都 finds both 東京都 and 京都, but 東京都 is not an expected result in this case. With TokenMecab, the query 京都 finds only 京都.
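To make the precision difference concrete, here is a minimal Python sketch. It does not show how Groonga implements either tokenizer: bigrams() is a simplified bigram splitter, and WORD_SEGMENTATION is a hypothetical hard-coded stand-in for a MeCab dictionary lookup that reuses the segmentations shown in this document.

```python
def bigrams(text):
    """Split text into overlapping 2-character tokens (simplified bigram tokenizer)."""
    if len(text) < 2:
        return [text]
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Hypothetical stand-in for a dictionary-based tokenizer such as MeCab.
# The segmentations are hard-coded; no real dictionary is consulted.
WORD_SEGMENTATION = {"東京都": ["東京", "都"], "京都": ["京都"]}


def words(text):
    """Return the hard-coded word segmentation of text."""
    return WORD_SEGMENTATION[text]


def search(query, documents, tokenize):
    """Return documents whose token sequence contains the query's token sequence."""
    query_tokens = tokenize(query)
    width = len(query_tokens)
    hits = []
    for document in documents:
        tokens = tokenize(document)
        if any(tokens[i:i + width] == query_tokens
               for i in range(len(tokens) - width + 1)):
            hits.append(document)
    return hits


documents = ["東京都", "京都"]
bigram_hits = search("京都", documents, bigrams)  # matches the bigram 京都 inside 東京都
word_hits = search("京都", documents, words)      # 東京都 has no 京都 token
print(bigram_hits)
print(word_hits)
```

In this sketch the bigram search returns both documents, while the word-based search returns only 京都. That is the trade-off described above: fewer unexpected hits in exchange for depending on a dictionary.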
If you want to support new words, you need to keep updating your MeCab dictionary, which has a maintenance cost. (TokenBigram has no dictionary maintenance cost because it doesn't use a dictionary.) mecab-ipadic-NEologd : Neologism dictionary for MeCab may help you support new words.
Syntax¶
TokenMecab has optional parameters:
TokenMecab
TokenMecab("include_class", true)
TokenMecab("target_class", "PART_OF_SPEECH")
TokenMecab("include_reading", true)
TokenMecab("include_form", true)
TokenMecab("use_reading", true)
Usage¶
Simple usage¶
Here is an example of TokenMecab. 東京都 is tokenized into 東京 and 都. There is no 京都 token.
Execution example:
tokenize TokenMecab "東京都"
# [
# [
# 0,
# 1545812631.661493,
# 0.0002415180206298828
# ],
# [
# {
# "value": "東京",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "都",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
You can also specify options for TokenMecab. It has the target_class, include_class, include_reading, include_form, and use_reading options.
The target_class option keeps only tokens with the specified part of speech. For example, you can search only nouns as below.
Execution example:
tokenize 'TokenMecab("target_class", "名詞")' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1545810238.195525,
# 0.0003066062927246094
# ],
# [
# {
# "value": "彼",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "名前",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "山田",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "さん",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "はず",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
The include_class option outputs the class and subclasses reported by MeCab as token metadata, as below.
Execution example:
tokenize 'TokenMecab("include_class", true)' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1545892715.887472,
# 0.03757452964782715
# ],
# [
# {
# "value": "彼",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "名詞",
# "subclass0": "代名詞",
# "subclass1": "一般"
# }
# },
# {
# "value": "の",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "助詞",
# "subclass0": "連体化"
# }
# },
# {
# "value": "名前",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "名詞",
# "subclass0": "一般"
# }
# },
# {
# "value": "は",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "助詞",
# "subclass0": "係助詞"
# }
# },
# {
# "value": "山田",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "名詞",
# "subclass0": "固有名詞",
# "subclass1": "人名",
# "subclass2": "姓"
# }
# },
# {
# "value": "さん",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "名詞",
# "subclass0": "接尾",
# "subclass1": "人名"
# }
# },
# {
# "value": "の",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "助詞",
# "subclass0": "連体化"
# }
# },
# {
# "value": "はず",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "名詞",
# "subclass0": "非自立",
# "subclass1": "一般"
# }
# },
# {
# "value": "です",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "助動詞"
# }
# },
# {
# "value": "。",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "記号",
# "subclass0": "句点"
# }
# }
# ]
# ]
You can use the class and subclass values that this option outputs to decide which tokens to exclude with target_class.
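As a rough illustration of that workflow, the following Python sketch turns include_class metadata into slash-separated class paths of the form that the target_class examples in this document use, such as 名詞/接尾/人名. The RESPONSE string is a shortened copy of the output above, and class_path() is a hypothetical helper, not part of Groonga.

```python
import json

# Shortened copy of a tokenize response with include_class enabled.
RESPONSE = """
[
  [0, 1545892715.887472, 0.03757452964782715],
  [
    {"value": "山田", "position": 4,
     "metadata": {"class": "名詞", "subclass0": "固有名詞",
                  "subclass1": "人名", "subclass2": "姓"}},
    {"value": "さん", "position": 5,
     "metadata": {"class": "名詞", "subclass0": "接尾", "subclass1": "人名"}},
    {"value": "です", "position": 8,
     "metadata": {"class": "助動詞"}}
  ]
]
"""


def class_path(metadata):
    """Join class, subclass0, subclass1, ... into one slash-separated path."""
    parts = [metadata["class"]]
    index = 0
    while f"subclass{index}" in metadata:
        parts.append(metadata[f"subclass{index}"])
        index += 1
    return "/".join(parts)


# The response is a two-element array: a header and the list of tokens.
header, tokens = json.loads(RESPONSE)
paths = {token["value"]: class_path(token["metadata"]) for token in tokens}
for value, path in paths.items():
    print(value, path)
```

A path built this way, for example 名詞/接尾/人名 for さん, can then be prefixed with - and passed to target_class to drop that kind of token.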
The include_reading option outputs the reading reported by MeCab as token metadata, as below.
Execution example:
tokenize 'TokenMecab("include_reading", true)' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1545892913.226588,
# 0.0003414154052734375
# ],
# [
# {
# "value": "彼",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "カレ"
# }
# },
# {
# "value": "の",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ノ"
# }
# },
# {
# "value": "名前",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ナマエ"
# }
# },
# {
# "value": "は",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ハ"
# }
# },
# {
# "value": "山田",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ヤマダ"
# }
# },
# {
# "value": "さん",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "サン"
# }
# },
# {
# "value": "の",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ノ"
# }
# },
# {
# "value": "はず",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ハズ"
# }
# },
# {
# "value": "です",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "デス"
# }
# },
# {
# "value": "。",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "。"
# }
# }
# ]
# ]
With this option, you can get the reading of each token.
The include_form option outputs inflected_type, inflected_form and base_form reported by MeCab as token metadata, as below.
Execution example:
tokenize 'TokenMecab("include_form", true)' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1545892987.209944,
# 0.0004286766052246094
# ],
# [
# {
# "value": "彼",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "彼"
# }
# },
# {
# "value": "の",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "の"
# }
# },
# {
# "value": "名前",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "名前"
# }
# },
# {
# "value": "は",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "は"
# }
# },
# {
# "value": "山田",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "山田"
# }
# },
# {
# "value": "さん",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "さん"
# }
# },
# {
# "value": "の",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "の"
# }
# },
# {
# "value": "はず",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "はず"
# }
# },
# {
# "value": "です",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "inflected_type": "特殊・デス",
# "inflected_form": "基本形",
# "base_form": "です"
# }
# },
# {
# "value": "。",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "。"
# }
# }
# ]
# ]
The use_reading option supports search by kana. It outputs the reading of each token as the token value, which helps with orthographic variants because spellings that share a reading produce the same token.
Execution example:
tokenize 'TokenMecab("use_reading", true)' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1545893087.556662,
# 0.0003693103790283203
# ],
# [
# {
# "value": "カレ",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ノ",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ナマエ",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ハ",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ヤマダ",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "サン",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ノ",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ハズ",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "デス",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "。",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
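The following Python sketch illustrates why indexing by reading helps with orthographic variants. It is a toy model, not Groonga's implementation: READINGS is a hypothetical lookup table that reuses readings from the output above, and tokens missing from the table are assumed to be kana, which to_katakana() normalizes by shifting hiragana code points.

```python
# Hypothetical reading table; the values are taken from the output above.
READINGS = {"彼": "カレ", "名前": "ナマエ", "山田": "ヤマダ"}


def to_katakana(text):
    """Convert hiragana to katakana by code point shift; keep other characters."""
    return "".join(
        chr(ord(ch) + 0x60) if "ぁ" <= ch <= "ゖ" else ch
        for ch in text
    )


def reading_key(token):
    """Return the katakana reading used as the index key for a token."""
    return READINGS.get(token, to_katakana(token))


# Tiny inverted index keyed by reading instead of surface form.
index = {}
for doc_id, doc_tokens in enumerate([["彼"], ["山田"]]):
    for token in doc_tokens:
        index.setdefault(reading_key(token), set()).add(doc_id)


def search(query_token):
    """Return the sorted IDs of documents that contain the query's reading."""
    return sorted(index.get(reading_key(query_token), set()))


kanji_hits = search("彼")
hiragana_hits = search("かれ")
katakana_hits = search("カレ")
print(kanji_hits, hiragana_hits, katakana_hits)
```

All three spellings resolve to the key カレ, so each query finds the same document.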
Advanced usage¶
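The target_class option also accepts a subclass, and you can add or exclude specific parts of speech with + and -. So you can search only nouns, excluding non-independent words and person-name suffixes, as below.
In this way, you can search while excluding tokens that would only add noise.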
Execution example:
tokenize 'TokenMecab("target_class", "-名詞/非自立", "target_class", "-名詞/接尾/人名", "target_class", "名詞")' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1545810363.771334,
# 0.0003197193145751953
# ],
# [
# {
# "value": "彼",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "名前",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "山田",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
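The include and exclude rules used in the command above can be pictured with the following Python sketch. The semantics are an assumption made for illustration (a token is kept when at least one include rule matches and no exclude rule matches); Groonga's actual rule evaluation may differ. The class paths in TOKENS are taken from the include_class output in this document.

```python
# (surface form, class path) pairs from the include_class output in this document.
TOKENS = [
    ("彼", "名詞/代名詞/一般"),
    ("の", "助詞/連体化"),
    ("名前", "名詞/一般"),
    ("は", "助詞/係助詞"),
    ("山田", "名詞/固有名詞/人名/姓"),
    ("さん", "名詞/接尾/人名"),
    ("の", "助詞/連体化"),
    ("はず", "名詞/非自立/一般"),
    ("です", "助動詞"),
    ("。", "記号/句点"),
]


def matches(class_path, rule):
    """Return True if rule is a level-by-level prefix of class_path."""
    path_parts = class_path.split("/")
    rule_parts = rule.split("/")
    return path_parts[:len(rule_parts)] == rule_parts


def filter_tokens(tokens, rules):
    """Assumed semantics: keep a token if an include rule matches and no exclude rule does."""
    includes = [r[1:] if r.startswith("+") else r
                for r in rules if not r.startswith("-")]
    excludes = [r[1:] for r in rules if r.startswith("-")]
    kept = []
    for value, class_path in tokens:
        included = any(matches(class_path, rule) for rule in includes)
        excluded = any(matches(class_path, rule) for rule in excludes)
        if included and not excluded:
            kept.append(value)
    return kept


nouns = filter_tokens(TOKENS, ["名詞"])
filtered = filter_tokens(TOKENS, ["-名詞/非自立", "-名詞/接尾/人名", "名詞"])
print(nouns)
print(filtered)
```

Under these assumed semantics, the single rule 名詞 keeps 彼, 名前, 山田, さん and はず, and adding the two exclude rules leaves 彼, 名前 and 山田, which agrees with the outputs shown in this document.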
In addition, you can combine these rules with the include_reading option to get the readings of the tokens that remain after the noise is excluded, as below.
Execution example:
tokenize 'TokenMecab("target_class", "-名詞/非自立", "target_class", "-名詞/接尾/人名", "target_class", "名詞", "include_reading", true)' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1545893197.914959,
# 0.0003139972686767578
# ],
# [
# {
# "value": "彼",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "カレ"
# }
# },
# {
# "value": "名前",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ナマエ"
# }
# },
# {
# "value": "山田",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ヤマダ"
# }
# }
# ]
# ]
Parameters¶
Optional parameters¶
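The optional parameters are include_class, target_class, include_reading, include_form, and use_reading.
include_class¶
Outputs the class and subclasses reported by MeCab as token metadata.
target_class¶
Outputs only tokens with the specified part of speech. You can also specify a subclass, and add or exclude a part of speech with + or -.
include_reading¶
Outputs the reading reported by MeCab as token metadata.
include_form¶
Outputs inflected_type, inflected_form and base_form reported by MeCab as token metadata.
use_reading¶
Outputs the reading of each token as the token value, which enables search by kana.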