7.15.15. language_model_knn#
Added in version 15.1.8.
Note
This is an experimental feature. It is not yet stable.
7.15.15.1. Summary#
language_model_knn is a function for semantic search.
Semantic search uses the k-Nearest Neighbors (k-NN) algorithm.
You must use it with TokenLanguageModelKNN.
It can be used as a condition for --filter and as a sort key for --sort_keys.
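The k-NN idea behind this can be sketched in a few lines of Python: each text is mapped to an embedding vector, and the records whose vectors are most similar to the query's vector are returned. The vectors below are made up for illustration (a real model produces hundreds of dimensions), and the helper names are hypothetical, not Groonga's API:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def knn_search(query_vector, records, k=2):
    # Rank all records by similarity to the query vector and keep the top k.
    scored = [(cosine_similarity(query_vector, vector), text)
              for text, vector in records]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

# Hypothetical 3-D embeddings for the sample texts used later in this section.
records = [
    ("I am a boy.", [0.9, 0.1, 0.0]),
    ("This is an apple.", [0.1, 0.8, 0.1]),
    ("Groonga is a full text search engine.", [0.0, 0.2, 0.9]),
]
query = [0.85, 0.15, 0.05]  # made-up embedding for "male child"
print(knn_search(query, records, k=1))  # -> ['I am a boy.']
```

In Groonga, the embedding generation and the nearest-neighbor lookup are handled by the language model and the Faiss-backed index; this sketch only shows the underlying ranking idea.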
To enable this function, register the language_model/knn plugin with the following command:
plugin_register language_model/knn
7.15.15.2. Syntax#
language_model_knn requires two parameters:
language_model_knn(column, query)
column is the search target column. It must be a column with an index.
query is a search query.
7.15.15.3. Requirements#
You need a Faiss-enabled Groonga. The official packages enable Faiss.
7.15.15.4. Usage#
First, you need to register the language_model/knn plugin:
Execution example:
plugin_register language_model/knn
# [[0,1337566253.89858,0.000355720520019531],true]
Here is a schema definition and sample data.
Sample schema:
Execution example:
table_create --name Memos --flags TABLE_NO_KEY
# [[0,1337566253.89858,0.000355720520019531],true]
column_create \
--table Memos \
--name content \
--flags COLUMN_SCALAR \
--type ShortText
# [[0,1337566253.89858,0.000355720520019531],true]
Sample data:
Execution example:
load --table Memos
[
{"content": "I am a boy."},
{"content": "This is an apple."},
{"content": "Groonga is a full text search engine."}
]
# [[0,1337566253.89858,0.000355720520019531],3]
You need a column that stores embedding information for each record. Here is how to create it.
Execution example:
column_create Memos embedding_code COLUMN_SCALAR ShortBinary
# [[0,1337566253.89858,0.000355720520019531],true]
Create an index for semantic search.
Specify TokenLanguageModelKNN as the tokenizer.
The tokenizer's arguments are model and code_column.
Specify the model to use as model, and the column that stores the generated embedding information as code_column.
Execution example:
table_create Centroids TABLE_HASH_KEY ShortBinary \
--default_tokenizer \
'TokenLanguageModelKNN("model", "hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF", \
"code_column", "embedding_code")'
# [[0,1337566253.89858,0.000355720520019531],true]
column_create Centroids data_content COLUMN_INDEX Memos content
# [[0,1337566253.89858,0.000355720520019531],true]
This enables semantic search.
When you load data into Memos.content, Groonga automatically generates embeddings.
You do not need to generate them yourself.
Here is an example of semantic search:
Execution example:
select Memos \
--filter 'language_model_knn(content, "male child")' \
--output_columns content
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 3
# ],
# [
# [
# "content",
# "ShortText"
# ]
# ],
# [
# "I am a boy."
# ],
# [
# "This is an apple."
# ],
# [
# "Groonga is a full text search engine."
# ]
# ]
# ]
# ]
The language_model_knn function can also be used as a sort key.
Specify language_model_knn for --sort_keys.
Since you usually want to fetch results in descending order of similarity, add a - prefix to the sort key.
Here is an example of filtering by _id and then sorting by similarity:
Execution example:
select Memos \
--filter '_id < 3' \
--sort_keys '-language_model_knn(content, "male child")' \
--output_columns content
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 2
# ],
# [
# [
# "content",
# "ShortText"
# ]
# ],
# [
# "I am a boy."
# ],
# [
# "This is an apple."
# ]
# ]
# ]
# ]
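The effect of the - prefix can be sketched in Python: sorting by the negated similarity score puts the most similar records first. The scores below are made up for illustration:

```python
# Hypothetical similarity scores for the two records that passed '_id < 3'.
scored = [
    ("This is an apple.", 0.30),
    ("I am a boy.", 0.99),
]

# Negating the score, like the '-' prefix in --sort_keys, yields an
# ascending sort of -score, i.e. a descending sort of the score itself.
ordered = sorted(scored, key=lambda pair: -pair[1])
print([text for text, _ in ordered])  # -> ['I am a boy.', 'This is an apple.']
```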
7.15.15.5. Parameters#
There are two required parameters.
7.15.15.5.1. column#
column is the search target column. It must be a column with an index.
7.15.15.5.2. query#
query is a search query.
7.15.15.6. Return value#
This function works as a selector. This means that Groonga can execute it efficiently by using an index.