Question
Is there a plugin that enables LSH (locality-sensitive hashing) on Elasticsearch? If so, could you point me to it and briefly explain how to use it? Thanks.
I found out that there is a MinHash plugin for ES. How can I use it to compare documents with one another? What would be good settings for finding duplicates?
Recommended answer
There is an Elasticsearch MinHash plugin. You can use it to compute a minhash value each time you index a document, and then query documents by that minhash value later.
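To give a rough idea of what a minhash value represents, here is a minimal Python sketch of the general MinHash technique. It is not the plugin's implementation: the function names (`minhash_signature`, `estimate_jaccard`), the 64 seeded BLAKE2 hash functions, and the whitespace tokenization are all assumptions chosen for illustration.

```python
import hashlib


def _hash(token: str, seed: int) -> int:
    """Hash a token with a seed, so each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes: int = 64):
    """Return, for each hash function, the minimum hash over the token set."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("need at least one token")
    return [min(_hash(t, seed) for t in token_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b) -> float:
    """The fraction of agreeing positions estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical example documents (not from the original question).
doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox jumps over the lazy cat".split()
doc_c = "completely unrelated text about something else".split()

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print("near-duplicate:", estimate_jaccard(sig_a, sig_b))
print("unrelated:     ", estimate_jaccard(sig_a, sig_c))
```

Documents that share most of their tokens agree in most signature positions, which is why similar documents end up with similar minhash values.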
Install the MinHash plugin:
$ $ES_HOME/bin/plugin install org.codelibs/elasticsearch-minhash/2.3.1
Add a minhash analyzer when creating your index:
$ curl -XPUT 'localhost:9200/my_index' -d '{
  "index":{
    "analysis":{
      "analyzer":{
        "minhash_analyzer":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":["minhash"]
        }
      }
    }
  }
}'
Put the minhash_value field into an index mapping:
$ curl -XPUT "localhost:9200/my_index/my_type/_mapping" -d '{
  "my_type":{
    "properties":{
      "message":{
        "type":"string",
        "copy_to":"minhash_value"
      },
      "minhash_value":{
        "type":"minhash",
        "minhash_analyzer":"minhash_analyzer"
      }
    }
  }
}'
a. A More Like This query can be used to do a "like" search on the minhash_value field:
GET /_search
{
  "query": {
    "more_like_this" : {
      "fields" : ["minhash_value"],
      "like" : "KV5rsUfZpcZdVojpG8mHLA==",
      "min_term_freq" : 1,
      "max_query_terms" : 12
    }
  }
}
b. You can also use a fuzzy query, but it only matches values that differ from the query value by an edit distance of at most 2:
GET /_search
{
  "query": {
    "fuzzy" : { "minhash_value" : "KV5rsUfZpcZdVojpG8mHLA==" }
  }
}
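To see how restrictive a limit of 2 edits is on a 24-character minhash string, the following Python sketch computes the plain Levenshtein distance between values. It is only an illustration: `edit_distance` is a helper written for this example, the `near` and `far` values are made up, and Elasticsearch's fuzzy query may count edits slightly differently (for instance, it can treat a swap of two adjacent characters as a single edit).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


MAX_FUZZY_EDITS = 2  # the limit mentioned in the answer

stored = "KV5rsUfZpcZdVojpG8mHLA=="  # value from the queries above
near = "KV5rsUfZpcZdVojpG8mHLQ=="    # hypothetical value, one character changed
far = "KV6rsUfZpdZdVojpG9mHLA=="     # hypothetical value, three characters changed

for candidate in (near, far):
    distance = edit_distance(stored, candidate)
    print(candidate, distance, "within limit:", distance <= MAX_FUZZY_EDITS)
```

Because the limit applies to characters of the encoded string, a document whose minhash differs in just a few scattered positions can already fall outside what the fuzzy query will match.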
You can find more about the fuzzy query in the Elasticsearch documentation.
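For comparing two documents directly, one option is to fetch their minhash values and compare them on the client side. The Python sketch below assumes that each value is a base64-encoded bit string in which every bit comes from one hash function, so the fraction of matching bits approximates document similarity. That encoding is an assumption about the plugin and should be checked against its documentation; `bit_similarity` is a helper written for this example, and any duplicate threshold would have to be chosen empirically on your own data.

```python
import base64


def bit_similarity(value_a: str, value_b: str) -> float:
    """Fraction of matching bits between two base64-encoded minhash values."""
    bytes_a = base64.b64decode(value_a)
    bytes_b = base64.b64decode(value_b)
    if len(bytes_a) != len(bytes_b):
        raise ValueError("minhash values must have the same length")
    # XOR leaves a 1 bit wherever the two values disagree.
    differing = sum(bin(x ^ y).count("1") for x, y in zip(bytes_a, bytes_b))
    total_bits = len(bytes_a) * 8
    return 1.0 - differing / total_bits


stored = "KV5rsUfZpcZdVojpG8mHLA=="  # value from the queries above

# Build a hypothetical second value by flipping a single bit of the first.
raw = bytearray(base64.b64decode(stored))
raw[0] ^= 0x01
other = base64.b64encode(bytes(raw)).decode("ascii")

print(bit_similarity(stored, stored))
print(bit_similarity(stored, other))
```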