elasticsearch - Subsequence search with Elasticsearch

I have a simple index set up in Elasticsearch, and I would like to perform a "GitHub-like" subsequence search on it:

{
  "files" : {
    "aliases" : { },
    "mappings" : {
      "file" : {
        "properties" : {
          "name" : {
            "type" : "text",
            "analyzer" : "simple"
          }
        }
      }
    }
  }
}

Then I add a document:

curl -XPOST 'localhost:9200/files/file' -H 'Content-Type: application/json' -d'
{
  "name": "/my/path/to/file.txt"
}
'

When I query with
"query": { "match": {"name": {"query": "mypath", "fuzziness": "AUTO" }} }
I get the file as expected. However, if I query with
"query": { "match": {"name": {"query": "mypathto", "fuzziness": "AUTO" }} }
the file is no longer returned.

Basically, I would like every subsequence of the document to match, for example:

mat/t => /my/path/to/file.txt

mx => /my/path/to/file.txt

mypathtofiletxt => /my/path/to/file.txt

Best answer

About fuzziness

The simple analyzer splits your file name into a set of small tokens:

curl -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "simple",
  "text": "/my/path/to/file.txt"
}'

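This means that your index contains the following terms: ["my", "path", "to", "file", "txt"]

The match query takes the query text and analyzes it with the same analyzer that was used at index time, i.e.: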
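
The tokenization can also be approximated locally, without a running cluster. The following is only a rough sketch in Python: `simple_analyze` is a hypothetical helper, not part of any Elasticsearch client, and it assumes ASCII input, whereas the real simple analyzer splits on any non-letter character and is Unicode-aware.

```python
import re


def simple_analyze(text):
    """Approximate the `simple` analyzer: lowercase, then split on non-letters.

    Assumption: ASCII letters only; the real analyzer handles Unicode letters.
    """
    return re.findall(r"[a-z]+", text.lower())


# Terms produced for the indexed document and for the failing query text.
indexed_terms = simple_analyze("/my/path/to/file.txt")
query_terms = simple_analyze("mypathto")
print(indexed_terms)
print(query_terms)
```

The query text contains no separators, so it stays a single term, while the file name is broken into five short ones.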

curl -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "simple",
  "text": "mypathto"
}'

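So in your case the match query tries to find the single term mypathto. The mypath query works because the fuzzy query matches the term mypath against the indexed term path: the two differ only by the 2 characters my, i.e. 2 edits, which is the maximum that AUTO allows. The term mypathto is more than 2 edits away from every indexed term, so nothing is returned. As a solution, you can use the following query: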

curl -XGET "http://localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match": {"name": {"query": "/my/path/to", "fuzziness": "AUTO" }} }
}'
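
The edit-distance argument above can be checked with a few lines of Python. This is a minimal sketch, not Elasticsearch's implementation: `edit_distance` is a hypothetical helper computing the plain Levenshtein distance, whereas Elasticsearch also counts a transposition as a single edit by default. For these particular strings the two measures should agree. `MAX_EDITS = 2` reflects what AUTO permits for terms longer than 5 characters.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


INDEXED_TERMS = ["my", "path", "to", "file", "txt"]
MAX_EDITS = 2  # what "AUTO" allows for query terms longer than 5 characters

for query_term in ("mypath", "mypathto"):
    distances = {t: edit_distance(query_term, t) for t in INDEXED_TERMS}
    matches = [t for t, d in distances.items() if d <= MAX_EDITS]
    print(query_term, distances, "matches:", matches)
```

Only mypath has an indexed term within 2 edits (path); for mypathto the closest term is still path, at 4 edits.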

GitHub-like query

There is no ready-made solution for this, but you can do the following:

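Index the field as a keyword: not analyzed, i.e. kept as is

Use only a lowercase normalizer (instead of an analyzer)

Use a wildcard query with * between every character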

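Note that a wildcard query can be slow, because it has to iterate over many terms (internally it uses http://www.brics.dk/automaton/, which is optimal and fast for small indices), so performance depends on the number of unique terms with similar subsequences. To optimize performance, you can use a separate index for every project. Here is a simple example: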

PUT test
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "file": {
      "properties": {
        "name": {
          "type": "keyword",
          "normalizer": "lowercase_normalizer"
        }
      }
    }
  }
}

PUT test/file/1
{
  "name": "/my/path/to/file.txt"
}

GET test/_search
{
  "query": { "wildcard": {"name": {"value": "*m*a*t*/*t*"}} }
}

GET test/_search
{
  "query": { "wildcard": {"name": {"value": "*m*x*"}} }
}


GET test/_search
{
  "query": { "wildcard": {"name": {"value": "*m*y*p*a*t*h*t*o*f*i*l*e*t*x*t*"}} }
}
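
To get a feel for which patterns should hit the stored value, the three wildcard queries can be simulated locally. This is only an illustration: it uses Python's `fnmatch.fnmatchcase`, a shell-style matcher, and assumes it behaves like the Elasticsearch wildcard query for patterns that contain nothing but literal characters and `*`. The fourth pattern is an extra, made-up negative case.

```python
from fnmatch import fnmatchcase

# The stored keyword value (already lowercase, as the normalizer would leave it).
name = "/my/path/to/file.txt"

patterns = [
    "*m*a*t*/*t*",
    "*m*x*",
    "*m*y*p*a*t*h*t*o*f*i*l*e*t*x*t*",
    "*x*m*",  # hypothetical negative case: no "m" occurs after the "x"
]

# Each pattern matches only if its characters appear in the name in order.
results = {pattern: fnmatchcase(name, pattern) for pattern in patterns}
for pattern, matched in results.items():
    print(pattern, "->", matched)
```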

Also note that wildcard queries do not support analyzers/normalizers, so you have to lowercase your request on the client.
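
On the client side, building such a pattern could look like the sketch below. The helper names `subsequence_wildcard` and `build_query` are hypothetical, as is the default field name `name` taken from the mapping above. Escaping `*`, `?` and `\` with a backslash is an assumption based on Lucene's wildcard syntax; verify it against your Elasticsearch version before relying on it.

```python
import json


def subsequence_wildcard(user_input):
    """Lowercase the input and surround every character with *."""
    chars = []
    for ch in user_input.lower():
        # Assumption: a backslash makes wildcard metacharacters literal.
        chars.append("\\" + ch if ch in "*?\\" else ch)
    return "*" + "*".join(chars) + "*"


def build_query(user_input, field="name"):
    """Build the request body for a wildcard query on the given field."""
    return {"query": {"wildcard": {field: {"value": subsequence_wildcard(user_input)}}}}


# json.dumps takes care of JSON-escaping any backslashes in the pattern.
print(json.dumps(build_query("Mat/T")))
```

The resulting body can be sent to the _search endpoint in the same way as the requests above.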

Regarding elasticsearch - Subsequence search with Elasticsearch, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/47059439/