This article explains how to sort Elasticsearch results so that exact matches rank first and "starts with" (prefix) matches rank next.

Problem description

I need to improve the result list on search with Elasticsearch.

Let's say we have 3 documents with a single field and content like this:

  • "apple"
  • "green apple"
  • "apple tree"

If I search for "apple", I may get the results sorted like this:

  • "green apple"
  • "apple tree"
  • "apple"

But what I want is for the exact match to have the highest score; here that is the document with "apple".

The next highest scores should go to the entries beginning with the search word, here "apple tree", and the rest should be sorted the default way.

So I want the order to be:

  • "apple"
  • "apple tree"
  • "green apple"

I have tried to achieve it by using rescore:

curl -X GET "http://localhost:9200/my_index_name/_search?size=10&pretty" -H 'Content-Type: application/json' -d'
{
   "query": {
      "query_string": {
          "query": "apple"
      }
   },
   "rescore": {
      "window_size": 500,
      "query": {
         "score_mode": "multiply",
         "rescore_query": {
            "bool": {
               "should": [
                  {
                     "match": {
                        "my_field1": {
                           "query": "apple",
                           "boost": 4
                        }
                     }
                  },
                  {
                     "match": {
                        "my_field1": {
                           "query": "apple*",
                           "boost": 2
                        }
                     }
                  }
               ]
            }
         },
         "query_weight": 0.7,
         "rescore_query_weight": 1.2
      }
   }
}'

But this does not really work, because Elasticsearch seems to split all text into words at whitespace. For example, a search for "apple*" also returns "green apple". That seems to be the reason why rescore is not working for me.

There may also be other characters, such as ".", "-" and ";", that Elasticsearch splits on and that mess up my sorting.
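
A rough way to see why "apple*" also returns "green apple": a match query analyzes its input rather than treating * as a wildcard. The Python sketch below only approximates the default standard analyzer with a regular expression (the real analyzer follows Unicode text segmentation rules), and the helper name approx_standard_tokens is made up for illustration.

```python
import re

def approx_standard_tokens(text):
    """Rough stand-in for the standard analyzer: lowercase, keep word characters only."""
    return re.findall(r"\w+", text.lower())

docs = ["apple", "green apple", "apple tree"]

# The trailing * is dropped during analysis, so the query term is just "apple".
query_terms = approx_standard_tokens("apple*")
print(query_terms)

# Every document that contains the token "apple" matches, whatever its position.
matches = [d for d in docs if set(query_terms) & set(approx_standard_tokens(d))]
print(matches)
```

Under this approximation all three documents match, so the boost attached to the "apple*" clause cannot separate "apple tree" from "green apple".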

I also played around with "match_phrase" in "rescore_query" instead of "bool", but without success.

I have also tried it with only one match clause:

curl -X GET "http://localhost:9200/my_index_name/_search?size=10&pretty" -H 'Content-Type: application/json' -d'
{
   "query": {
      "query_string": {
          "query": "apple"
      }
   },
   "rescore": {
      "window_size": 500,
      "query": {
         "score_mode": "multiply",
         "rescore_query": {
            "bool": {
               "should": [
                  {
                     "match": {
                        "my_field1": {
                           "query": "apple*",
                           "boost": 2
                        }
                     }
                  }
               ]
            }
         },
         "query_weight": 0.7,
         "rescore_query_weight": 1.2
      }
   }
}'

And it seems to work, but I am still not sure. Would this be the correct way to do it?

EDIT1: With other queries, the single-match rescore does not work correctly.

Solution

The only place where you need to manipulate the score is the exact match; otherwise, ordering by the position of terms gives you the correct order. Let's work through this:

Let's first create a mapping as below:

PUT test
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field1": {
          "type": "text",
          "analyzer": "whitespace",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

I have created the field my_field1 with the whitespace analyzer to make sure tokens are created using whitespace as the only delimiter. Secondly, I have created a subfield named keyword of type keyword. The keyword subfield holds the non-analyzed value of the input string, and we'll use it for the exact match.

Let's add a few docs to the index:
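
The difference between the two fields can be sketched in plain Python, using the three example values from the question. This is only a simulation of what ends up in the index, not Elasticsearch code, and the function names whitespace_tokens and keyword_value are hypothetical.

```python
def whitespace_tokens(text):
    """Mimic the whitespace analyzer: split on whitespace only, no lowercasing."""
    return text.split()

def keyword_value(text):
    """Mimic a keyword field: the whole input is kept as one term."""
    return [text]

docs = {1: "apple", 2: "apple tree", 3: "green apple"}

for doc_id, text in docs.items():
    print(doc_id, whitespace_tokens(text), keyword_value(text))

# Which documents contain the term "apple" in each field?
text_hits = [i for i, t in docs.items() if "apple" in whitespace_tokens(t)]
keyword_hits = [i for i, t in docs.items() if "apple" in keyword_value(t)]
print(text_hits)     # documents matched through my_field1
print(keyword_hits)  # documents matched through my_field1.keyword
```

In this simulation the analyzed field matches all three documents, while the keyword subfield matches only the document whose entire value is apple.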

PUT test/_doc/1
{
  "my_field1": "apple"
}

PUT test/_doc/2
{
  "my_field1": "apple tree"
}

PUT test/_doc/3
{
  "my_field1": "green apple"
}

If you use the query below to search for the term apple, the order of the docs will be 2, 1, 3.

POST test/_doc/_search
{
  "explain": true,
  "query": {
    "query_string": {
      "query": "apple",
      "fields": [
        "my_field1"
      ]
    }
  }
}

"explain": true in the above query includes the score calculation steps in the output. Reading them will give you insight into how a document is scored.

All we need to do is boost the score for the exact match. We'll run the exact match against the field my_field1.keyword. You might ask why not my_field1. The reason is that my_field1 is analyzed: when tokens are generated for the input strings of the 3 docs, each doc has a token (term) apple stored against this field (along with other terms if present, e.g. tree for doc 2 and green for doc 3). When we run an exact match on this field for the term apple, all docs match, with a similar effect on each document's score, so the order does not change. Since only one document has the exact value apple in my_field1.keyword, that document (doc 1) is the only match for the exact query, and we'll boost it. So the query will be:

{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "query": "apple",
            "fields": [
              "my_field1"
            ]
          }
        },
        {
          "query_string": {
            "query": "\"apple\"",
            "fields": [
              "my_field1.keyword^2"
            ]
          }
        }
      ]
    }
  }
}

Output for the above query:

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.7260925,
    "hits": [
      {
        "_index": "test3",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.7260925,
        "_source": {
          "my_field1": "apple"
        }
      },
      {
        "_index": "test3",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.6931472,
        "_source": {
          "my_field1": "apple tree"
        }
      },
      {
        "_index": "test3",
        "_type": "_doc",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "my_field1": "green apple"
        }
      }
    ]
  }
}
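
The query above boosts only the exact match and leaves the "starts with" ordering to the default scoring. If that ordering needs to be explicit, one option is to add a prefix clause on the keyword subfield. The Python sketch below builds such a request body. The term and prefix queries are standard Elasticsearch queries, but the boost values 4 and 2 and the helper name build_ranked_query are assumptions for illustration. They are not part of the original answer, and the query has not been run against the index above.

```python
import json

def build_ranked_query(term, field="my_field1"):
    """Build a bool query with three tiers: exact value, prefix, any token match."""
    keyword_field = field + ".keyword"
    return {
        "query": {
            "bool": {
                "should": [
                    # Tier 3: any document containing the term as a token.
                    {"match": {field: {"query": term}}},
                    # Tier 2: whole field value starts with the term (assumed boost).
                    {"prefix": {keyword_field: {"value": term, "boost": 2}}},
                    # Tier 1: whole field value equals the term (assumed boost).
                    {"term": {keyword_field: {"value": term, "boost": 4}}},
                ]
            }
        }
    }

body = build_ranked_query("apple")
print(json.dumps(body, indent=2))
```

Because the scores of matching should clauses add up, "apple" would match all three clauses, "apple tree" two, and "green apple" one. That should give the desired order as long as the boosts are large relative to the match scores. Note that keyword fields are case-sensitive, so this sketch assumes the search term and the stored values use the same case.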

08-14 23:48