elasticsearch - Why does a fuller match score lower than a partial match in Elasticsearch full-text search?

I am using Elasticsearch to search some data because it offers better full-text search than MongoDB. But I am running into a few problems, one of which is:

My data is stored in Elasticsearch like this:

[{
   "word": "tidak berpuas hati",
   "type": "NEGATIVE",
   "score": -0.3908697916666666
  },{
   "word": "berpuas hati",
   "type": "POSITIVE",
   "score": 0.65375
  },{
   "word": "hati",
   "type": "POSITIVE",
   "score": 0.6
  },{
   "word": "tidak",
   "type": "NEGATIVE",
   "score": 0.6
}]

However, when I search this data for the sentence saya tidak berpuas hati, I get this response:

"hits": [
 {
    "_index": "sentiment",
    "_type": "ms",
    "_id": "8SPiimYBKsyQt_Jg1VYa",
    "_score": 8.838576,
    "_source": {
       "word": "berpuas hati",
       "type": "POSITIVE",
       "score": 0.65375
    },
    "highlight": {
       "word": [
          "<em>berpuas</em> <em>hati</em>"
       ]
    }
 },
 {
    "_index": "sentiment",
    "_type": "ms",
    "_id": "PiPiimYBKsyQt_Jg1U4U",
    "_score": 8.774891,
    "_source": {
       "word": "tidak berpuas hati",
       "type": "NEGATIVE",
       "score": -0.3908697916666666
    },
    "highlight": {
       "word": [
          "<em>tidak</em> <em>berpuas</em> <em>hati</em>"
       ]
    }
 },
 {
    "_index": "sentiment",
    "_type": "ms",
    "_id": "ByPiimYBKsyQt_Jg1VUZ",
    "_score": 5.045017,
    "_source": {
       "word": "hati",
       "type": "POSITIVE",
       "score": 0.6
    },
    "highlight": {
       "word": [
          "<em>hati</em>"
       ]
    }
  }
]

Here is my query:

query = {
            "from": 0,
            "size": 20,
            "query": {
                "match": {
                    "word": {
                        "query": term,
                        "operator": 'or',
                        "fuzziness": 'auto'
                    }
                }
            },
            "highlight": {
                "fields": {
                    "word": {}
                }
            }
        }

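So the problem is that I don't understand why tidak berpuas hati does not score higher than berpuas hati. When I change the value of from to 1, the ranking becomes correct for this sentence but stops working for single-word sentences.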

Best answer

Elasticsearch scores are computed per shard.

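In this case, the document with berpuas hati gets a higher score because it is more relevant within its own shard than the document with tidak berpuas hati is within its shard.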

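Relevance in Elasticsearch is determined by several factors, but here I would say the reason is that the shard holding tidak berpuas hati contains more documents with one (or several) of the query terms than the shard holding berpuas hati. The more documents on a shard contain a term, the less that term contributes to the score on that shard. It is a coincidence of how the documents happened to be distributed.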
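To make the per-shard effect concrete, here is a minimal sketch of the inverse document frequency term as Lucene's BM25 similarity computes it. It assumes the index uses BM25 (the default since Elasticsearch 5.0). The document counts are hypothetical and are not taken from the question's cluster.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """IDF component of Lucene's BM25: ln(1 + (N - n + 0.5) / (n + 0.5)).

    doc_count (N): number of documents with the field on this shard.
    doc_freq  (n): number of those documents that contain the term.
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical shard statistics for the same term on two shards.
idf_rare_shard = bm25_idf(doc_count=1000, doc_freq=10)     # term is rare here
idf_common_shard = bm25_idf(doc_count=1000, doc_freq=200)  # term is common here

print("IDF where the term is rare:  ", round(idf_rare_shard, 3))
print("IDF where the term is common:", round(idf_common_shard, 3))
```

The same term is worth less on the shard where it appears in more documents, so a document that matches three query terms on such a shard can end up slightly below a document that matches two terms on another shard.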

If you ran the same query against an index containing only these two documents, you would see berpuas hati score around 0.5 and tidak berpuas hati score around 0.75.

You can get an explanation of the score by adding "explain": true to your query. The scoring algorithm is described here: https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
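As a sketch, the query from the question with the explain flag added might look like this. The helper name build_query is hypothetical. The query body is otherwise the one from the question.

```python
def build_query(term, explain=True):
    """Build the question's match query, optionally asking for score explanations."""
    return {
        "from": 0,
        "size": 20,
        # Ask Elasticsearch to attach a score breakdown to every hit.
        "explain": explain,
        "query": {
            "match": {
                "word": {
                    "query": term,
                    "operator": "or",
                    "fuzziness": "auto",
                }
            }
        },
        "highlight": {"fields": {"word": {}}},
    }


query = build_query("saya tidak berpuas hati")
```

With explain enabled, each hit in the response carries an _explanation object that breaks the score down per term, along with the shard the hit came from, which makes it possible to compare the term statistics used for the two documents.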

You may also want to read this: https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-is-broken.html
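The page linked above describes the search type dfs_query_then_fetch, which collects term statistics from all shards before scoring so that scores are no longer shard-local. A minimal sketch of how the request could be assembled is below. The helper name build_search_kwargs is hypothetical, and the index name sentiment is taken from the response in the question. Whether this changes the ranking for this particular data is not something the answer confirms, and the extra round trip makes searches slower, so treat it as something to try while debugging.

```python
def build_search_kwargs(term, index="sentiment"):
    """Assemble search arguments that request globally computed term statistics."""
    return {
        "index": index,
        # Gather document frequencies from every shard before scoring.
        "search_type": "dfs_query_then_fetch",
        "body": {
            "from": 0,
            "size": 20,
            "query": {
                "match": {
                    "word": {
                        "query": term,
                        "operator": "or",
                        "fuzziness": "auto",
                    }
                }
            },
        },
    }


kwargs = build_search_kwargs("saya tidak berpuas hati")

# Usage, assuming the elasticsearch-py client and a reachable cluster:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch()
#   response = es.search(**kwargs)
```

For a small index like this one, creating the index with a single primary shard has the same effect, since all documents then share one set of term statistics.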