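I have two documents whose title fields are:

  • News
  • New Website

If I search for the term new website, the News document scores much higher than the other one, which is obviously not what I want. I ran an explain on the query and got: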

    'hits': [{'_explanation': {'desc': 'product of:',
       'det': [{'desc': 'sum of:',
        'det': [{'desc': 'product of:',
         'det': [{'desc': 'sum of:',
          'det': [{'desc': 'weight(title:new in 0) [PerFieldSimilarity], result of:',
           'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
            'det': [{'desc': 'queryWeight, product of:',
             'det': [{'desc': 'idf(docFreq=1, maxDocs=6)',
              'value': 2.0986123},
              {'desc': 'queryNorm',
               'value': 0.14544667}],
              'value': 0.3052362},
              {'desc': 'fieldWeight in 0, product of:',
               'det': [{'desc': 'tf(freq=1.0), with freq of:',
                'det': [{'desc': 'termFreq=1.0',
                 'value': 1.0}],
                'value': 1.0},
                {'desc': 'idf(docFreq=1, maxDocs=6)',
                 'value': 2.0986123},
                {'desc': 'fieldNorm(doc=0)',
                 'value': 0.625}],
                'value': 1.3116326}],
              'value': 0.40035775}],
           'value': 0.40035775}],
          'value': 0.40035775},
          {'desc': 'coord(1/2)',
           'value': 0.5}],
          'value': 0.20017888}],
        'value': 0.20017888},
        {'desc': 'coord(1/2)',
         'value': 0.5}],
        'value': 0.10008944},
        '_id': '2ff1307b536102e41e7daaccaf7edc69b16a348c',
        '_index': 'scrapy',
        '_node': 'D9SgrDb5RnO4NMAJMHiAOA',
        '_score': 0.100089446,
        '_shard': 3,
        '_source': {'title': ['\n       News ?  E/CIS\n    '],
         'url': 'http://178.4.12.128:8888/news/'},
        '_type': 'pages'},
        {'_explanation': {'desc': 'product of:',
         'det': [{'desc': 'sum of:',
          'det': [{'desc': 'sum of:',
           'det': [{'desc': 'weight(title:new in 0) [PerFieldSimilarity], result of:',
            'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
             'det': [{'desc': 'queryWeight, product of:',
              'det': [{'desc': 'idf(docFreq=1, maxDocs=1)',
               'value': 0.30685282},
               {'desc': 'queryNorm',
                'value': 0.46183997}],
               'value': 0.1417169},
               {'desc': 'fieldWeight in 0, product of:',
                'det': [{'desc': 'tf(freq=1.0), with freq of:',
                 'det': [{'desc': 'termFreq=1.0',
                  'value': 1.0}],
                 'value': 1.0},
                 {'desc': 'idf(docFreq=1, maxDocs=1)',
                  'value': 0.30685282},
                 {'desc': 'fieldNorm(doc=0)',
                  'value': 0.5}],
                 'value': 0.15342641}],
               'value': 0.021743115}],
            'value': 0.021743115},
            {'desc': 'weight(title:websit in 0) [PerFieldSimilarity], result of:',
             'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
              'det': [{'desc': 'queryWeight, product of:',
               'det': [{'desc': 'idf(docFreq=1, maxDocs=1)',
                'value': 0.30685282},
                {'desc': 'queryNorm',
                 'value': 0.46183997}],
                'value': 0.1417169},
                {'desc': 'fieldWeight in 0, product of:',
                 'det': [{'desc': 'tf(freq=1.0), with freq of:',
                  'det': [{'desc': 'termFreq=1.0',
                   'value': 1.0}],
                  'value': 1.0},
                  {'desc': 'idf(docFreq=1, maxDocs=1)',
                   'value': 0.30685282},
                  {'desc': 'fieldNorm(doc=0)',
                   'value': 0.5}],
                  'value': 0.15342641}],
                'value': 0.021743115}],
             'value': 0.021743115}],
            'value': 0.04348623}],
          'value': 0.04348623},
          {'desc': 'coord(1/2)',
           'value': 0.5}],
          'value': 0.021743115},
          '_id': '265988d175a2b4a2ae2e462509089d5f701ed372',
          '_index': 'scrapy',
          '_node': 'D9SgrDb5RnO4NMAJMHiAOA',
          '_score': 0.021743115,
          '_shard': 0,
          '_source': {'title': ['\n       New Website ?  E/CIS\n    '],
           'url': 'http://178.4.12.128:8888/news/2015-new-website/'},
          '_type': 'pages'}],
    'max_score': 0.100089446,
    'total': 2}
    

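Note that I shortened details to det and description to desc to save space.

It looks like the biggest difference in score comes from the difference in maxDocs. Why do the two hits have different values there? I thought maxDocs was the number of documents in the index, so shouldn't it be the same for both?

More details

Here are the full details, although they may not be needed:

Query

My query: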

     'multi_match': {
        'query': 'new website',
        'type': 'most_fields',
        'fields': ['title.raw^15', 'title^10'],
        'analyzer': 'whitespace_analyzer',
     }
    

Mapping

     'title': {
         'type': 'string',
         'store': 'yes',
         "index_analyzer": "nGram_analyzer",
         "search_analyzer": "whitespace_analyzer",
         'fields': {
             'raw': {
                 'type': 'string',
                 'store': 'yes',
                 "search_analyzer": "whitespace_analyzer",
                 "index": "not_analyzed",
             },
         }
     },
    

Analyzers and filters

      'analysis': {
          "filter": {
              "nGram_filter": {
                  "type": "nGram",
                  "min_gram": 2,
                  "max_gram": 20,
                  "token_chars": [
                      "letter",
                      "digit",
                      "punctuation",
                      "symbol"
                  ]
              },
              "english_stop": {
                  "type":       "stop",
                  "stopwords":  "_english_"
              },
              "english_stemmer": {
                  "type":       "stemmer",
                  "language":   "english"
              },
              "english_possessive_stemmer": {
                  "type":       "stemmer",
                  "language":   "possessive_english"
              }
          },
          "analyzer": {
              "html_analyzer": {
                  "type": "custom",
                  "tokenizer": "whitespace",
                  "char_filter": ["html_strip"],
                  "filter": [
                      'english_possessive_stemmer',
                      "lowercase",
                      'english_stop',
                      'english_stemmer',
                      "asciifolding",
                  ]
              },
              "nGram_analyzer": {
                  "type": "custom",
                  "tokenizer": "whitespace",
                  "char_filter": ["html_strip"], # Strips the html tags
                  "filter": [
                      'english_possessive_stemmer',
                      "lowercase",
                      'english_stop',
                      'english_stemmer',
                      "asciifolding",
                      "nGram_filter"
                  ]
              },
              "whitespace_analyzer": {
                  "type": "custom",
                  "tokenizer": "whitespace",
                  "filter": [
                      'english_possessive_stemmer',
                      "lowercase",
                      'english_stop',
                      'english_stemmer',
                      "asciifolding",
                  ]
              }
          }
      }
    

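Best answer

The default search type is query_then_fetch. Both query_then_fetch and query_and_fetch compute term and document frequencies locally on each shard of the index. That is why the two hits, which live on different shards (_shard 3 and _shard 0), report different maxDocs values.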
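To see how much the shard-local maxDocs matters, here is a minimal sketch that reproduces the numbers from the explain output. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(maxDocs / (docFreq + 1)), which matches the values shown above; the helper names classic_idf and term_score are hypothetical, and queryNorm and fieldNorm are simply copied from the explain output rather than derived.

```python
import math


def classic_idf(doc_freq, max_docs):
    # Classic Lucene TF-IDF similarity: idf = 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def term_score(doc_freq, max_docs, query_norm, field_norm, tf=1.0):
    # score = queryWeight * fieldWeight, as in the explain output:
    #   queryWeight = idf * queryNorm
    #   fieldWeight = tf * idf * fieldNorm
    idf = classic_idf(doc_freq, max_docs)
    query_weight = idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


# Shard 3 (the News document): idf(docFreq=1, maxDocs=6)
print(classic_idf(1, 6))  # about 2.0986123

# Shard 0 (the New Website document): idf(docFreq=1, maxDocs=1)
print(classic_idf(1, 1))  # about 0.30685282

# Per-term scores before the coord factors are applied.
print(term_score(1, 6, query_norm=0.14544667, field_norm=0.625))  # about 0.400358
print(term_score(1, 1, query_norm=0.46183997, field_norm=0.5))    # about 0.021743
```

Because idf enters the score twice (once in queryWeight and once in fieldWeight), a document that happens to sit on a shard with more documents gets a much larger score for the same term.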

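However, if you want the term and document frequencies to be computed more accurately, you can use dfs_query_then_fetch or dfs_query_and_fetch. With these search types the frequencies are computed across all shards of the indices involved, so every hit is scored against the same statistics.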
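As a rough sketch of how the search type could be set from Python: the index name scrapy and the multi_match body come from the question, while build_search_request and run_search are hypothetical helpers. The sketch assumes an elasticsearch-py client of the same era as the cluster, whose search() method accepts index, body and search_type keyword arguments.

```python
def build_search_request(text):
    # Same multi_match query as in the question, but scored with
    # index-wide term statistics instead of shard-local ones.
    return {
        "index": "scrapy",
        "search_type": "dfs_query_then_fetch",
        "body": {
            "explain": True,
            "query": {
                "multi_match": {
                    "query": text,
                    "type": "most_fields",
                    "fields": ["title.raw^15", "title^10"],
                    "analyzer": "whitespace_analyzer",
                }
            },
        },
    }


def run_search(es, text):
    # `es` is assumed to be an elasticsearch.Elasticsearch client instance.
    return es.search(**build_search_request(text))
```

With a running cluster this would be called as run_search(Elasticsearch(), "new website"). The extra round trip that collects the global frequencies makes the dfs variants somewhat slower, which is the usual trade-off against scoring accuracy on small or unevenly distributed indices.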

This article gives a more detailed explanation.

Regarding elasticsearch - poor scoring caused by different maxDocs values in the IDF, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30983765/
