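I have two documents whose title fields are:
News ? E/CIS
New Website ? E/CIS
If I search for the term
new website
the News document scores much higher than the other one, which is clearly not what I want. I ran the query with explain enabled and got:
'hits': [{'_explanation': {'desc': 'product of:',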
'det': [{'desc': 'sum of:',
'det': [{'desc': 'product of:',
'det': [{'desc': 'sum of:',
'det': [{'desc': 'weight(title:new in 0) [PerFieldSimilarity], result of:',
'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
'det': [{'desc': 'queryWeight, product of:',
'det': [{'desc': 'idf(docFreq=1, maxDocs=6)',
'value': 2.0986123},
{'desc': 'queryNorm',
'value': 0.14544667}],
'value': 0.3052362},
{'desc': 'fieldWeight in 0, product of:',
'det': [{'desc': 'tf(freq=1.0), with freq of:',
'det': [{'desc': 'termFreq=1.0',
'value': 1.0}],
'value': 1.0},
{'desc': 'idf(docFreq=1, maxDocs=6)',
'value': 2.0986123},
{'desc': 'fieldNorm(doc=0)',
'value': 0.625}],
'value': 1.3116326}],
'value': 0.40035775}],
'value': 0.40035775}],
'value': 0.40035775},
{'desc': 'coord(1/2)',
'value': 0.5}],
'value': 0.20017888}],
'value': 0.20017888},
{'desc': 'coord(1/2)',
'value': 0.5}],
'value': 0.10008944},
'_id': '2ff1307b536102e41e7daaccaf7edc69b16a348c',
'_index': 'scrapy',
'_node': 'D9SgrDb5RnO4NMAJMHiAOA',
'_score': 0.100089446,
'_shard': 3,
'_source': {'title': ['\n News ? E/CIS\n '],
'url': 'http://178.4.12.128:8888/news/'},
'_type': 'pages'},
{'_explanation': {'desc': 'product of:',
'det': [{'desc': 'sum of:',
'det': [{'desc': 'sum of:',
'det': [{'desc': 'weight(title:new in 0) [PerFieldSimilarity], result of:',
'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
'det': [{'desc': 'queryWeight, product of:',
'det': [{'desc': 'idf(docFreq=1, maxDocs=1)',
'value': 0.30685282},
{'desc': 'queryNorm',
'value': 0.46183997}],
'value': 0.1417169},
{'desc': 'fieldWeight in 0, product of:',
'det': [{'desc': 'tf(freq=1.0), with freq of:',
'det': [{'desc': 'termFreq=1.0',
'value': 1.0}],
'value': 1.0},
{'desc': 'idf(docFreq=1, maxDocs=1)',
'value': 0.30685282},
{'desc': 'fieldNorm(doc=0)',
'value': 0.5}],
'value': 0.15342641}],
'value': 0.021743115}],
'value': 0.021743115},
{'desc': 'weight(title:websit in 0) [PerFieldSimilarity], result of:',
'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
'det': [{'desc': 'queryWeight, product of:',
'det': [{'desc': 'idf(docFreq=1, maxDocs=1)',
'value': 0.30685282},
{'desc': 'queryNorm',
'value': 0.46183997}],
'value': 0.1417169},
{'desc': 'fieldWeight in 0, product of:',
'det': [{'desc': 'tf(freq=1.0), with freq of:',
'det': [{'desc': 'termFreq=1.0',
'value': 1.0}],
'value': 1.0},
{'desc': 'idf(docFreq=1, maxDocs=1)',
'value': 0.30685282},
{'desc': 'fieldNorm(doc=0)',
'value': 0.5}],
'value': 0.15342641}],
'value': 0.021743115}],
'value': 0.021743115}],
'value': 0.04348623}],
'value': 0.04348623},
{'desc': 'coord(1/2)',
'value': 0.5}],
'value': 0.021743115},
'_id': '265988d175a2b4a2ae2e462509089d5f701ed372',
'_index': 'scrapy',
'_node': 'D9SgrDb5RnO4NMAJMHiAOA',
'_score': 0.021743115,
'_shard': 0,
'_source': {'title': ['\n New Website ? E/CIS\n '],
'url': 'http://178.4.12.128:8888/news/2015-new-website/'},
'_type': 'pages'}],
'max_score': 0.100089446,
'total': 2}
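Note that I shortened details to det and description to desc to save space. The biggest difference in the scores appears to come from maxDocs, which is 6 for one document and 1 for the other. Why is there a difference there? I thought maxDocs was the number of documents in the index. Shouldn't it be the same for both?
More details
Here are the full details, though they may not be needed:
Query
My query: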
'multi_match': {
'query': 'new website',
'type': 'most_fields',
'fields': ['title.raw^15', 'title^10'],
'analyzer': 'whitespace_analyzer',
}
Mapping
'title': {
'type': 'string',
'store': 'yes',
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer",
'fields': {
'raw': {
'type': 'string',
'store': 'yes',
"search_analyzer": "whitespace_analyzer",
"index": "not_analyzed",
},
}
},
Analyzers and filters
'analysis': {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
},
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"html_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"char_filter": ["html_strip"],
"filter": [
'english_possessive_stemmer',
"lowercase",
'english_stop',
'english_stemmer',
"asciifolding",
]
},
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"char_filter": ["html_strip"], # Strips the html tags
"filter": [
'english_possessive_stemmer',
"lowercase",
'english_stop',
'english_stemmer',
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
'english_possessive_stemmer',
"lowercase",
'english_stop',
'english_stemmer',
"asciifolding",
]
}
}
}
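Best answer
The default search type is query_then_fetch.
Both query_then_fetch and query_and_fetch compute term and document frequencies locally on each shard of the index. In your results the two hits come from different shards (_shard 3 and _shard 0), which is why each one reports its own maxDocs.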
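To make the per-shard effect concrete, here is a minimal sketch that reproduces the two idf values from the explain output. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(maxDocs / (docFreq + 1)), which matches the numbers shown in the question. The function name classic_idf and the "pooled" calculation are illustrative only: the pooled figure combines just the two shards visible in the output, not the real index-wide statistics.

```python
import math


def classic_idf(doc_freq, max_docs):
    """Classic Lucene TF-IDF inverse document frequency (assumed formula)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


# Per-shard statistics copied from the explain output in the question.
idf_shard_3 = classic_idf(doc_freq=1, max_docs=6)  # shard with the "News" document
idf_shard_0 = classic_idf(doc_freq=1, max_docs=1)  # shard with the "New Website" document

print("shard 3 idf:", round(idf_shard_3, 7))  # explain output reports 2.0986123
print("shard 0 idf:", round(idf_shard_0, 7))  # explain output reports 0.30685282

# Hypothetical: pool the statistics of only these two shards, as a stand-in
# for what a shared (index-wide) idf would look like. Both documents would
# then be scored with the same idf for the term "new".
pooled_idf = classic_idf(doc_freq=1 + 1, max_docs=6 + 1)
print("pooled idf:", round(pooled_idf, 7))
```

Because idf enters the score twice (once in queryWeight and once in fieldWeight), the gap between the two shards is amplified in the final score.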
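However, if you want term and document frequencies to be computed more accurately, you can use dfs_query_then_fetch or dfs_query_and_fetch. With these search types, the frequencies are computed across all the index shards involved.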
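As a rough sketch of how the search type could be selected, the snippet below builds the URL and JSON body for a _search request with search_type passed as a query-string parameter. The host http://localhost:9200 and the helper name build_search_request are assumptions for illustration; the index name and the query are taken from the question. The request is only constructed here, not sent.

```python
import json
from urllib.parse import urlencode


def build_search_request(host, index, query, search_type="dfs_query_then_fetch"):
    """Return (url, body) for a _search request using the given search type."""
    params = urlencode({"search_type": search_type})
    url = "{}/{}/_search?{}".format(host.rstrip("/"), index, params)
    # "explain": True asks for the score breakdown, as in the question.
    body = json.dumps({"query": query, "explain": True})
    return url, body


# The multi_match query from the question.
query = {
    "multi_match": {
        "query": "new website",
        "type": "most_fields",
        "fields": ["title.raw^15", "title^10"],
        "analyzer": "whitespace_analyzer",
    }
}

# Assumed host; adjust to wherever the cluster is actually reachable.
url, body = build_search_request("http://localhost:9200", "scrapy", query)
print(url)
print(body)
```

The extra round trip that collects the distributed frequencies adds some latency, so this trade-off is mostly worthwhile on small indices where per-shard statistics are badly skewed, as they are here.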
This article gives a more detailed explanation.
Original question on Stack Overflow (elasticsearch: poor scoring due to differing maxDocs in IDF): https://stackoverflow.com/questions/30983765/