This question is similar to another question that Val answered.
I have an index containing 3 documents.
{
  "firstname": "Anne",
  "lastname": "Borg"
}
{
  "firstname": "Leanne",
  "lastname": "Ray"
}
{
  "firstname": "Anne",
  "middlename": "M",
  "lastname": "Stone"
}
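When I search for "Ann", I would like Elastic to return all 3 of these documents (because they all match the term "Ann" to some degree). However, I would like Leanne Ray to have a lower score (relevance ranking), because the search term "Ann" appears at a later position in that document than it does in the other two.
Here are my index settings...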
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "token_chars": [
            "letter",
            "digit",
            "custom"
          ],
          "custom_token_chars": "'-",
          "min_gram": "1",
          "type": "ngram",
          "max_gram": "2"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "firstname": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        },
        "copy_to": [
          "full_name"
        ]
      },
      "lastname": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        },
        "copy_to": [
          "full_name"
        ]
      },
      "middlename": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "copy_to": [
          "full_name"
        ]
      },
      "full_name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
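To make the effect of these settings more concrete, here is a rough Python approximation of what my_analyzer would emit for a value. It is only a sketch: the helper names split_words and ngram_tokenize are made up for illustration, letters and digits are approximated with str.isalnum, and lowercasing is folded into the tokenizer rather than applied as a separate filter step.

```python
def split_words(text, extra_chars="'-"):
    """Split on any character that is not a letter, a digit, or a custom token char."""
    words, current = [], []
    for ch in text:
        if ch.isalnum() or ch in extra_chars:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def ngram_tokenize(text, min_gram=1, max_gram=2):
    """Emit lowercased grams of length min_gram..max_gram for every word."""
    tokens = []
    for word in split_words(text):
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    tokens.append(word[start:start + size].lower())
    return tokens


print(ngram_tokenize("Ann"))
print(ngram_tokenize("Leanne Ray"))
```

With min_gram 1 and max_gram 2, every value is reduced to single letters and letter pairs, so the index keeps very little information about where in a name a gram occurred.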
The following query brings back the expected documents, but scores Leanne Ray higher than Anne Borg.
{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "Ann",
          "fields": ["full_name"]
        }
      },
      "should": {
        "match": {
          "full_name": "Ann"
        }
      }
    }
  }
}
Here are the results...
"hits": [
  {
    "_index": "contacts_4",
    "_type": "_doc",
    "_id": "2",
    "_score": 6.6333585,
    "_source": {
      "firstname": "Anne",
      "middlename": "M",
      "lastname": "Stone"
    }
  },
  {
    "_index": "contacts_4",
    "_type": "_doc",
    "_id": "1",
    "_score": 6.142234,
    "_source": {
      "firstname": "Leanne",
      "lastname": "Ray"
    }
  },
  {
    "_index": "contacts_4",
    "_type": "_doc",
    "_id": "3",
    "_score": 6.079495,
    "_source": {
      "firstname": "Anne",
      "lastname": "Borg"
    }
  }
]
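One possible contributor to this ordering can be checked by hand: with 1 to 2 character grams and lowercasing, "Leanne Ray" contains the gram "a" twice, whereas "Anne Borg" contains it only once. The Python sketch below simply counts how often each gram of the query occurs in each name. It does not reproduce Elasticsearch's actual scoring, which also depends on field length and term rarity, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(text, min_gram=1, max_gram=2):
    """Lowercased grams of length min_gram..max_gram for each whitespace-separated word."""
    tokens = []
    for word in text.lower().split():
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    tokens.append(word[start:start + size])
    return tokens


def query_gram_frequencies(query, document):
    """For every distinct gram of the query, count its occurrences in the document."""
    doc_counts = Counter(ngrams(document))
    return {gram: doc_counts[gram] for gram in sorted(set(ngrams(query)))}


for name in ("Anne M Stone", "Leanne Ray", "Anne Borg"):
    print(name, query_gram_frequencies("Ann", name), "total grams:", len(ngrams(name)))
```

All three names contain every gram of "Ann", so nothing in these raw counts rewards a match at the start of the name.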
Using the ngram token filter together with the ngram tokenizer seems to fix this problem...
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "ngram"
          ],
          "tokenizer": "ngram"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "firstname": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        },
        "copy_to": [
          "full_name"
        ]
      },
      "lastname": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        },
        "copy_to": [
          "full_name"
        ]
      },
      "middlename": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        },
        "copy_to": [
          "full_name"
        ]
      },
      "full_name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "my_analyzer"
      }
    }
  }
}
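The same query brings back the expected results with the desired relative scores. Why does this work? Note that above I am using an ngram tokenizer with a lowercase filter; the only difference here is that I am using an ngram filter instead of the lowercase filter.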
Here are the results. Note that Leanne Ray scores lower than both Anne Borg and Anne M Stone.
"hits": [
  {
    "_index": "contacts_4",
    "_type": "_doc",
    "_id": "3",
    "_score": 4.953257,
    "_source": {
      "firstname": "Anne",
      "lastname": "Borg"
    }
  },
  {
    "_index": "contacts_4",
    "_type": "_doc",
    "_id": "2",
    "_score": 4.87168,
    "_source": {
      "firstname": "Anne",
      "middlename": "M",
      "lastname": "Stone"
    }
  },
  {
    "_index": "contacts_4",
    "_type": "_doc",
    "_id": "1",
    "_score": 1.0364896,
    "_source": {
      "firstname": "Leanne",
      "lastname": "Ray"
    }
  }
]
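One detail in the second configuration may account for part of the change: its analyzer chain no longer contains a lowercase filter, so grams keep their original case. The Python sketch below approximates the built-in ngram tokenizer followed by the built-in ngram filter, assuming both run with their default gram sizes of 1 to 2 and that each copied field value is analyzed on its own. The function names are invented for illustration, and this is a guess at a contributing factor rather than a confirmed explanation.

```python
def ngrams(text, min_gram=1, max_gram=2):
    """All grams of length min_gram..max_gram, case preserved."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def analyze(value):
    """Approximate an ngram tokenizer followed by an ngram token filter."""
    tokens = []
    for token in ngrams(value):       # tokenizer step
        tokens.extend(ngrams(token))  # filter step applied to each token
    return tokens


def shared_grams(query, field_values):
    """Distinct grams of the query that also occur in any of the field values."""
    doc_tokens = set()
    for value in field_values:
        doc_tokens.update(analyze(value))
    return set(analyze(query)) & doc_tokens


print(sorted(shared_grams("Ann", ["Anne", "Borg"])))
print(sorted(shared_grams("Ann", ["Leanne", "Ray"])))
```

Under these assumptions the capitalised grams "A" and "An" from the query occur in "Anne" but not in "Leanne", which only contains the lowercase "a" and "an". If that is what drives the score gap, the ranking reflects letter case rather than match position, and a lowercase query such as "ann" would behave differently.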
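As an aside, this query also brings back a lot of false positives when the index contains other documents. This is not a big problem, because the false positives score very low relative to the ideal hits, but it is still not ideal. For example, if I add {firstname: Gideon, lastname: Grossma} to the index, the query above also brings back that document in the result set, albeit with a much lower score than the documents containing the string "Ann".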
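The false positive is consistent with a single shared gram being enough to match. As a rough check, this Python sketch lists the 1 to 2 character grams that "Ann" has in common with "Gideon Grossma", both with and without lowercasing, since the two index configurations above differ on that point. The helper names are hypothetical.

```python
def gram_set(text, lowercase, min_gram=1, max_gram=2):
    """Distinct grams of length min_gram..max_gram for each whitespace-separated word."""
    grams = set()
    for word in text.split():
        if lowercase:
            word = word.lower()
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    grams.add(word[start:start + size])
    return grams


def shared(query, name, lowercase):
    """Grams that the query and the name have in common."""
    return gram_set(query, lowercase) & gram_set(name, lowercase)


print(sorted(shared("Ann", "Gideon Grossma", lowercase=True)))
print(sorted(shared("Ann", "Gideon Grossma", lowercase=False)))
```

In both cases at least one single-letter gram overlaps, so with a minimum gram length of 1 almost any name shares something with the query.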
Best answer
The answer is the same as in the linked thread. Since you are ngramming all of the indexed data, it works the same way with Ann as with Anne: you get exactly the same response (see below), albeit with different scores:
"hits" : [
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "5Jr-DHIBhYuDqANwSeiw",
    "_score" : 4.8442974,
    "_source" : {
      "firstname" : "Anne",
      "lastname" : "Borg"
    }
  },
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "5pr-DHIBhYuDqANwSeiw",
    "_score" : 4.828779,
    "_source" : {
      "firstname" : "Anne",
      "middlename" : "M",
      "lastname" : "Stone"
    }
  },
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "5Zr-DHIBhYuDqANwSeiw",
    "_score" : 0.12874341,
    "_source" : {
      "firstname" : "Leanne",
      "lastname" : "Ray"
    }
  }
]
Update
Here is a modified query that can be used to check for parts (i.e. ann vs anne). Also, casing makes no difference here, since the analyzer lowercases everything before indexing.
{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "ann",
          "fields": [
            "full_name"
          ]
        }
      },
      "should": [
        {
          "match_phrase_prefix": {
            "firstname": {
              "query": "ann",
              "boost": "10"
            }
          }
        },
        {
          "match_phrase_prefix": {
            "lastname": {
              "query": "ann",
              "boost": "10"
            }
          }
        }
      ]
    }
  }
}
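The idea behind this query is that match_phrase_prefix treats the last term as a prefix, so on the firstname and lastname fields "ann" matches "Anne" but not "Leanne", and the boost lifts those documents above the plain ngram matches. If the same pattern is needed for other search terms or fields, the body could be generated instead of written by hand. The Python sketch below shows one way to do that; the function name build_prefix_boost_query and its parameters are hypothetical, it only builds the request body and does not call Elasticsearch, and it writes the boost as a number rather than the string used above.

```python
import json


def build_prefix_boost_query(term, search_field="full_name",
                             boost_fields=("firstname", "lastname"), boost=10):
    """Build a bool query: ngram match in `must`, boosted prefix matches in `should`."""
    should = [
        {"match_phrase_prefix": {field: {"query": term, "boost": boost}}}
        for field in boost_fields
    ]
    return {
        "query": {
            "bool": {
                "must": {
                    "query_string": {"query": term, "fields": [search_field]}
                },
                "should": should,
            }
        }
    }


print(json.dumps(build_prefix_boost_query("ann"), indent=2))
```

Adding middlename to boost_fields would extend the boost to middle names that start with the term.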