elasticsearch - Google-style autocomplete and autocorrect with Elasticsearch

I'm trying to implement Google-style autocomplete and autocorrect with Elasticsearch.

Mapping:

POST music
{
  "settings": {
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "song": {
      "properties": {
        "song_field": {
          "type": "string",
          "analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer"
        },
        "suggest": {
          "type": "completion",
          "analyzer": "simple",
          "search_analyzer": "simple",
          "payloads": true
        }
      }
    }
  }
}

Documents:

POST music/song
{
  "song_field" : "beautiful queen",
  "suggest" : "beautiful queen"
}

POST music/song
{
  "song_field" : "beautiful",
  "suggest" : "beautiful"
}

When a user types "beaatiful q", I want them to get something like "beautiful queen" ("beaatiful" corrected to "beautiful", and "q" completed to "queen").

I tried the following query:

POST music/song/_search?search_type=dfs_query_then_fetch
{
  "size": 10,
  "suggest": {
    "didYouMean": {
      "text": "beaatiful q",
      "completion": {
        "field": "suggest"
      }
    }
  },
  "query": {
    "match": {
      "song_field": {
        "query": "beaatiful q",
        "fuzziness": 2
      }
    }
  }
}

Unfortunately, the Completion suggester does not tolerate any typos, so I get the following response:

"suggest": {
    "didYouMean": [
      {
        "text": "beaatiful q",
        "offset": 0,
        "length": 11,
        "options": []
      }
    ]
  }

Also, the search gave me these results ("beautiful" ranks higher even though the user had started typing "queen"):

"hits": [
      {
        "_index": "music",
        "_type": "song",
        "_id": "AVUj4Y5NancUpEdFLeLo",
        "_score": 0.51315063,
        "_source": {
          "song_field": "beautiful",
          "suggest": "beautiful"
        }
      },
      {
        "_index": "music",
        "_type": "song",
        "_id": "AVUj4XFAancUpEdFLeLn",
        "_score": 0.32071912,
        "_source": {
          "song_field": "beautiful queen",
          "suggest": "beautiful queen"
        }
      }
    ]

Update

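I found that a fuzzy query can be used with the completion suggester, but now I get no suggestions at all when querying (fuzzy only supports an edit distance of up to 2):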

POST music/song/_search
{
  "size": 10,
  "suggest": {
    "didYouMean": {
      "text": "beaatefal q",
      "completion": {
        "field": "suggest",
        "fuzzy": {
          "fuzziness": 2
        }
      }
    }
  }
}

I would still expect "beautiful queen" as the suggestion response.
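As a side check on the edit-distance limit mentioned above, here is a minimal sketch in plain Python (not Elasticsearch code) that measures how far the two misspellings are from "beautiful". The function name `levenshtein` is just illustrative. Elasticsearch's fuzzy matching also counts transpositions by default, but neither input here involves a transposition, so plain Levenshtein distance should give the same numbers.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(levenshtein("beaatiful", "beautiful"))  # 1
print(levenshtein("beaatefal", "beautiful"))  # 3
```

If this reasoning is right, "beaatefal" is three edits away from "beautiful", which is beyond a fuzziness of 2 and would be one plausible reason the second query returns no options.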

Best Answer

When you want to offer two or more words as a search suggestion, I found out (the hard way) that using ngrams or edge ngrams in Elasticsearch isn't worth it.

Using the Shingles token filter and a shingle analyzer gives you multi-word phrases, and if you combine that with match_phrase_prefix, it should give you the functionality you're looking for.

Basically, something like this:

PUT /my_index
{
    "settings": {
        "number_of_shards": 1,
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type":             "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams":  false
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type":             "custom",
                    "tokenizer":        "standard",
                    "filter": [
                        "lowercase",
                        "my_shingle_filter"
                    ]
                }
            }
        }
    }
}

And don't forget to set up the mapping:

{
    "my_type": {
        "properties": {
            "title": {
                "type": "string",
                "fields": {
                    "shingles": {
                        "type":     "string",
                        "analyzer": "my_shingle_analyzer"
                    }
                }
            }
        }
    }
}
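The answer mentions match_phrase_prefix but doesn't show the query itself. Here is a rough sketch of what the request body might look like, built as a Python dict so it can be inspected before being sent with any HTTP client. It assumes the `title.shingles` sub-field from the mapping above; the helper name `build_phrase_prefix_query` and the `max_expansions` value of 50 are illustrative choices, not something from the original answer.

```python
import json


def build_phrase_prefix_query(user_input, field="title.shingles", max_expansions=50):
    """Build a match_phrase_prefix search body for partially typed input.

    The last word of user_input is treated as a prefix; max_expansions caps
    how many terms that prefix may expand to.
    """
    return {
        "query": {
            "match_phrase_prefix": {
                field: {
                    "query": user_input,
                    "max_expansions": max_expansions,
                }
            }
        }
    }


# Body for a search request against my_index (assumed index name from above)
body = build_phrase_prefix_query("beautiful q")
print(json.dumps(body, indent=2))
```

Note that this only covers the completion half ("q" to "queen"); it does not correct typos. Also, with `output_unigrams` set to false, a single-word input produces no shingles at all, so you may want to query the plain `title` field as well.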

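Ngrams and edge ngrams break text into fragments of individual characters, whereas the shingle filter and analyzer work on whole words (groups of letters) and give you a far more efficient way to produce and search for phrases. I spent a lot of time messing around with the first two until I saw shingles mentioned and read up on them. Much better.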
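To make the difference concrete, here is a small plain-Python approximation of the two approaches. It is not how Elasticsearch implements them: splitting on whitespace only roughly stands in for the standard tokenizer, the token order differs from the real nGram filter, and the function names `shingles` and `ngrams` are made up for this sketch. The defaults mirror the settings used earlier (shingle size 2; `min_gram` 2, `max_gram` 20).

```python
def shingles(text, size=2):
    """Word-level shingles: every run of `size` consecutive words."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def ngrams(word, min_gram=2, max_gram=20):
    """Character-level n-grams: every substring of length min_gram..max_gram."""
    word = word.lower()
    out = []
    for n in range(min_gram, min(max_gram, len(word)) + 1):
        for i in range(len(word) - n + 1):
            out.append(word[i:i + n])
    return out


print(shingles("my beautiful queen"))  # ['my beautiful', 'beautiful queen']
print(len(ngrams("queen")))            # 10
print(len(ngrams("beautiful")))        # 36
```

A three-word title yields two shingles, while a single nine-letter word already yields 36 n-grams, which gives a feel for why the index grows so much faster with ngrams.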