java - Elasticsearch cross_fields, edge ngram analyzer

I have 999 documents that I use for experimenting with Elasticsearch.

My type mapping has a field f4, which is analyzed, and the analyzer is set up as follows:

  "myNGramAnalyzer" => [
       "type" => "custom",
        "char_filter" => ["html_strip"],
        "tokenizer" => "standard",
        "filter" => ["lowercase","standard","asciifolding","stop","snowball","ngram_filter"]
  ]

My filter is as follows:

  "filter" => [
        "ngram_filter" => [
            "type" => "edgeNGram",
            "min_gram" => "2",
            "max_gram" => "20"
        ]
  ]

The values of field f4 are "Proj1", "Proj2", "Proj3", and so on.

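Now, when I search for the string "proj1" with a cross_fields query, I expect the document containing "Proj1" to come back at the top of the response with the highest score. That is not what happens. The rest of the data is almost identical across all documents.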

I also don't understand why it matches all 999 documents.
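One likely explanation can be sketched in Python. Unless a separate search analyzer is configured, the query string is analyzed with the same analyzer as the field, so "proj1" is itself expanded into edge n-grams. As far as I understand, tokens produced at the same position are treated as alternatives, so a document only needs to contain one of them. The `edge_ngrams` helper below is a hypothetical stand-in for the lowercase and edgeNGram steps of myNGramAnalyzer; it ignores the stop and snowball filters, and the sample values are taken from the results shown further down.

```python
def edge_ngrams(text, min_gram=2, max_gram=20):
    """Rough stand-in for lowercase + edgeNGram applied to a single token."""
    token = text.lower()
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# Terms the query "proj1" turns into when analyzed with the n-gram analyzer.
query_terms = set(edge_ngrams("proj1"))

for value in ["Proj1", "Proj42", "Proj47"]:
    # Any shared term is enough for the document to match.
    shared = sorted(query_terms & set(edge_ngrams(value)))
    print(value, "shares", shared)
```

Every value shares at least "pr", "pro" and "proj" with the query, which would be enough to match all documents.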

Here is my search:

{
    "index": "myindex",
    "type": "mytype",
    "body": {
        "query": {
            "multi_match": {
                "query": "proj1",
                "type": "cross_fields",
                "operator": "and",
                "fields": "f*"
            }
        },
        "filter": {
            "term": {
                "deleted": "0"
            }
        }
    }
}

My search result is:

{
    "took": 12,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 999,
        "max_score": 1,
        "hits": [{
            "_index": "myindex",
            "_type": "mytype",
            "_id": "42",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "125650","f3": "BH.1511AI.001",
                "f4": "Proj42",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        }, {
            "_index": "myindex",
            "_type": "mytype",
            "_id": "47",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "137946","f3": "BH.152096.001",
                "f4": "Proj47",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        },
        //.......
        //.......
        //MANY RECORDS IN BETWEEN HERE
        //.......
        //.......
        {
            "_index": "myindex",
            "_type": "mytype",
            "_id": "1",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "142095","f3": "BH.705215.001",
                "f4": "Proj1",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        //.......
        //.......
        //MANY RECORDS IN BETWEEN HERE
        //.......
        //.......
        }]
    }
}

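Am I doing something wrong, or am I missing something? (Apologies for the lengthy question, but I thought I should give all the relevant information and leave out the other, unnecessary code.)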

Edit:

Term vector response

{
    "_index": "myindex",
    "_type": "mytype",
    "_id": "10",
    "_version": 1,
    "found": true,
    "took": 9,
    "term_vectors": {
        "f4": {
            "field_statistics": {
                "sum_doc_freq": 5886,
                "doc_count": 999,
                "sum_ttf": 5886
            },
            "terms": {
                "pr": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "pro": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj1": {
                    "doc_freq": 111,
                    "ttf": 111,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj10": {
                    "doc_freq": 11,
                    "ttf": 11,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                }
            }
        }
    }
}
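The statistics above can be reproduced with a short Python sketch. It assumes that the 999 values of f4 are exactly "Proj1" through "Proj999" (the question only shows the first few, so this is a guess) and approximates the analyzer as lowercase plus edge n-grams of length 2 to 20. Under that assumption the counts agree with the term vector output: every document contains "pr", while "proj1" is an indexed prefix of only 111 values.

```python
from collections import Counter


def edge_ngrams(token, min_gram=2, max_gram=20):
    """Simplified edge n-gram filter: lowercase, then prefixes of growing length."""
    token = token.lower()
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


# Assumption: the f4 values are Proj1 .. Proj999, one per document.
values = [f"Proj{i}" for i in range(1, 1000)]

# doc_freq[term] = number of documents whose f4 produces that term.
doc_freq = Counter()
for value in values:
    doc_freq.update(set(edge_ngrams(value)))

for term in ["pr", "pro", "proj", "proj1", "proj10"]:
    print(term, doc_freq[term])
print("sum_doc_freq", sum(doc_freq.values()))
```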

Edit 2

Mapping of field f4

"f4" : {
    "type" : "string",
    "index_analyzer" : "myNGramAnalyzer",
    "search_analyzer" : "standard"
}

I have updated the mapping to use the standard analyzer at query time. This improved the results, but they are still not what I expect.

Instead of 999 (all documents), it now returns 111 documents, such as "Proj1", "Proj11", "Proj111" ... "Proj1", "Proj181" ... etc.

"Proj1" still appears somewhere in the middle of the results rather than at the top.

Best answer

There is no index_analyzer (at least not as of Elasticsearch version 1.7). For mapping parameters, you can use analyzer and search_analyzer.
Try the following steps to make it work.

Create myindex with the analyzer settings:

PUT /myindex
{
   "settings": {
     "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "edge_ngram",
               "min_gram": 2,
               "max_gram": 20
            }
         },
         "analyzer": {
            "myNGramAnalyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "char_filter": "html_strip",
               "filter": [
                  "lowercase",
                  "standard",
                  "asciifolding",
                  "stop",
                  "snowball",
                  "ngram_filter"
               ]
            }
         }
      }
   }
}

Add a mapping to mytype (to keep it short, I have mapped only the relevant fields):

PUT /myindex/_mapping/mytype
{
   "properties": {
      "f1": {
         "type": "string"
      },
      "f4": {
         "type": "string",
         "analyzer": "myNGramAnalyzer",
         "search_analyzer": "standard"
      },
      "deleted": {
         "type": "string"
      }
   }
}

Index some data:

PUT myindex/mytype/1
{
    "f1":"396",
    "f4":"Proj12" ,
    "deleted": "0"
}

PUT myindex/mytype/2
{
    "f1":"42",
    "f4":"Proj22" ,
    "deleted": "1"
}

Now try the query:

GET myindex/mytype/_search
{
   "query": {
      "multi_match": {
         "query": "proj1",
         "type": "cross_fields",
         "operator": "and",
         "fields": "f*"
      }
   },
   "filter": {
      "term": {
         "deleted": "0"
      }
   }
}

It should return document #1. It works for me in Sense. I am using Elasticsearch 2.x.
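Why this works can be sketched in Python. At index time f4 is expanded into edge n-grams, but at search time the standard analyzer leaves "proj1" as a single lowercase term, so only values that start with "proj1" match, and the term filter on deleted then narrows the hits further. The helper names below are hypothetical, only f4 is searched rather than all f* fields, and the analyzers are heavily simplified (no stop words, stemming, or scoring).

```python
def index_terms(value, min_gram=2, max_gram=20):
    """Index-time analysis of f4: lowercase, then edge n-grams (simplified)."""
    token = value.lower()
    return {token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)}


def search_terms(query):
    """Search-time analysis: a rough stand-in for the standard analyzer."""
    return query.lower().split()


# The two documents indexed in the steps above.
docs = {
    1: {"f1": "396", "f4": "Proj12", "deleted": "0"},
    2: {"f1": "42", "f4": "Proj22", "deleted": "1"},
}


def search(query):
    hits = []
    for doc_id, doc in docs.items():
        # operator "and": every query term must be among the indexed terms.
        matches = all(term in index_terms(doc["f4"]) for term in search_terms(query))
        # Term filter on deleted == "0".
        if matches and doc["deleted"] == "0":
            hits.append(doc_id)
    return hits


print(search("proj1"))
```

In this simplified model "proj1" is one of the indexed prefixes of "Proj12" but not of "Proj22", so only document 1 is returned.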

Hope I was able to help :)