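elasticsearch - Elasticsearch mapping for UK postcodes that handles spacing and capitalization

I'm looking for a mapping/analyzer setup for Elasticsearch 7 with UK postcodes. We don't need any fuzzy operators, but it should handle differences in capitalization and spacing.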

Some examples:

The query string "SN13 9ED" should return:

sn139ed

SN13 9ED

Sn13 9ed

But should not return:

SN13 1EP

SN131EP

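By default this uses the keyword analyzer, which seems to be sensitive to spacing but not to capitalization. It also returns a match for SN13 1EP unless we write the query as SN13 AND 9ED.

Also, with the keyword analyzer, a query for SN13 9ED returns the SN13 1EP result with a higher relevance than SN13 9ED, even though the latter should be an exact match. Why would 2 matches in the same string score lower than just 1 match?

Postcode mapping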

"post_code": {
    "type": "text",
    "fields": {
        "keyword": {
            "type": "keyword",
            "ignore_above": 256
        }
    }
},

Query

  "query" => [
    "query_string" => [
      "query" => "KT2 7AJ"
    ]
  ]

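Best answer

Going by my comment, I believe you have already been able to filter out SN13 9ED when your search string is SN13 1EP.

I hope you're familiar with what Analysis is, how Analyzers work on text fields, and how the Standard Analyzer is applied by default to turn the text into tokens before they are stored in the inverted index. Note that this applies only to text fields.

Looking at your mapping, if you had searched on post_code rather than post_code.keyword, I believe the capitalization issue would be resolved: for text fields ES uses the Standard Analyzer by default, so your tokens end up lowercased in the index, and at query time ES applies the analyzer to the query string before searching the inverted index.

Note that, by default, the analyzer configured in the mapping is applied at both index time and search time for that field.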
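To make the index-time and search-time behaviour concrete, here is a minimal Python sketch. It is not Elasticsearch code: `analyze` is a hypothetical stand-in that only approximates the Standard Analyzer (split on non-word characters, then lowercase), and `matches` is a hypothetical helper that mimics the OR and AND operators over the resulting tokens.

```python
import re

def analyze(text):
    """Rough stand-in for the Standard Analyzer: split on non-word characters, then lowercase."""
    return [t.lower() for t in re.findall(r"\w+", text)]

def matches(query, doc, operator="OR"):
    """OR: any query token is in the document; AND: every query token is."""
    doc_tokens = set(analyze(doc))          # what would be indexed
    hits = [t in doc_tokens for t in analyze(query)]  # same analysis at query time
    return all(hits) if operator == "AND" else any(hits)

docs = ["SN13 9ED", "Sn13 9ed", "sn139ed", "SN13 1EP"]
for op in ("OR", "AND"):
    print(op, [d for d in docs if matches("SN13 9ED", d, op)])
```

Under these assumptions, capitalization stops mattering because both sides are lowercased, OR lets SN13 1EP through because it shares the sn13 token, and the unspaced sn139ed is never matched because it is indexed as a single token. That last case is what the rest of this answer addresses.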

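For the scenario where you have sn131ep, what I've done is use the Pattern Capture Token Filter, with a regex that splits the token into two parts of length 4 and 3 and stores them in the inverted index; in this case they would be sn13 and 1ep. I also lowercase them before they are stored in the inverted index.

Note that the solution I've added assumes your postcode has a fixed size, i.e. 7 characters. If that's not the case, you can add more patterns.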
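The splitting described above can be approximated in plain Python. This is a rough emulation under stated assumptions, not the Lucene implementation: `analyze` is a hypothetical function, the regex drops the possessive `+` (Python before 3.11 does not support it), and captures identical to the whole token are assumed to be skipped when the original is preserved.

```python
import re

# Same idea as the first filter pattern: a 4-character group or a 3-character group.
CAPTURE = re.compile(r"(\w{4})|(\w{3})")

def analyze(text):
    """Approximate the analyzer chain: pattern tokenizer -> pattern_capture -> lowercase."""
    tokens = []
    for word in re.split(r"\W+", text):    # the pattern tokenizer splits on non-word characters
        if not word:
            continue
        emitted = [word]                    # preserve_original: keep the whole token
        for m in CAPTURE.finditer(word):
            piece = m.group(0)
            if piece != word:               # assumption: a capture equal to the whole token is skipped
                emitted.append(piece)
        tokens.extend(t.lower() for t in emitted)
    return tokens

for postcode in ("SN131EP", "sn13 1EP", "KT27AJ"):
    print(postcode, analyze(postcode))
```

In this sketch the six-character KT27AJ yields kt27 but leaves aj uncaptured, which illustrates why the fixed 7-character assumption matters and why additional patterns would be needed for shorter postcodes.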

See the details below:

Mapping:

PUT my_postcode_index
{
  "settings": {
    "analysis": {
      "filter": {
        "mypattern": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "(\\w{4}+)|(\\w{3}+)",
            "\\s"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "pattern",
          "filter": [ "mypattern", "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "postcode": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Things to note in the mapping above: the first entry in patterns performs the 4 + 3 split (feel free to add more patterns), the second entry filters based on whitespace, the lowercase filter is listed after mypattern in my_analyzer, and my_analyzer is set as the analyzer of the postcode field.

Sample documents:

POST my_postcode_index/_doc/1
{
  "postcode": "SN131EP"
}

POST my_postcode_index/_doc/2
{
  "postcode": "sn13 1EP"
}

POST my_postcode_index/_doc/3
{
  "postcode": "sn131ep"
}

Note that these documents are semantically the same.

Request query:

POST my_postcode_index/_search
{
  "query": {
    "query_string": {
      "default_field": "postcode",
      "query": "SN13 1EP",
      "default_operator": "AND"
    }
  }
}

Response:

{
  "took" : 24,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.6246513,
    "hits" : [
      {
        "_index" : "my_postcode_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6246513,
        "_source" : {
          "postcode" : "SN131EP"
        }
      },
      {
        "_index" : "my_postcode_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.6246513,
        "_source" : {
          "postcode" : "sn131ep"
        }
      },
      {
        "_index" : "my_postcode_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.5200585,
        "_source" : {
          "postcode" : "sn13 1EP"
        }
      }
    ]
  }
}

Note that all three documents are also returned for the queries sn131ep and sn13 1ep.
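A simplified model of why both the spaced and the unspaced query reach all three documents is sketched below. Everything here is an assumption made for illustration: the token sets in `DOCS` are what the analyzer is expected to produce for the sample documents, and `search` is a hypothetical function that treats tokens sharing a position as alternatives while requiring every position to match (the AND operator).

```python
# Token sets each sample document is assumed to hold after analysis
# (worth confirming with the Analyze API on a real index).
DOCS = {
    "1": {"sn131ep", "sn13", "1ep"},   # "SN131EP"
    "2": {"sn13", "1ep"},              # "sn13 1EP"
    "3": {"sn131ep", "sn13", "1ep"},   # "sn131ep"
}

def search(query_positions):
    """Return ids of documents matching every position; a position matches on any of its tokens."""
    return [
        doc_id
        for doc_id, tokens in DOCS.items()
        if all(tokens & group for group in query_positions)
    ]

spaced = [{"sn13"}, {"1ep"}]              # "SN13 1EP": two positions, both required
unspaced = [{"sn131ep", "sn13", "1ep"}]   # "sn131ep": one position holding three tokens

print(search(spaced))
print(search(unspaced))
```

One consequence of this model, if it holds for your index, is that an unspaced query for a different postcode with the same outward part (for example sn139ed) would share the sn13 token with these documents, so it is worth testing that case explicitly.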

Additional notes:

You can use the Analyze API to find out which tokens are created for a particular piece of text:

POST my_postcode_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "sn139ed"
}

Below you can see which tokens are stored in the inverted index.

{
  "tokens" : [
    {
      "token" : "sn139ed",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "sn13",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "9ed",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    }
  ]
}

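Also:

You may also want to read about the Ngram Tokenizer. I'd suggest trying both solutions and seeing which one suits your input best.

Please test it and let us know if you have any questions.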