database - Elasticsearch unique filter on a field not working (duplicates are inserted)

I have changed the contactNumber field to use the unique filter

by updating the index settings as follows:

curl -XPUT localhost:9200/test-index2/_settings -d '
{
  "index": {
    "analysis": {
      "analyzer": {
        "unique_keyword_analyzer": {
          "only_on_same_position": "true",
          "filter": "unique"
        }
      }
    }
  },
  "mappings": {
    "business": {
      "properties": {
        "contactNumber": {
          "analyzer": "unique_keyword_analyzer",
          "type": "string"
        }
      }
    }
  }
}'

A sample item looks like this:

doc_type:"Business"

contactNumber:"(+12)415-3499"
name:"Sam's Pizza"
address:"Somewhere on earth"

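The filter is not working, because duplicates are still being inserted. I want no two documents to have the same contactNumber.

In the settings above I also set only_on_same_position -> true, so that existing duplicate values would be truncated/removed.

What am I doing wrong in the settings?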

Best answer

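This is something Elasticsearch cannot give you out of the box... you need to provide this uniqueness functionality in your application. The only idea I can think of is to use the phone number as the _id of the document itself. Whenever you insert/update something, ES will use contactNumber as the _id and will either associate the document with the one that already exists or create a new one.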
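Some background on why the settings in the question do not help: as far as I understand the `unique` token filter, it only removes repeated tokens inside the token stream of a single field value during analysis, and `only_on_same_position` narrows that to duplicates at the same position. It never compares one document against another. The sketch below is a simplified model of that behavior in plain Python, not Elasticsearch code; the `unique_filter` function and the `(term, position)` token representation are made up for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (term, position) -- a simplified stand-in for an analyzed token


def unique_filter(tokens: List[Token], only_on_same_position: bool = False) -> List[Token]:
    """Model of a per-field 'unique' token filter: drops repeated tokens
    within ONE token stream. It has no view of other documents."""
    seen = set()
    kept = []
    for term, position in tokens:
        # With only_on_same_position, a token is a duplicate only if the
        # same term was already seen at the same position.
        key = (term, position) if only_on_same_position else term
        if key in seen:
            continue
        seen.add(key)
        kept.append((term, position))
    return kept


# Two separate documents with the same phone number: each one is analyzed
# on its own, so both keep their token and both get indexed.
doc_a = unique_filter([("(+12)415-3499", 0)])
doc_b = unique_filter([("(+12)415-3499", 0)])
```

Both `doc_a` and `doc_b` still contain the phone number, which is the duplicate-insert behavior described in the question.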

For example:

PUT /test-index2
{
  "mappings": {
    "business": {
      "_id": {
        "path": "contactNumber"
      },
      "properties": {
        "contactNumber": {
          "type": "string",
          "analyzer": "keyword"
        },
        "address": {
          "type": "string"
        }
      }
    }
  }
}

Then you index something:

POST /test-index2/business
{
  "contactNumber": "(+12)415-3499",
  "address": "whatever 123"
}

Retrieve it:

GET /test-index2/business/_search
{
  "query": {
    "match_all": {}
  }
}

It looks like this:

   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "test-index2",
            "_type": "business",
            "_id": "(+12)415-3499",
            "_score": 1,
            "_source": {
               "contactNumber": "(+12)415-3499",
               "address": "whatever 123"
            }
         }
      ]
   }

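You can see here that the document's _id is the phone number itself. Now suppose you want to change it or insert another document (the address is different and there is a new field, whatever_field, but the contactNumber is the same):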

POST /test-index2/business
{
  "contactNumber": "(+12)415-3499",
  "address": "whatever 123 456",
  "whatever_field": "whatever value"
}

Elasticsearch "updates" the existing document and responds with:

{
   "_index": "test-index2",
   "_type": "business",
   "_id": "(+12)415-3499",
   "_version": 2,
   "created": false
}

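created is false, which means the document was updated rather than created. _version is 2, which again indicates that the document was updated. _id is the phone number itself, which shows that this is the document that was updated.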
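To make the `created` / `_version` behavior concrete, here is a toy in-memory model of "index by `_id`" in Python. It is only a sketch of the semantics described above, not how Elasticsearch is implemented; the `TinyIndex` class and its `id_field` parameter are hypothetical names. One thing it assumes, which matches my understanding of the index operation, is that re-indexing under an existing `_id` replaces the whole stored document rather than merging fields.

```python
from typing import Any, Dict


class TinyIndex:
    """Toy model: documents are keyed by _id, so one _id means one document."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, Any]] = {}
        self.versions: Dict[str, int] = {}

    def index(self, doc: Dict[str, Any], id_field: str = "contactNumber") -> Dict[str, Any]:
        doc_id = doc[id_field]            # the _id is taken from the document itself
        created = doc_id not in self.docs
        self.docs[doc_id] = dict(doc)     # full replacement, not a field-level merge
        self.versions[doc_id] = self.versions.get(doc_id, 0) + 1
        return {"_id": doc_id, "_version": self.versions[doc_id], "created": created}


idx = TinyIndex()
first = idx.index({"contactNumber": "(+12)415-3499", "address": "whatever 123"})
second = idx.index({
    "contactNumber": "(+12)415-3499",
    "address": "whatever 123 456",
    "whatever_field": "whatever value",
})
```

Here `first` reports a newly created document at version 1 and `second` reports an update at version 2, while `idx.docs` still holds a single entry for that phone number.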

Looking at the index again, ES now stores the following:

  "hits": [
     {
        "_index": "test-index2",
        "_type": "business",
        "_id": "(+12)415-3499",
        "_score": 1,
        "_source": {
           "contactNumber": "(+12)415-3499",
           "address": "whatever 123 456",
           "whatever_field": "whatever value"
        }
     }
  ]

So the new field is there, the address has changed, and contactNumber and _id are exactly the same.
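A caveat and a variation, both hedged: as far as I know, the `_id.path` mapping option was deprecated in Elasticsearch 1.5 and removed in 2.0, so on later versions the application has to put the ID into the request URL itself. If you would rather have a second insert rejected than silently overwrite the first, the index API accepts `op_type=create`, which fails with a version conflict (HTTP 409) when a document with that `_id` already exists. The Python sketch below only builds the request path and sends nothing; `create_only_path` is a hypothetical helper, and the `/index/type/id` layout follows the examples above (newer Elasticsearch versions have dropped mapping types, so the path would differ there).

```python
from urllib.parse import quote


def create_only_path(index: str, doc_type: str, doc_id: str) -> str:
    """Build a path for a create-only index request with an explicit _id.

    The phone number contains characters such as '(' and '+', so it is
    percent-encoded before being placed in the URL.
    """
    encoded_id = quote(doc_id, safe="")
    return "/{}/{}/{}?op_type=create".format(index, doc_type, encoded_id)


path = create_only_path("test-index2", "business", "(+12)415-3499")
```

A `PUT` to that path with the document as the body would then create the document the first time and be refused the second time, leaving the decision about what to do with the duplicate to the application.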

Regarding "database - Elasticsearch unique filter on a field not working (duplicates are inserted)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31400041/