Problem Description
We have an existing search function that involves data across multiple tables in SQL Server. This causes a heavy load on our DB, so I'm trying to find a better way to search through this data (it doesn't change very often). I have been working with Logstash and Elasticsearch for about a week using an import containing 1.2 million records. My question is essentially, "how do I update existing documents using my 'primary key'"?
The CSV data file (pipe-delimited) looks like this:
369|90045|123 ABC ST|LOS ANGELES|CA
368|90045|PVKA0010|LA|CA
367|90012|20000 Venice Boulvd|Los Angeles|CA
365|90045|ABC ST 123|LOS ANGELES|CA
363|90045|ADHOCTESTPROPERTY|DALES|CA
My Logstash config looks like this:
input {
  stdin {
    type => "stdin-type"
  }
  file {
    path => ["C:/Data/sample/*"]
    start_position => "beginning"
  }
}

filter {
  csv {
    columns => ["property_id","postal_code","address_1","city","state_code"]
    separator => "|"
  }
}

output {
  elasticsearch {
    embedded => true
    index => "samples4"
    index_type => "sample"
  }
}
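For anyone trying to reproduce this: the pipeline can be run from the command line roughly as below. The file name sample.conf is just a placeholder for the config above, and the agent subcommand matches the Logstash 1.x releases that still support embedded => true.

# Run the pipeline (Logstash 1.x invocation; run from the Logstash install dir).
# On Windows, the bin\logstash.bat wrapper takes the same arguments.
bin/logstash agent -f sample.conf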
A document in Elasticsearch then looks like this:
{
  "_index": "samples4",
  "_type": "sample",
  "_id": "64Dc0_1eQ3uSln_k-4X26A",
  "_score": 1.4054651,
  "_source": {
    "message": [
      "369|90045|123 ABC ST|LOS ANGELES|CA\r"
    ],
    "@version": "1",
    "@timestamp": "2014-02-11T22:58:38.365Z",
    "host": "[host]",
    "path": "C:/Data/sample/sample.csv",
    "property_id": "369",
    "postal_code": "90045",
    "address_1": "123 ABC ST",
    "city": "LOS ANGELES",
    "state_code": "CA"
  }
}
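To see those auto-generated IDs without trawling the index, a search like this works (assuming the embedded node exposes HTTP on the default port 9200):

# Search the index by field value; the returned _id is the random one
# Elasticsearch assigned at index time.
curl "http://localhost:9200/samples4/sample/_search?q=property_id:369&pretty"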
I would like the unique ID in the _id field to be replaced with the value of property_id. The idea is that subsequent data files would contain updates. I don't need to keep previous versions, and there wouldn't be a case where we add or remove keys from a document.
The document_id setting for the elasticsearch output doesn't put that field's value into _id (it just put the literal string "property_id" in and stored/updated only one document). I know I'm missing something here. Am I just taking the wrong approach?
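In other words, an output block like the following sketch (my reconstruction of the failure mode; the index name is illustrative) sends every event to the single literal _id "property_id", so each row overwrites the last:

output {
  elasticsearch {
    embedded => true
    index => "samples5"           # illustrative index name
    index_type => "sample"
    document_id => "property_id"  # literal string, NOT a field reference
  }
}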
Edit: It's working!
Using @rutter's suggestion, I've updated the output config to this:
output {
  elasticsearch {
    embedded => true
    index => "samples6"
    index_type => "sample"
    document_id => "%{property_id}"
  }
}
Now documents update as expected when new files are dropped into the data folder; _id and property_id are the same value.
{
  "_index": "samples6",
  "_type": "sample",
  "_id": "351",
  "_score": 1,
  "_source": {
    "message": [
      "351|90045|Easy as 123 ST|LOS ANGELES|CA\r"
    ],
    "@version": "1",
    "@timestamp": "2014-02-12T16:12:52.102Z",
    "host": "TXDFWL3474",
    "path": "C:/Data/sample/sample_update_3.csv",
    "property_id": "351",
    "postal_code": "90045",
    "address_1": "Easy as 123 ST",
    "city": "LOS ANGELES",
    "state_code": "CA"
  }
}
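Because the ID is now deterministic, a document can also be fetched directly by its property_id (again assuming the embedded node listens on the default port 9200):

# Fetch the document whose _id came from property_id.
curl "http://localhost:9200/samples6/sample/351?pretty"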
Recommended Answer
Converting from a comment:
You can overwrite a document by sending another document with the same ID... but that might be tricky with your previous data, since you'll get randomized IDs by default.
You can set an ID using the output plugin's document_id field, but it takes a literal string, not a field name. To use a field's contents, you could use a sprintf format string, such as %{property_id}.
For example:
output {
  elasticsearch {
    ... other settings...
    document_id => "%{property_id}"
  }
}
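If you want to confirm the overwrite semantics outside Logstash, indexing the same ID twice shows the replacement. This sketch assumes a local node on port 9200; the index, type, and body are illustrative:

# First write creates the document (_version 1).
curl -XPUT "http://localhost:9200/samples6/sample/369" -d '{"address_1": "123 ABC ST"}'
# Re-sending the same _id replaces the document and bumps _version to 2.
curl -XPUT "http://localhost:9200/samples6/sample/369" -d '{"address_1": "124 ABC ST"}'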