Problem description
We have an existing search function that involves data across multiple tables in SQL Server. This causes a heavy load on our DB, so I'm trying to find a better way to search through this data (it doesn't change very often). I have been working with Logstash and Elasticsearch for about a week using an import containing 1.2 million records. My question is essentially, "how do I update existing documents using my 'primary key'"?
The CSV data file (pipe-delimited) looks like this:
369|90045|123 ABC ST|LOS ANGELES|CA
368|90045|PVKA0010|LA|CA
367|90012|20000 Venice Boulvd|Los Angeles|CA
365|90045|ABC ST 123|LOS ANGELES|CA
363|90045|ADHOCTESTPROPERTY|DALES|CA
My Logstash config looks like this:
input {
  stdin {
    type => "stdin-type"
  }
  file {
    path => ["C:/Data/sample/*"]
    start_position => "beginning"
  }
}
filter {
  csv {
    columns => ["property_id","postal_code","address_1","city","state_code"]
    separator => "|"
  }
}
output {
  elasticsearch {
    embedded => true
    index => "samples4"
    index_type => "sample"
  }
}
A document in Elasticsearch then looks like this:
{
  "_index": "samples4",
  "_type": "sample",
  "_id": "64Dc0_1eQ3uSln_k-4X26A",
  "_score": 1.4054651,
  "_source": {
    "message": [
      "369|90045|123 ABC ST|LOS ANGELES|CA\r"
    ],
    "@version": "1",
    "@timestamp": "2014-02-11T22:58:38.365Z",
    "host": "[host]",
    "path": "C:/Data/sample/sample.csv",
    "property_id": "369",
    "postal_code": "90045",
    "address_1": "123 ABC ST",
    "city": "LOS ANGELES",
    "state_code": "CA"
  }
}
I would like the unique ID in the _id field to be replaced with the value of property_id. The idea is that subsequent data files will contain updates. I don't need to keep previous versions, and there won't be a case where we add or remove keys from a document.
The document_id setting for the elasticsearch output doesn't put that field's value into _id (it just put in the literal string "property_id" and only stored/updated one document). I know I'm missing something here. Am I just taking the wrong approach?
EDIT: WORKING!
Using @rutter's suggestion, I've updated the output config to this:
output {
  elasticsearch {
    embedded => true
    index => "samples6"
    index_type => "sample"
    document_id => "%{property_id}"
  }
}
Now documents are updated as expected when new files are dropped into the data folder. _id and property_id have the same value.
{
  "_index": "samples6",
  "_type": "sample",
  "_id": "351",
  "_score": 1,
  "_source": {
    "message": [
      "351|90045|Easy as 123 ST|LOS ANGELES|CA\r"
    ],
    "@version": "1",
    "@timestamp": "2014-02-12T16:12:52.102Z",
    "host": "TXDFWL3474",
    "path": "C:/Data/sample/sample_update_3.csv",
    "property_id": "351",
    "postal_code": "90045",
    "address_1": "Easy as 123 ST",
    "city": "LOS ANGELES",
    "state_code": "CA"
  }
}
Converting from comment:
You can overwrite a document by sending another document with the same ID... but that might be tricky with your previous data, since you'll get randomized IDs by default.
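The overwrite behaviour described above can be pictured with a toy in-memory model. This is only a sketch of the semantics, not Elasticsearch's implementation: `ToyIndex` and its `index` method are hypothetical names, and the "125 ABC ST" address is made-up update data.

```python
import uuid


class ToyIndex:
    """In-memory stand-in for an index: maps document ID -> document body."""

    def __init__(self):
        self.docs = {}

    def index(self, doc, doc_id=None):
        # No ID supplied: generate a random one, so every call adds a new document.
        if doc_id is None:
            doc_id = uuid.uuid4().hex
        # An ID that is already present: the stored document is replaced.
        self.docs[doc_id] = doc
        return doc_id


original = {"property_id": "369", "address_1": "123 ABC ST"}
updated = {"property_id": "369", "address_1": "125 ABC ST"}  # hypothetical update

# Random IDs: the update lands next to the original instead of replacing it.
random_ids = ToyIndex()
random_ids.index(original)
random_ids.index(updated)

# Explicit IDs taken from property_id: the update overwrites the original.
keyed = ToyIndex()
keyed.index(original, doc_id=original["property_id"])
keyed.index(updated, doc_id=updated["property_id"])

print(len(random_ids.docs), len(keyed.docs))
```

With random IDs the toy index ends up holding both versions; keyed by property_id it holds only the latest one, which is the behaviour the question asks for.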
You can set an ID using the output plugin's document_id field, but it takes a literal string, not a field name. To use a field's contents, you could use an sprintf format string, such as %{property_id}.
Something like this, for example:
output {
  elasticsearch {
    ... other settings...
    document_id => "%{property_id}"
  }
}
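To see why a bare field name collapses everything into a single document while the sprintf form does not, here is a rough Python imitation of the field-reference expansion. It is a sketch only: `expand` is a hypothetical helper, not Logstash code, and it assumes that a reference to a missing field is left unchanged.

```python
import re

# Matches sprintf-style field references such as %{property_id}.
FIELD_REF = re.compile(r"%\{([^}]+)\}")


def expand(template, event):
    """Replace each %{name} with event[name]; unknown references stay as written."""
    return FIELD_REF.sub(lambda m: str(event.get(m.group(1), m.group(0))), template)


# Two events built from the sample CSV rows in the question.
events = [
    {"property_id": "369", "city": "LOS ANGELES"},
    {"property_id": "368", "city": "LA"},
]

# A bare name contains no reference, so every event gets the same literal ID.
literal_ids = {expand("property_id", e) for e in events}

# The sprintf form is resolved per event, giving one ID per property.
field_ids = {expand("%{property_id}", e) for e in events}

print(literal_ids, field_ids)
```

The first set contains only the string "property_id", which matches the single stored document the question describes; the second contains one ID per row.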