This article looks at how to handle very high indexing throughput in ElasticSearch; the question and the recommended answer below may be a useful reference for anyone facing a similar problem.

Problem description

I'm benchmarking ElasticSearch for very high indexing throughput purposes.

My current goal is to be able to index 3 billion (3,000,000,000) documents in a matter of hours. For that purpose, I currently have 3 Windows Server machines, with 16GB RAM and 8 processors each. The documents being inserted have a very simple mapping, containing only a handful of numerical non-analyzed fields (_all is disabled).
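
For illustration, a mapping along these lines might look roughly as follows on ES 1.x. The index, type, and field names here are invented, and the official Python client is shown purely as a stand-in for NEST:

```python
from elasticsearch import Elasticsearch  # official Python client, shown instead of NEST

es = Elasticsearch(["http://localhost:9200"])

# Hypothetical index/type/field names; the point is a handful of numeric
# fields (numerics are never analyzed) and _all disabled to cut indexing work.
es.indices.create(
    index="metrics",
    body={
        "mappings": {
            "event": {                       # mapping type, required on ES 1.x
                "_all": {"enabled": False},  # don't build the catch-all field
                "properties": {
                    "sensor_id": {"type": "long"},
                    "value":     {"type": "double"},
                    "timestamp": {"type": "long"},
                },
            }
        }
    },
)
```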

I am able to reach roughly 120,000 index requests per second (monitored with BigDesk) using this relatively modest rig, and I'm confident that the throughput can be increased further. I'm using a number of .NET NEST clients to send the bulk index requests, with 1,500 index operations per bulk request.
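
Purely for illustration, this batching looks roughly like the following with the Python client's bulk helper (index, type, and document fields are made up; chunk_size mirrors the 1,500 operations per request):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def actions(docs):
    # One bulk action per document; the helper packs them into requests.
    for doc in docs:
        yield {"_index": "metrics", "_type": "event", "_source": doc}

docs = ({"sensor_id": i, "value": i * 0.5, "timestamp": 1400000000 + i}
        for i in range(1000000))

# chunk_size=1500 mirrors the 1,500 index operations per bulk request;
# several of these loops would run in parallel, one per client.
helpers.bulk(es, actions(docs), chunk_size=1500)
```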

Unfortunately, the throughput of 120k requests per second does not last very long, and the rate diminishes gradually, dropping to ~15k after a couple of hours.

Monitoring the machines reveals that the CPUs are not the bottleneck. However, physical disk (not SSD) idle time seems to be dropping on all machines, reaching less than 15% average idle time.

Setting refresh_interval to 60s, then to 300s, and finally to 15m, didn't seem to help much. Watching a single translog in a single shard showed that the translog is flushed every 30 minutes, before reaching 200MB.
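
For reference, refresh_interval is a dynamic index setting, and the 30-minute / 200MB flush behavior matches the ES 1.x translog defaults, which can also be adjusted per index. A sketch of setting those knobs with the Python client (the index name is hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Both settings are dynamic, so they can be changed while indexing is running.
es.indices.put_settings(
    index="metrics",
    body={
        "index": {
            "refresh_interval": "60s",                 # or "300s", "15m", or "-1" to disable
            "translog.flush_threshold_size": "512mb",  # raise the ES 1.x 200mb default
        }
    },
)
```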

I have tried using two sharding strategies:

  1. A single index with 60 shards (no replicas).
  2. Three indices, with 20 shards each (no replicas).

Both attempts resulted in a rather similar experience, which I guess makes sense since it's the same total number of shards.
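
For completeness, the two layouts above differ only in how the 60 primary shards are spread across indices, which is fixed at index-creation time; either way the cluster ends up with 60 primaries. A sketch with the Python client (index names are invented):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Strategy 1: one index holding all 60 primary shards, no replicas.
es.indices.create(index="docs",
                  body={"settings": {"number_of_shards": 60,
                                     "number_of_replicas": 0}})

# Strategy 2: three indices with 20 primary shards each, no replicas.
for name in ("docs-1", "docs-2", "docs-3"):
    es.indices.create(index=name,
                      body={"settings": {"number_of_shards": 20,
                                         "number_of_replicas": 0}})
```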

Looking at the segments, I can see that most shards have ~30 committed segments, and a similar number of searchable segments as well. Segment size varies. At one point, an attempt to optimize the index with max_num_segments=1 seemed to help a little after it finished (it took a long while).
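
The optimize call mentioned above looks roughly like this with the 1.x-era Python client (later client versions renamed it to forcemerge); the index name is hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Merge each shard down to a single segment. This is I/O-heavy and best run
# after a bulk load finishes, not while documents are still streaming in.
es.indices.optimize(index="metrics", max_num_segments=1)
```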

At any time, starting the whole ingestion process over from the beginning, after deleting the used indices and creating new ones, results in the same behavior: initially high index throughput that gradually diminishes, long before reaching the goal of 3 billion documents. The index size at that point is about 120GB.

I'm using ElasticSearch version 1.4. Xms and Xmx are configured to 8192MB, 50% of the available memory. The indexing buffer is set to 30%.
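
Unlike the index settings above, the heap size and indexing buffer are node-level configuration on ES 1.x rather than API calls. As a sketch, the values from this setup would be expressed roughly like this:

```yaml
# elasticsearch.yml (per node)
indices.memory.index_buffer_size: 30%   # share of the heap reserved for indexing buffers

# The heap itself is set outside this file on ES 1.x, e.g. via the environment:
#   ES_HEAP_SIZE=8g    -> Xms = Xmx = 8192MB, about 50% of the 16GB machines
```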

My questions are as follows:

  1. Assuming the disk is currently this rig's bottleneck, is this phenomenon of gradually increasing disk utilization normal? If not, what can be done to mitigate it?
  2. Is there any fine-tuning I can do to increase indexing throughput, and should I? Or should I simply scale out?

Recommended answer

Long story short, I ended up with 5 virtual Linux machines, 8 CPUs and 16GB RAM each, using Puppet to deploy Elasticsearch. My documents got a little bigger, but so did the throughput rate (slightly). I was able to reach 150K index requests per second on average, indexing 1 billion documents in 2 hours. Throughput is not constant, and I observed diminishing-throughput behavior similar to before, but to a lesser extent. Since I will be using daily indices for the same amount of data, I would expect these performance metrics to be roughly similar every day.
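
A small sketch of the daily-index approach mentioned above (the naming pattern and prefix are my own assumption, not from the answer):

```python
import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def daily_index(prefix="docs"):
    # e.g. "docs-2015.03.17"; each day writes to a fresh, empty index, so the
    # per-index size and segment count reset instead of growing without bound.
    return "%s-%s" % (prefix, datetime.date.today().strftime("%Y.%m.%d"))

es.indices.create(index=daily_index(),
                  body={"settings": {"number_of_shards": 20,
                                     "number_of_replicas": 0}},
                  ignore=400)  # ignore "already exists" if created earlier today
```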

The transition from Windows machines to Linux was primarily due to convenience and compliance with IT conventions. Though I don't know for sure, I suspect the same results could be achieved on Windows as well.

In several of my trials I attempted indexing without specifying document IDs, as Christian Dahlqvist suggested. The results were astonishing. I observed a significant throughput increase, reaching 300k and higher in some cases. The conclusion is obvious: do not specify document IDs unless you absolutely have to.
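
In terms of the bulk actions, the difference is simply whether each action carries an _id. A minimal sketch of the two variants (index, type, and field names are hypothetical):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

doc = {"sensor_id": 42, "value": 3.14, "timestamp": 1400000000}

# Slower: a caller-supplied _id makes each write an index-or-overwrite, so
# Elasticsearch must first check whether that id already exists, and the
# lookup gets more expensive as the shards accumulate data.
with_id = {"_index": "docs", "_type": "event",
           "_id": "sensor-42-1400000000", "_source": doc}

# Faster: omit _id and let Elasticsearch auto-generate one, so the write can
# be treated as a pure append.
without_id = {"_index": "docs", "_type": "event", "_source": doc}

helpers.bulk(es, [without_id], chunk_size=1500)
```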

Also, I'm using fewer shards per machine, which also contributed to the throughput increase.
