Question
We are using Elasticsearch 6.8.4 and Flink 1.0.18.
We have an index with 1 shard and 1 replica in Elasticsearch, and I want to create a custom InputFormat that reads and writes data in Elasticsearch using the Apache Flink DataSet API with more than one input split, in order to achieve better performance. Is there any way I can achieve this?
Note: each document is large (almost 8 MB), and because of that size constraint I can read only 10 documents per request; we want to retrieve 500k records in total.
As I understand it, the parallelism should equal the number of shards/partitions of the data source. However, since we store only a small amount of data, we have kept the number of shards at 1, and the data is mostly static, growing only slightly each month.
Any help or example source code will be much appreciated.
Answer
You need to be able to generate queries to Elasticsearch that partition your source data into relatively equal chunks. You can then run your input source with a parallelism > 1 and have each sub-task read only its own part of the index data.
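One way to produce such equal chunks, even with a single shard, is Elasticsearch's sliced scroll feature (available in 6.x): each parallel sub-task runs the same scroll query with a different slice id, and Elasticsearch partitions the results across the slices. The sketch below only builds the per-split request bodies; the class and method names are illustrative, and wiring them into a Flink `InputFormat` (one slice per input split, executed in `open()`/`nextRecord()`) plus the actual HTTP scroll calls are left out.

```java
// Sketch: one sliced-scroll request body per Flink input split.
// Each sub-task i of N would send its own body and then follow the
// returned scroll id; Elasticsearch routes a disjoint subset of the
// index's documents to each slice.
public class SlicedScrollSketch {

    // Illustrative helper (not a real client API): builds the JSON body
    // for slice `sliceId` out of `maxSlices`, fetching `pageSize` docs
    // per scroll request (10 here, matching the size constraint above).
    static String buildSliceQuery(int sliceId, int maxSlices, int pageSize) {
        return String.format(
            "{\"slice\":{\"id\":%d,\"max\":%d},\"size\":%d,"
            + "\"query\":{\"match_all\":{}}}",
            sliceId, maxSlices, pageSize);
    }

    public static void main(String[] args) {
        int numSplits = 4;  // desired Flink source parallelism
        int pageSize = 10;  // docs per request (size constraint in the question)
        for (int i = 0; i < numSplits; i++) {
            // Each line is the body one sub-task would POST to
            // /index/_search?scroll=1m before iterating its scroll.
            System.out.println(buildSliceQuery(i, numSplits, pageSize));
        }
    }
}
```

Each split then pages through its slice independently, so the 10-documents-per-request limit applies per sub-task rather than to the whole job.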
This concludes the article on creating an Elasticsearch input format with Flink's RichInputFormat; we hope the answer above is helpful.