Can I customize partitioning in Kinesis Firehose before delivery to S3?

Problem Description

I have a Firehose stream that is intended to ingest millions of events from different sources and of different event types. The stream should deliver all data to one S3 bucket as a store of raw/unaltered data.

I was thinking of partitioning this data in S3 based on metadata embedded within the event message, such as event-source, event-type, and event-date.

However, Firehose follows its default partitioning based on record arrival time. Is it possible to customize this partitioning behavior to fit my needs?
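For context, the contrast below sketches the difference between the prefix Firehose writes by default (derived from the record's arrival time, in UTC) and the metadata-driven layout described above; the bucket name and field names are illustrative, not part of the original question:

```text
# Firehose default, arrival-time based:
s3://my-raw-events-bucket/2023/10/12/17/<delivery-stream-name>-<...>

# Desired, metadata based:
s3://my-raw-events-bucket/source=web/type=click/date=2023-10-12/<object>
```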

Recommended Answer

No. You cannot 'partition' based upon event content.

Some options are:

  • Send to separate Firehose streams
  • Send to a Kinesis Data Stream (instead of Firehose) and write your own custom Lambda function to process and save the data (see: AWS Developer Forums: Athena and Kinesis Firehose); a sketch of this approach follows the list
  • Use Kinesis Analytics to process the messages and 'direct' them to different Firehose streams
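
To make the second option concrete, here is a minimal sketch of a Lambda function that consumes a Kinesis Data Stream and writes each record to S3 under a metadata-derived prefix. The bucket name and the event_source/event_type/event_date fields are assumptions about the payload, not something defined in the question:

```python
import base64
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Hypothetical destination bucket for the raw data.
RAW_BUCKET = "my-raw-events-bucket"

def lambda_handler(event, context):
    """Consume a batch of Kinesis records and write each one to S3 under a
    prefix derived from metadata embedded in the event payload."""
    for record in event["Records"]:
        # Kinesis delivers the payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        body = json.loads(payload)

        # Assumed metadata fields; adjust to your event schema.
        source = body.get("event_source", "unknown-source")
        etype = body.get("event_type", "unknown-type")
        edate = body.get("event_date",
                         datetime.now(timezone.utc).strftime("%Y-%m-%d"))

        # Hive-style prefixes so Athena/EMR can treat them as partitions.
        key = (
            f"source={source}/type={etype}/date={edate}/"
            f"{record['kinesis']['sequenceNumber']}.json"
        )
        s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=payload)

    return {"records_written": len(event["Records"])}
```

Note that writing one object per record is simple but expensive at millions of events; in practice you would likely buffer records and write them in batches.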

If you are going to use the output with Amazon Athena or Amazon EMR, you could also consider converting it into Parquet format, which has much better performance. This would require post-processing the data in S3 as a batch, rather than converting the data as it arrives in the stream.
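
As a rough illustration of that batch post-processing step, the sketch below reads the JSON objects under one partition prefix and rewrites them as a single Parquet file. The bucket names and prefix are hypothetical, and it assumes pandas with pyarrow is available and that each raw object is newline-delimited JSON:

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Hypothetical buckets and prefix; adjust to your layout.
RAW_BUCKET = "my-raw-events-bucket"
CURATED_BUCKET = "my-curated-events-bucket"
PREFIX = "source=web/type=click/date=2023-10-12/"

def convert_prefix_to_parquet(raw_bucket, curated_bucket, prefix):
    """Read all JSON objects under one partition prefix, concatenate them,
    and write a single Parquet file to the curated bucket."""
    frames = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=raw_bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=raw_bucket, Key=obj["Key"])["Body"].read()
            frames.append(pd.read_json(io.BytesIO(body), lines=True))

    if not frames:
        return

    df = pd.concat(frames, ignore_index=True)
    buf = io.BytesIO()
    df.to_parquet(buf, index=False)  # requires pyarrow (or fastparquet)
    s3.put_object(
        Bucket=curated_bucket,
        Key=prefix + "part-0000.parquet",
        Body=buf.getvalue(),
    )

convert_prefix_to_parquet(RAW_BUCKET, CURATED_BUCKET, PREFIX)
```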
