Question
Many AWS reference architectures for serverless real-time analytics suggest pushing processed data from Lambda to S3 through Kinesis Firehose.
Why can't we push data from Lambda to S3 directly? Isn't it better to avoid the complexity and additional cost by skipping the intermediary Kinesis Firehose component? Is there any problem with Lambda writing real-time data directly to S3?
Recommended answer
Mainly because Firehose lets you batch the data. It can, for example, write only gzipped files of 128 MB of data to S3. It collects incoming data until a threshold is reached, writes the batch to S3, and then waits for the next data. If you let the Lambda write to S3 directly, you would have to do the batching yourself, which is pretty difficult if all you have are stateless Lambdas.
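To make the batching idea concrete, here is a minimal sketch of a buffer that collects records and flushes them as one gzipped blob once a size or age threshold is reached. It is an illustration only, not how Firehose is implemented: the class name `RecordBuffer`, its parameters, and the default thresholds are all assumptions made for this example, and the `flush` callback stands in for whatever would actually write to S3.

```python
import gzip
import time
from typing import Callable, List, Optional


class RecordBuffer:
    """Hypothetical helper: batches records and flushes them as one gzipped blob."""

    def __init__(
        self,
        flush: Callable[[bytes], None],           # stand-in for "write this object to S3"
        max_bytes: int = 128 * 1024 * 1024,       # illustrative size threshold
        max_age_seconds: float = 300.0,           # illustrative age threshold
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_bytes = max_bytes
        self._max_age = max_age_seconds
        self._clock = clock
        self._records: List[bytes] = []
        self._size = 0
        self._first_at: Optional[float] = None

    def add(self, record: bytes) -> None:
        """Buffer one record; flush if the batch is big enough or old enough."""
        if self._first_at is None:
            self._first_at = self._clock()
        self._records.append(record)
        self._size += len(record) + 1  # +1 for the newline delimiter
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._first_at >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Emit the buffered records as one newline-delimited, gzipped payload."""
        if not self._records:
            return
        payload = gzip.compress(b"\n".join(self._records) + b"\n")
        self._flush(payload)
        self._records = []
        self._size = 0
        self._first_at = None


if __name__ == "__main__":
    written: List[bytes] = []
    buf = RecordBuffer(written.append, max_bytes=10)
    for rec in (b"abcd", b"efgh", b"ij"):
        buf.add(rec)
    buf.flush()  # flush whatever is left at shutdown
    print(len(written), "objects written")
```

Note that this buffer lives in the memory of a single long-running process. Separate Lambda invocations cannot rely on sharing that memory, which is exactly why doing this inside stateless Lambdas is hard and why a buffering service in between is attractive.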
That said, this mainly applies if your data consists of many records or rows. If, on the other hand, your Lambda outputs large blobs of, say, 50 MB each, then you can and probably should write to S3 directly, because batching may not be possible or useful in that case.
Whether or not you should use Firehose simply depends on the kind of data and throughput you have and on what requirements there may be.
One problem with writing real-time data to S3 directly is that if you want to, for example, query it with Athena, you will run into a lot of trouble if you have millions of files that are a few bytes each instead of hundreds of files that are tens of MB each.
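A quick back-of-the-envelope calculation shows how fast the object count diverges. The volumes below are hypothetical numbers chosen for illustration (5 GB of data, 500-byte records, 50 MB batches), and `object_count` is a helper defined here, not part of any AWS SDK.

```python
def object_count(total_bytes: int, object_bytes: int) -> int:
    """Number of objects needed to hold total_bytes at object_bytes per object (rounded up)."""
    if total_bytes < 0 or object_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and object_bytes must be > 0")
    return -(-total_bytes // object_bytes)  # integer ceiling division


if __name__ == "__main__":
    total = 5_000_000_000      # hypothetical: 5 GB of data (decimal units)
    per_record = 500           # hypothetical: one S3 object per 500-byte record
    per_batch = 50_000_000     # hypothetical: one S3 object per 50 MB batch

    print("one object per record:", object_count(total, per_record))
    print("one object per batch: ", object_count(total, per_batch))
```

With these assumed numbers, the same data ends up as ten million tiny objects in one case and a hundred larger ones in the other, and a query engine has to list and open every one of them.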