What is the difference between Apache Flink and Apache Storm?

Problem description

Flink has been compared to Spark, which, as I see it, is the wrong comparison because it compares a windowed event-processing system against micro-batching. Similarly, it does not make that much sense to me to compare Flink to Samza. In both cases, a real-time event-processing strategy is compared against a batched one, even if at a smaller "scale" in the case of Samza. But I would like to know how Flink compares to Storm, which seems conceptually much more similar to it.

I have found this (Slide #4) documenting the main difference as "adjustable latency" for Flink. Another hint seems to be an article by Silicon Angle that suggests that Flink better integrates into a Spark or HadoopMR world, but no actual details are mentioned or referenced. Finally, Fabian Hueske himself notes in an interview that "Compared to Apache Storm, the stream analysis functionality of Flink offers a high-level API and uses a more light-weight fault tolerance strategy to provide exactly-once processing guarantees."

All that is a bit sparse for me and I do not quite get the point. Can someone explain which problem(s) with stream processing in Storm are exactly solved by Flink? What is Hueske referring to with the API issues and the "more light-weight fault tolerance strategy"?

Answer

Disclaimer: I'm an Apache Flink committer and PMC member and only familiar with Storm's high-level design, not its internals.

Apache Flink is a framework for unified stream and batch processing. Flink's runtime natively supports both domains due to pipelined data transfers between parallel tasks, including pipelined shuffles. Records are shipped immediately from producing tasks to receiving tasks (after being collected in a buffer for network transfer). Batch jobs can optionally be executed using blocking data transfers.
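The difference between pipelined and blocking transfers can be illustrated with a plain-Python sketch (this is a conceptual model of the execution order only, not Flink's runtime code):

```python
# Conceptual sketch in plain Python (not Flink code): the interleaving of a
# producing and a consuming task under pipelined vs. blocking data transfer.

def run_pipelined(n):
    """Pipelined transfer: each record is shipped to the consumer as soon
    as it is produced, so production and consumption interleave."""
    log = []
    def producer():
        for i in range(n):
            log.append(f"produce {i}")
            yield i
    for record in producer():
        log.append(f"consume {record}")
    return log

def run_blocking(n):
    """Blocking transfer (batch-style): the producer materializes its whole
    output before the consumer starts, as in a classic batch shuffle."""
    log = []
    def producer():
        for i in range(n):
            log.append(f"produce {i}")
            yield i
    records = list(producer())  # producer runs to completion first
    for record in records:
        log.append(f"consume {record}")
    return log

print(run_pipelined(2))  # ['produce 0', 'consume 0', 'produce 1', 'consume 1']
print(run_blocking(2))   # ['produce 0', 'produce 1', 'consume 0', 'consume 1']
```

In the pipelined case the consumer sees record 0 before record 1 even exists, which is what enables low-latency streaming; the blocking case only starts consuming after the full input is materialized.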

Apache Spark is a framework that also supports batch and stream processing. Flink's batch API looks quite similar to Spark's and addresses similar use cases, but the two differ in their internals. For streaming, both systems follow very different approaches (mini-batches vs. streaming), which makes them suitable for different kinds of applications. I would say comparing Spark and Flink is valid and useful; however, Spark is not the stream processing engine most similar to Flink.

Coming to the original question, Apache Storm is a data stream processor without batch capabilities. In fact, Flink's pipelined engine internally looks a bit similar to Storm, i.e., the interfaces of Flink's parallel tasks are similar to Storm's bolts. Storm and Flink have in common that they aim for low-latency stream processing by pipelined data transfers. However, Flink offers a more high-level API compared to Storm. Instead of implementing a bolt's functionality with one or more readers and collectors, Flink's DataStream API provides functions such as Map, GroupBy, Window, and Join. A lot of this functionality must be implemented manually when using Storm.

Another difference is the processing semantics. Storm guarantees at-least-once processing while Flink provides exactly-once. The implementations which give these processing guarantees differ quite a bit. While Storm uses record-level acknowledgments, Flink uses a variant of the Chandy-Lamport algorithm. In a nutshell, data sources periodically inject markers into the data stream. Whenever an operator receives such a marker, it checkpoints its internal state. When a marker has been received by all data sinks, the marker (and all records which have been processed before it) are committed. In case of a failure, all source operators are reset to their state as of the last committed marker and processing is continued. This marker-checkpoint approach is more lightweight than Storm's record-level acknowledgments. This slide set and the corresponding talk discuss Flink's stream processing approach, including fault tolerance, checkpointing, and state handling.
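The marker mechanism described above can be sketched as a toy simulation in plain Python (a conceptual illustration only, not Flink's implementation; a single operator keeps a running sum as its state):

```python
# Toy simulation of marker-based checkpointing (conceptual only, not
# Flink's implementation). The source stream contains data records and
# periodic markers; the operator snapshots its state at each marker and,
# after a failure, rewinds to the last snapshot and replays from there.

MARKER = "marker"

def run(stream, fail_at=None):
    """Process `stream` (a list of ints and MARKERs), keeping a running sum
    as operator state. If fail_at is set, simulate one failure at that
    position and recover from the last completed checkpoint."""
    state = 0
    snapshot = (0, 0)  # (position to resume from, checkpointed state)
    pos = 0
    while pos < len(stream):
        if fail_at is not None and pos == fail_at:
            pos, state = snapshot   # rewind the source and restore state
            fail_at = None          # fail only once
            continue
        item = stream[pos]
        if item == MARKER:
            snapshot = (pos + 1, state)  # checkpoint internal state
        else:
            state += item
        pos += 1
    return state

stream = [1, 2, MARKER, 3, 4, MARKER, 5]
print(run(stream))             # 15: no failure
print(run(stream, fail_at=4))  # 15: records after the last marker are
                               # replayed, but the state was rewound too,
                               # so each record counts exactly once
```

The key point is that state and input position are rewound together, so replayed records do not double-count; this is what yields exactly-once state semantics without per-record acknowledgments.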

Storm also offers an exactly-once, high-level API called Trident. However, Trident is based on mini-batches and is hence more similar to Spark than to Flink.

Flink's adjustable latency refers to the way that Flink sends records from one task to the other. As said before, Flink uses pipelined data transfers and forwards records as soon as they are produced. For efficiency, these records are collected in a buffer, which is sent over the network once it is full or once a certain time threshold is reached. This threshold controls the latency of records because it specifies the maximum amount of time that a record will stay in a buffer without being sent to the next task. However, it cannot be used to give hard guarantees about the time it takes for a record from entering to leaving a program, because this also depends on the processing time within tasks and the number of network transfers, among other things.
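As a rough model of this flush rule (plain Python with hypothetical numbers, not Flink's actual network stack), one can see how the timeout caps the per-record buffering delay:

```python
# Rough model of the buffer-flush rule described above (conceptual only):
# a buffer is sent downstream when it is full, or when the oldest record
# in it has waited `timeout` time units.

def buffering_delays(arrivals, buffer_size, timeout):
    """Given sorted record arrival times, return each record's buffering
    delay (send time minus arrival time)."""
    sent, buf = [], []          # buf holds arrival times of buffered records
    for t in arrivals:
        # the timeout of the oldest buffered record may fire before t
        if buf and t >= buf[0] + timeout:
            deadline = buf[0] + timeout
            sent.extend(deadline for _ in buf)
            buf = []
        buf.append(t)
        if len(buf) == buffer_size:   # buffer full: ship immediately
            sent.extend(t for _ in buf)
            buf = []
    if buf:                           # remaining records go out on timeout
        sent.extend(buf[0] + timeout for _ in buf)
    return [s - a for s, a in zip(sent, arrivals)]

# Small buffer: records leave as soon as the buffer fills.
print(buffering_delays([0, 1, 2, 3], buffer_size=2, timeout=10))    # [1, 0, 1, 0]
# Large buffer: the timeout caps the delay at 10 time units.
print(buffering_delays([0, 1, 2, 3], buffer_size=100, timeout=10))  # [10, 9, 8, 7]
```

A small timeout trades throughput for latency (many near-empty buffers on the wire); a large timeout does the opposite. Note that this bounds only the buffering delay, not the end-to-end latency, as stated above.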
