Do we need to checkpoint both readStream and writeStream of Kafka in Spark Structured Streaming?

This article covers the question "Do we need to checkpoint both readStream and writeStream of Kafka in Spark Structured Streaming?" and its recommended answer; it may serve as a reference for readers facing the same problem.

Problem description

Do we need to checkpoint both the readStream and the writeStream of Kafka in Spark Structured Streaming? When do we need to checkpoint both of these streams, and when only one of them?

Recommended answer

Checkpointing is needed so that a stream can save information about the data it has already processed and, in case of failure, Spark can recover from the last saved progress point. Processed means the data has been read from the source, (optionally) transformed, and finally written to a sink.

Therefore, there is no need to set up checkpointing for the reader and the writer separately, since after recovery it makes no sense to skip data that was only read but never written to a sink. Moreover, the checkpoint location can only be set as an option on the DataStreamWriter (returned by dataset.writeStream()), and it must be set before the stream is started.

Here is an example of a simple structured stream with checkpointing:

session
    .readStream()                                                  // no checkpoint option is given to the reader
    .schema(RecordSchema.fromClass(TestRecord.class))
    .csv("s3://test-bucket/input")
    .as(Encoders.bean(TestRecord.class))
    .writeStream()
    .outputMode(OutputMode.Append())
    .format("csv")
    .option("path", "s3://test-bucket/output")
    .option("checkpointLocation", "s3://test-bucket/checkpoint")  // the single checkpoint for the whole query
    .queryName("test-query")
    .start();
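
For the Kafka streams asked about in the question, the same rule applies: the query gets a single checkpoint location, set on the writer. Below is a minimal Kafka-to-Kafka sketch; the broker address, topic names and checkpoint path are placeholder values, not part of the original answer:

session
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker address
    .option("subscribe", "input-topic")                 // placeholder source topic
    .load()                                             // source offsets are tracked through the query checkpoint
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "output-topic")                     // placeholder sink topic
    .option("checkpointLocation", "s3://test-bucket/kafka-checkpoint")  // single checkpoint for the whole query
    .queryName("kafka-test-query")
    .start();

On restart, Spark reads the committed offsets from this checkpoint and resumes the query from the last processed batch.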

This concludes the article on whether we need to checkpoint both the readStream and the writeStream of Kafka in Spark Structured Streaming. We hope the recommended answer is helpful.
