Reading from a Kafka topic in a Spark batch job

Problem description

I'm writing a Spark (v1.6.0) batch job that reads from a Kafka topic.
For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD; however, I need to set the offsets for all the partitions, and I also need to store them somewhere (ZK? HDFS?) to know where the next batch job should start from.
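On the "store them somewhere" part: a minimal sketch of keeping per-partition offsets in a small text file on HDFS between runs (the path, the line format, and the OffsetStore name are all hypothetical choices for illustration; ZooKeeper or a database would work just as well):

```scala
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object OffsetStore {
  // Hypothetical location; any durable store would do.
  private val offsetsPath = new Path("hdfs:///jobs/kafka-batch/offsets")

  // Save one "topic partition untilOffset" line per partition.
  def save(offsets: Map[(String, Int), Long], conf: Configuration): Unit = {
    val fs = FileSystem.get(conf)
    val out = fs.create(offsetsPath, true) // overwrite the previous run's file
    try {
      offsets.foreach { case ((topic, partition), offset) =>
        out.write(s"$topic $partition $offset\n".getBytes(StandardCharsets.UTF_8))
      }
    } finally out.close()
  }

  // Load the offsets written by the previous run (empty map on the first run).
  def load(conf: Configuration): Map[(String, Int), Long] = {
    val fs = FileSystem.get(conf)
    if (!fs.exists(offsetsPath)) Map.empty
    else {
      val in = scala.io.Source.fromInputStream(fs.open(offsetsPath))
      try {
        in.getLines().map { line =>
          val Array(topic, partition, offset) = line.split(" ")
          (topic, partition.toInt) -> offset.toLong
        }.toMap
      } finally in.close()
    }
  }
}
```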

What is the right approach to read from Kafka in a batch job?

I'm also thinking about writing a streaming job instead, which reads with auto.offset.reset=smallest and saves its checkpoint to HDFS, and then in the next run starts from that.

But in this case, how can I just fetch once and stop streaming after the first batch?
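The answer below sidesteps this, but for completeness: if you did go the streaming route, one common pattern (a sketch against the Spark 1.6 streaming API, not something from the answer) is to register a StreamingListener and stop the context from the main thread once the first batch completes:

```scala
import java.util.concurrent.Semaphore

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

object OneBatchStreamingJob {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("one-batch"), Seconds(30))
    val firstBatchDone = new Semaphore(0)

    // Signal the main thread once the first batch has been fully processed.
    ssc.addStreamingListener(new StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit =
        firstBatchDone.release()
    })

    // ... create the Kafka DStream (auto.offset.reset=smallest) and its output here ...

    ssc.start()
    firstBatchDone.acquire() // block until the first batch completes
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```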

Recommended answer

createRDD is the right approach for reading a batch from Kafka.
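A minimal sketch of such a batch job, using the Kafka 0.8 direct API that ships with Spark 1.6 (the broker list, topic name, and hard-coded offsets are placeholders; in a real job the fromOffsets would come from wherever the previous run stored them):

```scala
import kafka.serializer.StringDecoder

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object KafkaBatchJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch"))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker

    // One OffsetRange per partition: read [fromOffset, untilOffset).
    val offsetRanges = Array(
      OffsetRange("my-topic", partition = 0, fromOffset = 0L, untilOffset = 1000L),
      OffsetRange("my-topic", partition = 1, fromOffset = 0L, untilOffset = 1000L)
    )

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    println(s"read ${rdd.count()} messages")
    sc.stop()
  }
}
```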

To query for info about the latest/earliest available offsets, look at the KafkaCluster.scala methods getLatestLeaderOffsets and getEarliestLeaderOffsets. That file was private, but it should be public in the latest versions of Spark.
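A sketch of querying them (note: because KafkaCluster is private[spark] in 1.6, this only compiles when placed in an org.apache.spark.streaming.kafka package, or against a Spark version where the class is public; broker and topic are placeholders):

```scala
package org.apache.spark.streaming.kafka // needed while KafkaCluster is private[spark]

import kafka.common.TopicAndPartition

object OffsetProbe {
  def main(args: Array[String]): Unit = {
    val kc = new KafkaCluster(Map("metadata.broker.list" -> "broker1:9092"))

    // Discover the topic's partitions, then ask their leaders for offsets.
    val partitions: Set[TopicAndPartition] =
      kc.getPartitions(Set("my-topic")).right.get

    val earliest = kc.getEarliestLeaderOffsets(partitions).right.get
    val latest = kc.getLatestLeaderOffsets(partitions).right.get

    partitions.foreach { tp =>
      println(s"${tp.topic}-${tp.partition}: " +
        s"earliest=${earliest(tp).offset}, latest=${latest(tp).offset}")
    }
  }
}
```

The earliest/latest pairs can be turned directly into the OffsetRange array that createRDD expects.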
