Question
What is the best way to share Spark RDD data between two Spark jobs?
I have a case where Job 1, a Spark sliding-window streaming app, consumes data at regular intervals and creates RDDs. We do not want to persist this data to storage.
Job 2 is a query job that will access the RDDs created in Job 1 and generate reports.
I have seen a few answers suggesting Spark Job Server, but since it is an open-source project I am not sure whether it is a viable solution. Any pointers would be a great help.
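For reference, here is a minimal sketch of the kind of sliding-window streaming job I mean, written in Scala with Spark Streaming; the socket source, host/port, and window sizes are placeholder assumptions, not my actual setup:

```scala
// Minimal sketch of the sliding-window streaming app (Job 1); the socket
// source, host/port and window sizes below are placeholder assumptions.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SlidingWindowApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SlidingWindowApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Data is consumed at regular intervals; each slide produces a new RDD.
    val lines = ssc.socketTextStream("localhost", 9999)
    val windowed = lines.window(Seconds(60), Seconds(10))

    windowed.foreachRDD { rdd =>
      // The RDD exists only inside this application's SparkContext,
      // which is why Job 2 cannot reach it directly.
      println(s"Window contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```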
Thanks!
Recommended Answer
The short answer is that you can't share RDDs between jobs. The only way to share the data is to write it to HDFS and then read it from the other job. If speed is a concern and you want to maintain a constant stream of data, you can use HBase, which allows very fast access and processing from the second job.
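As a concrete illustration of the HDFS hand-off, here is a minimal sketch; the path hdfs:///shared/windows, the object name ReportJob, and the plain text-file format are illustrative assumptions, not part of the answer above:

```scala
// Sketch of sharing data through HDFS; the path and names are illustrative only.
import org.apache.spark.{SparkConf, SparkContext}

// Job 1 (streaming side) would write each window's RDD out, for example:
//   windowed.foreachRDD { (rdd, time) =>
//     if (!rdd.isEmpty()) rdd.saveAsTextFile(s"hdfs:///shared/windows/${time.milliseconds}")
//   }

// Job 2 (reporting side): a separate application reads whatever Job 1 has written.
object ReportJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReportJob"))
    val shared = sc.textFile("hdfs:///shared/windows/*") // pull the shared data back in
    println(s"Records available for reporting: ${shared.count()}")
    sc.stop()
  }
}
```

A similar hand-off works with HBase when lower latency matters: Job 1 writes each window into a table and Job 2 reads rows from that table instead of listing files on HDFS.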
To get a better idea, you should look here: