Problem description
I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole RDD into HDFS, and later also read that RDD back in a Spark program. Is it possible to do that? And if so, how?
It is possible. An RDD offers the saveAsObjectFile and saveAsTextFile functions. With saveAsTextFile, each tuple is stored in its toString form, (value1,value2), so you can parse it back later. Reading can be done with the textFile function on SparkContext, followed by a .map to strip the parentheses and rebuild the pairs.
So:

Version 1:
rdd.saveAsTextFile("hdfs:///test1/")

// later, in another program
val newRdds = sparkContext.textFile("hdfs:///test1/part-*").map { line =>
  // strip the tuple's surrounding "(" and ")", then split on the first comma
  val body = line.stripPrefix("(").stripSuffix(")")
  val idx = body.indexOf(',')
  (body.take(idx).toLong, body.drop(idx + 1))
}
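If you control the write side, a slightly more robust variant of Version 1 (my own sketch, not part of the original answer; the hdfs:///test1-tsv/ path is just a placeholder) is to format each record yourself, for example tab-separated, so parsing no longer depends on Tuple2's toString:

// write one tab-separated line per record
rdd.map { case (k, v) => s"$k\t$v" }.saveAsTextFile("hdfs:///test1-tsv/")

// later: split on the first tab and rebuild the (Long, String) pairs
val parsed = sparkContext.textFile("hdfs:///test1-tsv/part-*").map { line =>
  val Array(k, v) = line.split("\t", 2)
  (k.toLong, v)
}

The main remaining caveat for any text-based route is that a String value containing a newline will break the line-oriented format; saveAsObjectFile (Version 2) does not have that problem.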
Version 2:
rdd.saveAsObjectFile("hdfs:///test1/")

// later, in another program - note that you get your tuples back out of the box :)
// objectFile is the matching reader for saveAsObjectFile
val newRdds = sparkContext.objectFile[(Long, String)]("hdfs:///test1/")
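For reference, here is a minimal, self-contained sketch of the Version 2 round trip (the object name and paths are illustrative, and the output directory must not already exist; in spark-shell, sc is provided for you):

import org.apache.spark.{SparkConf, SparkContext}

object RddRoundTrip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-roundtrip"))

    // build a small (Long, String) RDD and persist it
    val rdd = sc.parallelize(Seq((1L, "alpha"), (2L, "beta")))
    rdd.saveAsObjectFile("hdfs:///test1/")

    // read it back; the type parameter must match what was written
    val restored = sc.objectFile[(Long, String)]("hdfs:///test1/")
    restored.collect().foreach(println) // prints (1,alpha) and (2,beta)

    sc.stop()
  }
}

One design note: saveAsObjectFile relies on Java serialization, so the saved files are tied to the classes and Spark version that wrote them; the plain-text route of Version 1 is more portable across versions.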