Problem Description
I'm using pyspark to do some data cleaning. A very common operation is to take a small-ish subset of a file and export it for inspection:
(self.spark_context.textFile(old_filepath+filename)
.takeOrdered(100)
.saveAsTextFile(new_filepath+filename))
My problem is that takeOrdered is returning a list instead of an RDD, so saveAsTextFile doesn't work.
AttributeError: 'list' object has no attribute 'saveAsTextFile'
Of course, I could implement my own file writer. Or I could convert the list back into an RDD with parallelize. But I'm trying to be a Spark purist here.
Isn't there any way to return an RDD from takeOrdered or an equivalent function?
Recommended Answer
takeOrdered() is an action, not a transformation, so you can't have it return an RDD.

If ordering isn't necessary, the simplest alternative is sample().

If you do want ordering, you can try some combination of filter() and sortByKey() to reduce the number of elements and sort them. Or, as you suggested, re-parallelize the result of takeOrdered().