Return an RDD from takeOrdered, instead of a list

Problem description

I'm using pyspark to do some data cleaning. A very common operation is to take a small-ish subset of a file and export it for inspection:

(self.spark_context.textFile(old_filepath+filename)
    .takeOrdered(100)
    .saveAsTextFile(new_filepath+filename))

My problem is that takeOrdered is returning a list instead of an RDD, so saveAsTextFile doesn't work.

AttributeError: 'list' object has no attribute 'saveAsTextFile'

Of course, I could implement my own file writer. Or I could convert the list back into an RDD with parallelize. But I'm trying to be a spark purist here.
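
For reference, the parallelize workaround mentioned above might look roughly like the sketch below. The function name `save_smallest` and its parameters are hypothetical; `sc` is assumed to be an existing SparkContext (the `self.spark_context` in the snippet above), so nothing is imported here.

```python
def save_smallest(sc, src_path, dst_path, n=100):
    """Write the n smallest lines of src_path to dst_path as text.

    Hypothetical helper: `sc` is assumed to be a live pyspark SparkContext.
    """
    # takeOrdered is an action, so it returns a plain Python list
    # on the driver rather than an RDD.
    smallest = sc.textFile(src_path).takeOrdered(n)

    # parallelize turns the list back into an RDD. Using a single slice
    # keeps the output in one part file, which suits a small sample.
    sc.parallelize(smallest, 1).saveAsTextFile(dst_path)
    return smallest
```

This round-trips the data through the driver, which is harmless for a hundred lines but is exactly the detour the question hopes to avoid.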

Isn't there any way to return an RDD from takeOrdered or an equivalent function?
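
One possible way to stay entirely within RDD transformations, offered only as a sketch and not as a built-in equivalent of takeOrdered, is to sort the RDD, number the elements, and keep the first n. The helper name `smallest_as_rdd` is hypothetical; `sortBy`, `zipWithIndex`, `filter` and `keys` are standard pyspark RDD methods.

```python
def smallest_as_rdd(rdd, n=100):
    """Return an RDD holding the n smallest elements of rdd, in order.

    Hypothetical helper: `rdd` is assumed to be a pyspark RDD of
    comparable elements, such as the lines returned by textFile.
    """
    return (
        rdd.sortBy(lambda line: line)          # full distributed sort
        .zipWithIndex()                        # (element, position) pairs
        .filter(lambda pair: pair[1] < n)      # keep the first n positions
        .keys()                                # drop the positions again
    )
```

The result can be passed straight to saveAsTextFile. Note that this sorts the whole dataset, so it is likely to cost more than takeOrdered, and the output may be spread over several part files.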

Recommended answer