Question
I have an application in SparkSQL which returns a large number of rows that are very difficult to fit in memory, so I will not be able to use the collect function on the DataFrame. Is there a way I can get all these rows as an Iterable instead of the entire result as a list?
Note: I am executing this SparkSQL application using yarn-client.
Answer
Generally speaking, transferring all the data to the driver looks like a pretty bad idea, and most of the time there is a better solution out there, but if you really want to go this way you can use the toLocalIterator method on an RDD:
val df: org.apache.spark.sql.DataFrame = ???
df.cache // Optional, to avoid repeated computation, see docs for details
val iter: Iterator[org.apache.spark.sql.Row] = df.rdd.toLocalIterator