Problem Description
I have an application in SparkSQL that returns a large number of rows which do not fit in memory, so I cannot use the collect function on the DataFrame. Is there a way I can get all these rows as an Iterator instead of fetching the entire result as a list?
I am executing this SparkSQL application using yarn-client.
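For reference, the pattern that does not scale here is collecting the whole result at once (a minimal sketch, assuming some DataFrame df built elsewhere):

val allRows: Array[org.apache.spark.sql.Row] = df.collect() // materializes every row in driver memory and can cause an OOM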
Recommended Answer
Generally speaking, transferring all the data to the driver is a pretty bad idea, and most of the time there is a better solution, but if you really want to go down this path you can use the toLocalIterator method on the underlying RDD:
val df: org.apache.spark.sql.DataFrame = ???
df.cache // Optional, to avoid repeated computation, see docs for details
val iter: Iterator[org.apache.spark.sql.Row] = df.rdd.toLocalIterator
This concludes this article on how to get an iterator of rows from a DataFrame in SparkSQL; hopefully the recommended answer above is helpful.