Problem Description
I have a DataFrame created by running sqlContext.read on a Parquet file.
The DataFrame consists of 300 million rows. I need to use these rows as input to another function, but I want to do it in smaller batches to prevent an OOM error.
Currently, I am using df.head(1000000) to read the first 1M rows, but I cannot find a way to read the subsequent rows. I tried df.collect(), but it gives me a Java OOM error.
I want to iterate over this dataframe. I tried adding another column with the withColumn() API to generate a unique set of values to iterate over, but none of the existing columns in the dataframe have solely unique values.
For example, I tried val df = df1.withColumn("newColumn", df1("col") + 1) as well as val df = df1.withColumn("newColumn", lit(i+=1)), neither of which returns a sequential set of values.
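For reference, a minimal sketch of one way a truly sequential index column could be produced, assuming the df1 and its column "col" from the example above; it uses the standard row_number window function rather than the attempts shown, so treat it as an illustration only:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Assumption: df1 exists and "col" is one of its columns; "rowIdx" is an illustrative name.
// row_number over a global window yields consecutive values 1, 2, 3, ...
// Note: an un-partitioned window moves all rows to a single partition,
// which can be expensive on 300 million rows.
val indexed = df1.withColumn("rowIdx", row_number().over(Window.orderBy("col")))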
Is there any other way to get the first n rows of a dataframe and then the next n rows, something that works like the range function of SqlContext?
Recommended Answer
You can simply use the limit and except APIs of Dataset or DataFrame as follows:
long count = df.count();
int limit = 50;
while (count > 0) {
    Dataset<Row> df1 = df.limit(limit);   // take the next `limit` rows
    df1.show();                           // prints 50 rows, then the next 50, etc.
    df = df.except(df1);                  // drop the rows that were just shown
    count = count - limit;
}
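Since the question is written in Scala, a roughly equivalent sketch in Scala might look like the following, assuming an existing DataFrame df (the names remaining and batch are illustrative):

var remaining = df                      // assumption: df is the original DataFrame
var count = remaining.count()
val limit = 50

while (count > 0) {
  val batch = remaining.limit(limit)    // take the next `limit` rows
  batch.show()                          // or pass `batch` to the downstream function
  remaining = remaining.except(batch)   // drop the rows that were just processed
  count -= limit
}

Note that except behaves like SQL EXCEPT DISTINCT, so any duplicate rows in the original data would be removed along the way, and each iteration re-evaluates the remaining DataFrame, which can be slow on very large inputs.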