This article covers Spark DataFrame - selecting n random rows, and may be a useful reference for anyone facing the same problem.

Problem Description

I have a dataframe with multiple thousands of records, and I'd like to randomly select 1000 rows into another dataframe for demoing. How can I do this in Java?

Thanks!

Recommended Answer

You can try the sample() method. Unfortunately, it takes a fraction rather than a number of rows. You can write a function like this:

def getRandom(dataset: Dataset[_], n: Int) = {
    val count = dataset.count()
    // take at most n rows, even if the dataset holds fewer
    val howManyTake = if (count > n) n else count
    // sample() expects withReplacement and a fraction, not a row count;
    // limit(n) guards against rounding handing back a few extra rows
    dataset.sample(false, 1.0 * howManyTake / count).limit(n)
}

Explanation: we must pass sample() a fraction of the data. If we have 2000 rows and you want to get 100 of them, the fraction is 100/2000 = 0.05. If you want more rows than the DataFrame contains, you must use 1.0. The limit() function is invoked to make sure the rounding is OK and that you don't get more rows than you specified.
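
Since the question asks for Java, here is a minimal sketch of the same approach using the Java API. The class and variable names (RandomRows, df) are placeholders, not from the original answer; Dataset.sample(boolean, double) and limit(int) are the same calls the Scala version uses:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class RandomRows {
    // Returns up to n randomly sampled rows from df, mirroring getRandom above.
    public static Dataset<Row> getRandom(Dataset<Row> df, int n) {
        long count = df.count();
        long howManyTake = count > n ? n : count;
        // sample(withReplacement, fraction); limit(n) trims any rounding overshoot
        return df.sample(false, (double) howManyTake / count).limit(n);
    }
}

You would then call, for example, Dataset<Row> demo = getRandom(df, 1000); to get the 1000-row demo DataFrame the question asks for.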

Other answers mention the takeSample method. But remember:

  1. It's a method of RDD, not Dataset, so you must drop down to the RDD API: dataset.rdd.takeSample(false, 1000, System.currentTimeMillis()), and then rebuild a DataFrame from the returned array (a plain .toDF() is not available on an Array[Row]). takeSample collects the sampled values.
  2. Remember that if you ask for very many rows you will run into OutOfMemoryError, because takeSample collects its results in the driver. Use it carefully; a Java sketch of this route follows the list.
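
For the takeSample route in Java, here is a hedged sketch under the same caveats; spark and df are assumed placeholders for an existing SparkSession and Dataset<Row>, and the collected rows are turned back into a DataFrame with createDataFrame:

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class TakeSampleRoute {
    // Collects up to n sampled rows into the driver, then rebuilds a DataFrame.
    // The sampled rows live in driver memory, hence the OutOfMemoryError caveat above.
    public static Dataset<Row> sampleViaRdd(SparkSession spark, Dataset<Row> df, int n) {
        List<Row> rows = df.toJavaRDD().takeSample(false, n, System.currentTimeMillis());
        return spark.createDataFrame(rows, df.schema());
    }
}

Note the trade-off: unlike sample(), takeSample returns exactly n rows (when that many exist), but only at the cost of pulling them all onto the driver.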
