This article looks at DataFrame sampling in Apache Spark / Scala; it may be a useful reference for anyone facing the same problem.

Problem description

I'm trying to take samples from two DataFrames while maintaining the ratio of their counts. E.g.:

df1.count() = 10
df2.count() = 1000
noOfSamples = 10

I want to sample the data in such a way that I get 10 samples of size 101 each (1 from df1 and 100 from df2). Right now I'm doing this:

var newSample = df1.sample(true, df1.count() / noOfSamples)
println(newSample.count())

What does the fraction here imply? Can it be greater than 1? I checked this and this but wasn't able to comprehend it fully.

Also, is there any way we can specify the number of rows to be sampled?

Recommended answer

The fraction parameter represents the approximate fraction of the dataset that will be returned. For instance, if you set it to 0.1, about 10% (1/10) of the rows will be returned. For your case, I believe you want to do the following:

val newSample = df1.sample(true, 1D * noOfSamples / df1.count)

However, you may notice that newSample.count will return a different number each time you run it. That's because the fraction acts as a threshold for a randomly generated value (as you can see here), so the size of the resulting dataset can vary. A workaround can be:

val newSample = df1.sample(true, 2D * noOfSamples / df1.count).limit(df1.count / noOfSamples)

As for your questions:

Can it be greater than 1?

No. It represents a fraction, so it must be a decimal number between 0 and 1. If you set it to 1 it will bring back 100% of the rows, so it wouldn't make sense to set it to a number larger than 1.
Also, is there any way we can specify the number of rows to be sampled?

You can specify a larger fraction than the number of rows you want and then use limit, as I show in the second example. Maybe there is another way, but this is the approach I use.
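To make the answer's two points concrete, here is a minimal plain-Scala sketch (no Spark dependency) that simulates the Bernoulli-style sampling behind sample: each row is kept when a uniform random draw falls below the fraction, which is why the returned count varies between runs, and why oversampling followed by a truncation (mirroring sample(...).limit(n)) pins the size. FractionSketch and bernoulliSample are hypothetical names introduced for this illustration, not part of Spark's API.

```scala
import scala.util.Random

object FractionSketch {
  // Simulate the per-row Bernoulli sampling behind DataFrame.sample:
  // a row is kept when a uniform random draw falls below `fraction`.
  def bernoulliSample[T](rows: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    rows.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    val rows = 1 to 1000
    // Different seeds keep different numbers of rows, which is why
    // df.sample(...).count() varies from run to run.
    println(bernoulliSample(rows, 0.1, seed = 1).size)
    println(bernoulliSample(rows, 0.1, seed = 2).size)
    // Oversample with a larger fraction, then truncate to a fixed size,
    // mirroring the sample(larger fraction).limit(n) workaround above.
    val fixed = bernoulliSample(rows, 0.2, seed = 3).take(100)
    println(fixed.size)
  }
}
```

The exact counts printed for fraction 0.1 hover around 100 but differ across seeds; only the truncated sample has a stable size.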