This article looks at DataFrame sampling in Apache Spark / Scala; it may be a useful reference for anyone facing the same problem.

Problem description

I'm trying to take samples from two DataFrames while maintaining the ratio of their counts. E.g.:

df1.count() = 10
df2.count() = 1000
noOfSamples = 10

I want to sample the data in such a way that I get 10 samples of size 101 each (1 from df1 and 100 from df2). Right now I'm doing this:

var newSample = df1.sample(true, df1.count() / noOfSamples)
println(newSample.count())

What does the fraction here imply? Can it be greater than 1? I checked this and this but wasn't able to comprehend it fully.

Also, is there any way we can specify the number of rows to be sampled?

Recommended answer

The fraction parameter represents the approximate fraction of the dataset that will be returned. For instance, if you set it to 0.1, about 10% (1/10) of the rows will be returned. For your case, I believe you want to do the following:

val newSample = df1.sample(true, 1D * noOfSamples / df1.count)

However, you may notice that newSample.count will return a different number each time you run it. That's because the fraction acts as a threshold for a randomly generated value (as you can see here), so the size of the resulting dataset can vary. A workaround can be:

val newSample = df1.sample(true, 2D * noOfSamples / df1.count).limit(df1.count / noOfSamples)

As for your questions:

Can it be greater than 1?

No. It represents a fraction, so it must be a decimal number between 0 and 1. If you set it to 1 it will bring back 100% of the rows, so it wouldn't make sense to set it to a number larger than 1.
Also, is there any way we can specify the number of rows to be sampled?

You can specify a larger fraction than the number of rows you want and then use limit, as I show in the second example. Maybe there is another way, but this is the approach I use.
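To make the answer's two points concrete, here is a minimal plain-Scala sketch (no Spark dependency) that simulates the Bernoulli-style sampling behind sample: each row is kept when a uniform random draw falls below the fraction, which is why the returned count varies between runs, and why oversampling followed by a truncation (mirroring sample(...).limit(n)) pins the size. FractionSketch and bernoulliSample are hypothetical names introduced for this illustration, not part of Spark's API.

```scala
import scala.util.Random

object FractionSketch {
  // Simulate the per-row Bernoulli sampling behind DataFrame.sample:
  // a row is kept when a uniform random draw falls below `fraction`.
  def bernoulliSample[T](rows: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    rows.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    val rows = 1 to 1000
    // Different seeds keep different numbers of rows, which is why
    // df.sample(...).count() varies from run to run.
    println(bernoulliSample(rows, 0.1, seed = 1).size)
    println(bernoulliSample(rows, 0.1, seed = 2).size)
    // Oversample with a larger fraction, then truncate to a fixed size,
    // mirroring the sample(larger fraction).limit(n) workaround above.
    val fixed = bernoulliSample(rows, 0.2, seed = 3).take(100)
    println(fixed.size)
  }
}
```

The exact counts printed for fraction 0.1 hover around 100 but differ across seeds; only the truncated sample has a stable size.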