Is the sample method on Spark DataFrames uniform sampling?
Question
I want to randomly select a certain number of rows from a DataFrame. I know the sample method does this, but I need the sampling to be uniform. Is the sample method on Spark DataFrames uniform or not?
Thanks
Recommended answer
There are a few code paths here:
- If withReplacement = false && fraction > .4, it uses a souped-up random number generator (rng.nextDouble() <= fraction) and lets that do the work. This seems like it would be pretty uniform.
- If withReplacement = false && fraction <= .4, it uses a more complex algorithm (GapSamplingIterator) that, at a glance, also looks like it should be uniform.
- If withReplacement = true, it does close to the same thing, except that by the looks of it the same row can be picked more than once, so this looks to me like it would not be as uniform as the first two.
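To make the three paths above more concrete, here is a minimal pure-Python sketch of the ideas behind them. It is not Spark's code: the function names are hypothetical, the gap sampler is a textbook geometric-skip version rather than a copy of GapSamplingIterator, and the with-replacement branch assumes one Poisson(fraction) draw per row, which is my understanding of how Spark behaves rather than something stated above. The helper at the end estimates how often each row position is picked, which is a rough empirical check of uniformity.

```python
import math
import random
from collections import Counter


def bernoulli_sample(items, fraction, rng):
    """Keep each item independently when a uniform draw is <= fraction."""
    return [x for x in items if rng.random() <= fraction]


def gap_sample(items, fraction, rng):
    """Same distribution as bernoulli_sample, but jumps over rejected items.

    The number of items skipped before the next accepted one is geometric,
    so one random draw is needed per accepted item instead of per item.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    log_miss = math.log(1.0 - fraction)

    def next_gap():
        # 1.0 - random() lies in (0, 1], so the logarithm is always defined.
        return int(math.log(1.0 - rng.random()) / log_miss)

    picked = []
    i = next_gap()
    while i < len(items):
        picked.append(items[i])
        i += 1 + next_gap()
    return picked


def poisson_draw(lam, rng):
    """Knuth's method for one Poisson(lam) draw; fine for small lam."""
    limit = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def poisson_sample_with_replacement(items, fraction, rng):
    """Emit each item Poisson(fraction) times, so duplicates are possible."""
    picked = []
    for x in items:
        picked.extend([x] * poisson_draw(fraction, rng))
    return picked


def mean_picks_per_index(sampler, n_items, fraction, trials, seed):
    """Average number of times each position is picked over many trials."""
    rng = random.Random(seed)
    items = list(range(n_items))
    counts = Counter()
    for _ in range(trials):
        counts.update(sampler(items, fraction, rng))
    return [counts[i] / trials for i in items]


for sampler in (bernoulli_sample, gap_sample, poisson_sample_with_replacement):
    rates = mean_picks_per_index(sampler, n_items=5, fraction=0.3,
                                 trials=5000, seed=42)
    print(sampler.__name__, [round(r, 3) for r in rates])
```

In this sketch every position ends up with roughly the same average number of picks (close to fraction) under all three samplers. The with-replacement version differs in that a single run can return the same row several times, which is the duplication the answer mentions.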