Spark re-samples my data every time I run anything that involves the sample

Problem description

I am running a stratified sample on a dataset and keeping the sample in a DataFrame called df. When I run a count on df, I get a different count every time (without re-running the stratified sampling), as if my data were re-sampled on every operation on df. I have the seed set to 12 and I use the Spark function sampleBy.

I am pretty new to Spark. Is this normal? How do I counteract this issue?

Recommended answer

It is a bit hard to tell for sure without the code, but if you don't cache/persist your DataFrame anywhere, Spark will re-run everything up to the point where you call an action like .count(). So if you are sampling your data at some point, the sampling is re-run on every action. As far as I know, a fixed seed only reproduces the same sample when the rows reach the sampling step in the same partitions and in the same order, so if anything upstream is not deterministic, the result can differ from run to run.

You can use df = df.cache() or df = df.persist(), e.g. when you first load the data and right after the sampling, so that Spark has a sort of break point there and does not re-run everything.

I hope this helps, good luck!