Problem Description
I have a lot of data, and I have experimented with partition counts in the range [20k, 200k+].
I call it like this:
from pyspark.mllib.clustering import KMeans, KMeansModel
C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
C0 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)
and I see that initRandom() calls takeSample() once.

The takeSample() implementation doesn't seem to call itself recursively or anything like that, so I would expect KMeans() to invoke takeSample() once. So why does the monitor show two takeSample()s per KMeans()?
Note: I execute more KMeans() runs, and they all invoke two takeSample()s, regardless of whether the data is .cache()'d or not.
Moreover, the number of partitions doesn't affect how many times takeSample() is called; it's constant at 2.
I am using Spark 1.6.2 (and I cannot upgrade), and my application is in Python, if that matters!
I brought this to the Spark developers' mailing list, so I am updating:
Details of the first takeSample():

Details of the second takeSample():

where one can see that the same code is executed.
Recommended Answer
I think takeSample itself runs multiple jobs if the number of samples collected in the first pass is not enough. The comment and code path on GitHub should explain when this happens. You can also confirm this by checking whether the logWarning shows up in your logs.
// If the first sample didn't turn out large enough, keep trying to take samples;
// this shouldn't happen often because we use a big multiplier for the initial size
var numIters = 0
while (samples.length < num) {
  logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
  samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
  numIters += 1
}
However, as one can see, the second comment says it shouldn't happen often, yet it always happens for me, so if anyone has another idea, please let me know.
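Following the suggestion above, one way to confirm is to scan the driver/executor logs for that warning text. A minimal sketch (the log excerpt below is fabricated for illustration; the actual logger name in your logs may differ):

```python
# Count how often Spark's takeSample had to re-sample, by scanning
# log lines for the logWarning message from the snippet above.
WARNING = "Needed to re-sample due to insufficient sample size"

def count_resample_warnings(log_lines):
    return sum(1 for line in log_lines if WARNING in line)

# Fabricated log excerpt, for illustration only:
fake_log = [
    "INFO SparkContext: Starting job: takeSample",
    "WARN RDD: Needed to re-sample due to insufficient sample size. Repeat #0",
]
print(count_resample_warnings(fake_log))  # 1
```

If the count stays at 0 while the monitor still shows two takeSample() jobs, the extra job is not caused by this retry loop.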
It was also suggested that this was a UI problem and that takeSample() was actually called only once, but that was just hot air.