I've just created a Python list from range(1, 100000).

Using a SparkContext, I ran the following steps:

a = sc.parallelize([i for i in range(1, 100000)])
b = sc.parallelize([i for i in range(1, 100000)])

c = a.zip(b)

# c now holds [(1, 1), (2, 2), ...]

total = sc.accumulator(0)

# Index into the pair: tuple-unpacking lambdas only work in Python 2.
c.foreach(lambda pair: total.add(pair[1] - pair[0]))


WARN TaskSetManager: Stage 3 contains a task of very large size (4644 KB). The maximum recommended task size is 100 KB.


How can I resolve this warning? Is there any way to reduce the task size? And will it affect the running time on big data?


Expanding on @leo9r's comment: consider using sc.range instead of a Python range (https://spark.apache.org/docs/1.6.0/api/python/pyspark.html#pyspark.SparkContext.range).


That way you avoid transferring a huge list from the driver to the executors.
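A minimal sketch of the same computation built on sc.range, assuming sc is an already-created SparkContext (as in the pyspark shell). The function name sum_of_differences and its parameters are hypothetical, chosen only for illustration. As far as I understand, sc.range lets each executor generate its own slice of the numbers, so the tasks carry only the range bounds rather than the data.

```python
def sum_of_differences(sc, start=1, stop=100000):
    """Sum (y - x) over two zipped ranges.

    `sc` is assumed to be an existing pyspark.SparkContext.
    The function name and defaults are illustrative only.
    """
    # sc.range builds the RDD from the bounds instead of a driver-side list.
    a = sc.range(start, stop)
    b = sc.range(start, stop)

    # zip needs the same partitioning and element counts on both sides,
    # which holds here because both RDDs are built identically.
    c = a.zip(b)

    total = sc.accumulator(0)
    # Each element of c is an (x, y) pair; add the difference to the accumulator.
    c.foreach(lambda pair: total.add(pair[1] - pair[0]))

    # The accumulator value is only readable on the driver.
    return total.value
```

In the pyspark shell this would be called as sum_of_differences(sc). Since both ranges are identical, every difference is zero.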

Of course, such RDDs are usually used for testing purposes only, so you do not want them to be broadcast.
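To get a rough feel for why the list version produces such large tasks, you can compare the serialized size of a materialized list with that of a lazy range. This is only an illustration in plain Python: pickle stands in here for whatever Spark does internally, and the exact numbers will differ from the 4644 KB in the warning.

```python
import pickle

# A materialized list must carry every element when serialized.
list_bytes = len(pickle.dumps(list(range(1, 100000))))

# A range object only carries its start, stop and step (Python 3).
range_bytes = len(pickle.dumps(range(1, 100000)))

print("list: ", list_bytes, "bytes")
print("range:", range_bytes, "bytes")
```

The list grows with the number of elements, while the range stays the same small size no matter how many numbers it describes.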

