Why does SparkContext.parallelize use the driver's memory?
Problem Description
Now I have to create a parallelized collection using sc.parallelize() in pyspark (Spark 2.1.0).
The collection in my driver program is big. When I parallelize it, I find it takes up a lot of memory on the master node.
It seems that the collection is still kept in Spark's memory on the master node even after I parallelize it out to the worker nodes. Here's an example of my code:
# my python code
from pyspark import SparkContext

sc = SparkContext()
a = [1.0] * 1000000000                  # one billion floats built in the driver
rdd_a = sc.parallelize(a, 1000000)      # split into 1,000,000 partitions
sum = rdd_a.reduce(lambda x, y: x + y)
I've tried
del a
to destroy it, but it didn't work. Spark, which runs as a Java process, is still using a lot of memory.
After I create rdd_a, how can I destroy a to free the master node's memory?
Thanks!
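For reference, the cleanup attempt above boils down to something like the sketch below (the gc.collect() call and the comments are additions for illustration, not part of the original question). Freeing the Python list only releases the driver's Python heap; it does not touch the copy of the data that parallelize() has already handed to the JVM.

import gc
from pyspark import SparkContext

sc = SparkContext()
a = [1.0] * 1000000000              # the large driver-side list from the question
rdd_a = sc.parallelize(a, 1000000)

del a                               # drops the Python-side reference
gc.collect()                        # frees the Python list, but not the serialized
                                    # copy already held in the driver JVM as
                                    # rdd_a's partition data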
Answer
The job of the master is to coordinate the workers and to give a worker a new task once it has completed its current one. In order to do that, the master needs to keep track of all of the tasks that need to be done for a given calculation.
Now, if the input were a file, a task would simply look like "read file F from X to Y". But because the input was in memory to begin with, each task carries its own slice of the data, here about 1,000 of those numbers. And given that the master needs to keep track of all 1,000,000 such tasks, that gets quite large.
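Given that explanation, one workaround (a sketch of my own, not part of the original answer) is to avoid building the collection in the driver at all and describe it lazily instead: sc.range() ships only the (start, end, step) bounds of each slice, and the executors generate their own elements. The names rdd_a and total mirror the question's example, and the numSlices value is arbitrary.

from pyspark import SparkContext

sc = SparkContext()

# sc.range() does not materialize the numbers in the driver; each executor
# generates its own slice of the range and then maps it to 1.0.
rdd_a = sc.range(0, 1000000000, numSlices=1000).map(lambda _: 1.0)

total = rdd_a.reduce(lambda x, y: x + y)   # computed on the executors

With this approach there is no big list a in the driver to destroy in the first place, and using far fewer partitions than 1,000,000 also keeps the master's task-tracking overhead small.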