Question
I'm doing a simple groupBy on a fairly small dataset (80 files in HDFS, a few gigs in total). I'm running Spark on 8 low-memory machines in a YARN cluster, i.e. something along the lines of:
spark-submit ... --master yarn-client --num-executors 8 --executor-memory 3000m --executor-cores 1
The dataset consists of strings of length 500-2000.
I'm trying to do a simple groupByKey (see below), but it fails with a java.lang.OutOfMemoryError: GC overhead limit exceeded exception:
val keyvals = sc.newAPIHadoopFile("hdfs://...")
  .map( someobj.produceKeyValTuple )
keyvals.groupByKey().count() // fails with the GC overhead limit exceeded error
I can count the group sizes using reduceByKey without problems, assuring myself that the problem isn't caused by a single excessively large group, nor by an excessive number of groups:
keyvals.map(s => (s._1, 1)).reduceByKey((a,b) => a+b).collect().foreach(println)
// produces:
// (key1,139368)
// (key2,35335)
// (key3,392744)
// ...
// (key13,197941)
I've tried reformatting, reshuffling and increasing the groupBy level of parallelism:
keyvals.groupByKey(24).count // fails
keyvals.groupByKey(3000).count // fails
keyvals.coalesce(24, true).groupByKey(24).count // fails
keyvals.coalesce(3000, true).groupByKey(3000).count // fails
keyvals.coalesce(24, false).groupByKey(24).count // fails
keyvals.coalesce(3000, false).groupByKey(3000).count // fails
I've tried playing around with spark.default.parallelism, and increasing spark.shuffle.memoryFraction to 0.8 while lowering spark.storage.memoryFraction to 0.1.
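For reference, a minimal sketch of how such settings could be applied programmatically via SparkConf (the app name here is an assumption; the memoryFraction keys are the legacy pre-1.6 settings, and the same values could equally be passed as --conf flags to spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical configuration mirroring the tuning described above.
val conf = new SparkConf()
  .setAppName("groupByKey-oom")                // assumed app name
  .set("spark.default.parallelism", "3000")    // default shuffle parallelism
  .set("spark.shuffle.memoryFraction", "0.8")  // more heap for shuffle buffers
  .set("spark.storage.memoryFraction", "0.1")  // less heap for cached RDDs
val sc = new SparkContext(conf)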
The failing stage (count) will fail on task 2999 of 3000.
I can't seem to find anything that suggests that groupBy shouldn't just spill to disk instead of keeping things in memory, but I just can't get it to work right, even on fairly small datasets. This should obviously not be the case, and I must be doing something wrong, but I have no idea where to start debugging this!
Answer
Patrick Wendell shed some light on the details of the groupBy operator on the mailing list. The takeaway message is the following:
Within a partition things will spill [...] This spilling can only occur across keys at the moment. Spilling cannot occur within a key at present. [...] Spilling within one key for GroupBys is likely to end up in the next release of Spark, Spark 1.2. [...] If the goal is literally to just write out to disk all the values associated with each group, and the values associated with a single group are larger than fit in memory, this probably cannot be accomplished right now with the groupBy operator.
He further suggests a work-around:
The best way to work around this depends a bit on what you are trying to do with the data downstream. Typical approaches involve sub-dividing any very large groups, for instance, appending a hashed value in a small range (1-10) to large keys. Then your downstream code has to deal with aggregating partial values for each group. If your goal is just to lay each group out sequentially on disk in one big file, you can call sortByKey with a hashed suffix as well. The sort functions are externalized in Spark 1.1 (which is in pre-release).
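As a rough illustration of that salting idea (my own sketch, not code from the answer; the salt range of 10 and the use of the value's hashCode are arbitrary choices):

// Sub-divide large groups by appending a small hash (0-9) to each key,
// so no single group's values have to fit in memory at once.
val salted = keyvals.map { case (k, v) =>
  ((k, (v.hashCode & Int.MaxValue) % 10), v)  // salted key: (original key, 0-9)
}
salted.groupByKey(3000).count()  // each sub-group is a fraction of the original

// Or, to lay groups out sequentially on disk, sort by the salted key instead:
salted.sortByKey().saveAsTextFile("hdfs://...")

Downstream code then has to merge the up-to-ten partial groups per original key, as the answer notes.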