Problem description
I'm running a Spark job. It shows that all of the jobs were completed:
However, after a couple of minutes the entire job restarts; this time it again shows all jobs and tasks as completed, but after a couple of minutes it fails. I found this exception in the logs:
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
This happens when I'm trying to join two pretty big tables: one of 3B rows and the second of 200M rows. When I run show(100) on the resulting dataframe, everything gets evaluated and I run into this issue.
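For reference, a minimal sketch of the kind of join described here, using the Spark 1.4 DataFrame API; the names bigDf, mediumDf and the join column "id" are placeholders, not taken from the actual code:

// Hypothetical dataframes standing in for the 3B-row and 200M-row tables.
val joined = bigDf.join(mediumDf, bigDf("id") === mediumDf("id"))
joined.show(100)  // show() triggers evaluation of the whole join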
I tried playing around with increasing/decreasing the number of partitions, and I changed the garbage collector to G1 with an increased number of threads. I changed spark.sql.broadcastTimeout to 600 (which changed the timeout message to 600 seconds).
I also read that this might be a communication issue, but other show() calls that run prior to this code segment work without problems, so that's probably not it.
This is the submit command:
/opt/spark/spark-1.4.1-bin-hadoop2.3/bin/spark-submit --master yarn-cluster --class className --executor-memory 12g --executor-cores 2 --driver-memory 32g --driver-cores 8 --num-executors 40 --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:ConcGCThreads=20" /home/asdf/fileName-assembly-1.0.jar
You can get an idea of the Spark version and the resources used from it.
Where do I go from here? Any help would be appreciated, and I can provide code segments/additional logging if needed.
Recommended answer
What eventually solved this was persisting both data frames before the join.
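A sketch of what that looks like, reusing the placeholder names from the question's sketch; the storage level is illustrative, not taken from the original code:

import org.apache.spark.storage.StorageLevel

// Persist both sides and force materialization with an action before joining.
bigDf.persist(StorageLevel.MEMORY_AND_DISK)
mediumDf.persist(StorageLevel.MEMORY_AND_DISK)
bigDf.count()
mediumDf.count()

// With both inputs cached, the planner picked a shuffle join instead of
// attempting to broadcast one side (see the plans discussed below).
val joined = bigDf.join(mediumDf, bigDf("id") === mediumDf("id"))
joined.show(100)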
I looked at the execution plan before and after persisting the data frames, and the strange thing was that before persisting, Spark tried to perform a BroadcastHashJoin, which clearly failed due to the large size of the data frames; after persisting, the execution plan showed that the join would be a ShuffleHashJoin, and it completed without any issues whatsoever. A bug? Maybe. I'll try with a newer Spark version when I get to it.
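For anyone wanting to verify which strategy the planner picked, explain() prints the physical plan of a dataframe; a quick sketch (joined being the result of the join above):

// Look for BroadcastHashJoin vs ShuffleHashJoin in the printed physical plan.
joined.explain()
// explain(true) also prints the logical and optimized plans.
joined.explain(true)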