Problem Description
I'm running an application that loads data (.csv) from S3 into DataFrames and then registers those DataFrames as temp tables. After that, I use Spark SQL to join those tables and finally write the result to a database. The current bottleneck is that the tasks do not seem to be split evenly, so I get no benefit from parallelization or from the multiple nodes in the cluster. More precisely, this is the distribution of task durations in the problematic stage: (task duration distribution screenshot). Is there a way for me to enforce a more balanced distribution? Maybe by manually writing map/reduce functions? Unfortunately, this stage has 6 more tasks that are still running (1.7 hours at the moment), which will show an even greater deviation.
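For reference, here is a minimal sketch of the kind of pipeline described above, written against the current DataFrame API (createOrReplaceTempView rather than the older registerTempTable). The bucket paths, view names, join condition, and JDBC settings are placeholders, not the actual job:

```scala
import org.apache.spark.sql.{SparkSession, SaveMode}

object JoinJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-join-to-db").getOrCreate()

    // Load the CSVs from S3 and register them as temp views (paths are placeholders)
    val orders    = spark.read.option("header", "true").csv("s3a://my-bucket/orders/")
    val customers = spark.read.option("header", "true").csv("s3a://my-bucket/customers/")
    orders.createOrReplaceTempView("orders")
    customers.createOrReplaceTempView("customers")

    // Join the registered views with Spark SQL
    val joined = spark.sql(
      """SELECT o.*, c.name
        |FROM orders o
        |JOIN customers c ON o.customer_id = c.id""".stripMargin)

    // Write the result to a relational database over JDBC (URL/table are placeholders)
    joined.write
      .mode(SaveMode.Append)
      .jdbc("jdbc:postgresql://db-host:5432/analytics", "joined_orders",
            new java.util.Properties())

    spark.stop()
  }
}
```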
Recommended Answer
There are two likely possibilities: one is under your control and, unfortunately, one likely is not.
- Skewed data. Check that the partitions are of relatively similar size, say within a factor of three or four (see the sketch after this list for one way to inspect this).
- Inherent variability of Spark task runtimes. I have seen large delays from stragglers on Spark Standalone, YARN, and Mesos without an apparent reason. The symptoms are:
  - extended periods (minutes) where little or no CPU or disk activity occurs on the nodes hosting the straggler tasks
  - no apparent correlation between data size and the stragglers
  - different nodes/workers may experience the delays on subsequent runs of the same job
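A quick way to check the first point is to count rows per partition of the DataFrame feeding the slow stage. This is a sketch only; it assumes the `orders` and `joined` DataFrames from the pipeline sketch above, and the partition count and key are placeholders:

```scala
import org.apache.spark.sql.functions.spark_partition_id

// Count rows per partition to see how evenly the data is spread
val perPartition = joined
  .groupBy(spark_partition_id().alias("partition_id"))
  .count()
  .orderBy("count")

perPartition.show(200, truncate = false)

// If a few partitions are far larger than the rest (more than ~3-4x),
// repartitioning on the join key (or to a fixed partition count) before
// the join can spread the work more evenly.
val rebalanced = orders.repartition(200, orders("customer_id"))
```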
One thing to check: run
hdfs dfsadmin -report
and
hdfs fsck
to see whether HDFS is healthy.