Problem description
I was trying to import a 1 TB table from MySQL into HDFS using Sqoop. The command used was:
sqoop import --connect jdbc:mysql://xx.xx.xxx.xx/MyDB --username myuser --password mypass --table mytable --split-by rowkey -m 14
After the bounding-values query executes, all the mappers start, but after some time the tasks are killed due to a timeout (1200 seconds). I think this is because the select query running in each mapper takes longer than the configured timeout (in Sqoop it seems to be 1200 seconds), so the task fails to report status and subsequently gets killed. (I also tried a 100 GB data set; it still failed because multiple mappers timed out.) A single-mapper import works fine, as no filtered result sets are needed. Is there any way to override the map task timeout (say, set it to 0 or a very high value) while using multiple mappers in Sqoop?
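For context, the "filtered result sets" mentioned above come from the way a multi-mapper import divides the `--split-by` column: Sqoop first queries the minimum and maximum of that column, then gives each mapper its own range predicate. The following is a simplified sketch of that idea only, not Sqoop's actual splitter code; the values of `MIN_ID` and `MAX_ID` are hypothetical stand-ins for the result of the bounding-values query.

```shell
# Hypothetical bounds, standing in for SELECT MIN(rowkey), MAX(rowkey) FROM mytable
MIN_ID=1
MAX_ID=1400000
MAPPERS=14   # same as -m 14 in the command above

# Width of each mapper's range (integer division; the last mapper absorbs the remainder)
STEP=$(( (MAX_ID - MIN_ID) / MAPPERS ))

i=0
lo=$MIN_ID
while [ "$i" -lt "$MAPPERS" ]; do
  if [ "$i" -eq $((MAPPERS - 1)) ]; then
    # Last range is closed so the row with the maximum key is included
    echo "mapper $i: WHERE rowkey >= $lo AND rowkey <= $MAX_ID"
  else
    hi=$((lo + STEP))
    echo "mapper $i: WHERE rowkey >= $lo AND rowkey < $hi"
    lo=$hi
  fi
  i=$((i + 1))
done
```

If `rowkey` is not indexed, each of these range queries may force MySQL to scan the whole table, which is one plausible reason a mapper could sit for a long time before returning its first row.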
Sqoop uses a dedicated thread to send status updates so that the map task won't be killed by the JobTracker. I would be interested in exploring your issue further. Would you mind sharing the Sqoop log, one of the map task logs, and your table schema?
Jarcec
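On the literal question of overriding the timeout: Hadoop's task timeout is an ordinary job property, and Sqoop accepts generic Hadoop options (`-D property=value`) placed immediately after the tool name. The sketch below assumes the classic property name `mapred.task.timeout` (value in milliseconds, `0` disables the timeout); on newer Hadoop releases the equivalent property is `mapreduce.task.timeout`. The variable names `TASK_TIMEOUT_MS` and `SQOOP_CMD` are illustrative only. Whether this resolves the failure described here is not confirmed by the thread; since Sqoop already reports status from a separate thread, a timeout may point to a different underlying problem.

```shell
# Assumption: the cluster honours mapred.task.timeout (milliseconds; 0 = no timeout).
# On newer Hadoop versions, use mapreduce.task.timeout instead.
TASK_TIMEOUT_MS=0

# Generic Hadoop options (-D ...) must come right after "import",
# before any Sqoop-specific arguments such as --connect.
SQOOP_CMD="sqoop import -D mapred.task.timeout=${TASK_TIMEOUT_MS}"
SQOOP_CMD="$SQOOP_CMD --connect jdbc:mysql://xx.xx.xxx.xx/MyDB"
SQOOP_CMD="$SQOOP_CMD --username myuser --password mypass"
SQOOP_CMD="$SQOOP_CMD --table mytable --split-by rowkey -m 14"

# Dry run: print the assembled command so it can be reviewed before running it.
echo "$SQOOP_CMD"
```

Disabling the timeout entirely means a genuinely hung task is never killed, so a large finite value (for example `3600000` for one hour) may be the safer choice.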