I am running several Hadoop tests with the TestDFSIO and TeraSort benchmark tools. I am testing with different numbers of datanodes in order to assess how linearly the processing capacity scales with the number of datanodes.
During this process I have had to restart the whole Hadoop environment several times. Every time I restart Hadoop, all MapReduce jobs are removed and the job counter starts again from "job_2013*_0001". For comparison purposes, it is very important for me to keep all the MapReduce jobs that I have previously launched. So, my question is:
How can I prevent Hadoop from removing all MapReduce job history after it is restarted? Is there a property that controls job removal when the Hadoop environment restarts?
The MR job history logs are not deleted right away after you restart Hadoop. The job counter does start again from *_0001, though, and only jobs started after the restart are displayed on the ResourceManager web portal. In fact, there are 2 log-related settings in the YARN defaults:
# this is where you can find the MR job history logs
yarn.nodemanager.log-dirs = ${yarn.log.dir}/userlogs
# this is how long the history logs will be retained
yarn.nodemanager.log.retain-seconds = 10800
The default ${yarn.log.dir} is defined in $HADOOP_HOME/etc/hadoop/yarn-env.sh.
BTW, similar settings can also be found in mapred-env.sh if you are using Hadoop 1.X.
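
If the goal is to keep those logs around longer than the default 10800 seconds (3 hours), the retention setting above could be overridden in yarn-site.xml. This is only a minimal sketch: the value 604800 (7 days) is an arbitrary illustration, and as far as I know this property only takes effect when log aggregation is disabled, so check it against your Hadoop version.

```xml
<!-- yarn-site.xml: hypothetical override, adjust the value to your needs -->
<configuration>
  <property>
    <!-- how long the NodeManager retains user logs, in seconds -->
    <name>yarn.nodemanager.log.retain-seconds</name>
    <!-- 604800 s = 7 days; illustrative value, not a recommendation -->
    <value>604800</value>
  </property>
</configuration>
```

The NodeManagers would need to be restarted to pick up the change. Note that this only affects how long the log files stay on disk; it does not change the job numbering after a restart.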