Problem description
I have 100 mappers and 1 reducer running in a job. How can I improve the job's performance?
As I understand it, using a combiner can improve performance to a great extent. But what else do we need to configure to improve the job's performance?
Recommended answer
With the limited data in this question (no input file size, HDFS block size, average map processing time, or number of mapper and reducer slots in the cluster), we can't suggest specific tips.
But there are some general guidelines for improving performance:
- If each task takes less than 30-40 seconds, reduce the number of tasks.
- If a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB so that the number of tasks is smaller.
- So long as each task runs for at least 30-40 seconds, increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster.
- The number of reduce tasks per job should be equal to or a bit less than the number of reduce slots in the cluster.
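The effect of block size on task count in the guidelines above is simple arithmetic, and a rough sketch can make it concrete. The Java below is not Hadoop code; it assumes one map task per input split and a split size equal to the HDFS block size. The 128 MB baseline and the 200 map slots are hypothetical values chosen for illustration, not figures from the question.

```java
// Back-of-the-envelope sketch: how block (split) size changes the number of map tasks.
// Assumption: one map task per input split, and split size == HDFS block size.
final class TaskCountSketch {

    // Number of splits needed to cover inputBytes, rounding up for a partial last split.
    static long numSplits(long inputBytes, long splitBytes) {
        if (inputBytes < 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("input must be >= 0 and split size > 0");
        }
        return (inputBytes + splitBytes - 1) / splitBytes;
    }

    // Number of scheduling "waves" needed to run all tasks on a fixed number of slots.
    static long waves(long tasks, long slots) {
        if (tasks < 0 || slots <= 0) {
            throw new IllegalArgumentException("tasks must be >= 0 and slots > 0");
        }
        return (tasks + slots - 1) / slots;
    }

    public static void main(String[] args) {
        long inputBytes = 1L << 40;        // 1 TB of input (binary units)
        long mb = 1L << 20;
        long hypotheticalMapSlots = 200;   // assumed cluster capacity, for illustration only

        for (long blockMb : new long[] {128, 256, 512}) {
            long tasks = numSplits(inputBytes, blockMb * mb);
            System.out.printf("block=%d MB -> %d map tasks, about %d waves on %d slots%n",
                    blockMb, tasks, waves(tasks, hypotheticalMapSlots), hypotheticalMapSlots);
        }
    }
}
```

Doubling the block size halves the number of map tasks, so each task processes more data and per-task startup overhead is spread over a longer run. That is the reasoning behind the 30-40 second rule of thumb.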
More tips:
- Configure the cluster properly, using the right diagnostic tools
- Use compression when writing intermediate data to disk
- Tune the number of map and reduce tasks as per the tips above
- Incorporate a Combiner wherever it is appropriate
- Use the most appropriate data types for output (do not use LongWritable when the output values fit in the Integer range; IntWritable is the right choice in that case)
- Reuse Writables
- Have the right profiling tools
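Since the question has 100 mappers feeding a single reducer, the combiner tip deserves a concrete picture. The sketch below is a plain-Java simulation and does not use the Hadoop API; the class and method names are made up for illustration. It models a word-count style job and counts how many records would reach the reducer with and without local pre-aggregation. In a real job the combiner is registered on the driver (for example via Job.setCombinerClass), and it is only safe when the reduce operation is commutative and associative, as summing is.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation (no Hadoop API) of what a combiner does for a sum-style job.
final class CombinerSketch {

    // Map phase of a word count: emit one (word, 1) record per token in the split.
    static List<Map.Entry<String, Integer>> mapPhase(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Combiner: pre-aggregate a single mapper's output locally, before the shuffle.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> partial = new HashMap<>();
        for (Map.Entry<String, Integer> r : records) {
            partial.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return partial;
    }

    // Reducer: merge the partial sums received from every mapper into final counts.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> p : partials) {
            for (Map.Entry<String, Integer> e : p.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] splits = {"a b a a", "b b a c"};  // two tiny hypothetical input splits
        int withoutCombiner = 0;
        int withCombiner = 0;
        List<Map<String, Integer>> partials = new ArrayList<>();

        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapped = mapPhase(split);
            withoutCombiner += mapped.size();      // every raw record would be shuffled
            Map<String, Integer> combined = combine(mapped);
            withCombiner += combined.size();       // one record per distinct key per mapper
            partials.add(combined);
        }

        System.out.println("records shuffled without combiner: " + withoutCombiner);
        System.out.println("records shuffled with combiner:    " + withCombiner);
        System.out.println("final counts: " + reduce(partials));
    }
}
```

The saving grows with the number of repeated keys per mapper. With a single reducer, every record removed before the shuffle is one less record that the lone reduce task has to fetch, sort, and merge.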
Have a look at this Cloudera article for some more tips.