Tips to improve MapReduce job performance in Hadoop

Question

I have 100 mappers and 1 reducer running in a job. How can I improve the job's performance?

As I understand it, using a combiner can improve performance to a great extent. But what else do we need to configure to improve job performance?

Answer

With the limited data in this question (input file size, HDFS block size, average map processing time, number of mapper slots and reduce slots in the cluster, etc.), we can't suggest specific tips.

But there are some general guidelines to improve performance.

  1. If each task takes less than 30-40 seconds, reduce the number of tasks.
  2. If a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB so that the number of map tasks is smaller (see the driver sketch after this list).
  3. As long as each task runs for at least 30-40 seconds, increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster.
  4. The number of reduce tasks per job should be equal to or a bit less than the number of reduce slots in the cluster.
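
As a concrete illustration of points 2 and 4, here is a minimal driver sketch. It assumes the Hadoop 2.x `mapreduce` API and uses the stock `TokenCounterMapper` and `IntSumReducer` library classes; the class name `TaskCountTuningDriver` and the reduce-task count of 20 are made-up example values, not values taken from the question.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class TaskCountTuningDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(TaskCountTuningDriver.class);
        job.setMapperClass(TokenCounterMapper.class);   // stock word-count mapper
        job.setReducerClass(IntSumReducer.class);       // stock sum reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Tip 2: with a very large (> 1 TB) input, a 256 MB minimum split
        // size produces fewer, longer-running map tasks.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

        // Tip 4: set the reduce-task count to (or slightly below) the number
        // of reduce slots in the cluster; 20 here is a made-up value.
        job.setNumReduceTasks(20);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this into a jar and submit it with `hadoop jar`, passing the input and output paths as the two arguments.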

Some more tips:

  1. Configure the cluster properly, with the right diagnostic tools.
  2. Use compression when writing intermediate data to disk.
  3. Tune the number of map and reduce tasks as per the tips above.
  4. Incorporate a combiner wherever it is appropriate (see the sketch after this list).
  5. Use the most appropriate data types for the output (do not use LongWritable when the output values fit in the Integer range; IntWritable is the right choice in that case).
  6. Reuse Writables.
  7. Use the right profiling tools.
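
To make tips 2, 4, 5 and 6 concrete, the sketch below shows a word-count mapper that emits IntWritable counts and reuses its Writable output objects, plus a small helper that turns on map-output compression and registers a combiner. The class name `TokenizingMapper`, the helper `tuneShuffle` and the choice of `SnappyCodec` (which needs the native Snappy library on the nodes) are assumptions for illustration; the `mapreduce.map.output.compress*` keys are the standard Hadoop 2.x property names.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class TokenizingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Tips 5 & 6: per-record counts fit comfortably in an int, so IntWritable
    // is used instead of LongWritable, and both output objects are allocated
    // once and reused for every record rather than created inside map().
    private final Text word = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, one);
        }
    }

    /**
     * Tips 2 & 4: driver-side settings. Call this on the Job before submitting it.
     * Snappy is only an example codec; any installed CompressionCodec works.
     */
    public static void tuneShuffle(Job job) {
        // Compress intermediate map output before it is spilled to disk
        // and shuffled across the network.
        job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
        job.getConfiguration().setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        // The sum reducer is commutative and associative, so it can safely
        // double as the combiner and shrink the data sent to the reducer.
        job.setCombinerClass(IntSumReducer.class);
    }
}
```

In the driver you would register the mapper with `job.setMapperClass(TokenizingMapper.class)` and call `TokenizingMapper.tuneShuffle(job)` before submitting the job.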

Have a look at this Cloudera article for some more tips.

