Hadoop/MapReduce: "Total time spent by all maps in occupied slots" vs. "Total time spent by all map tasks"

Question

Background: I'm analyzing the performance of AWS Hadoop jobs on various cluster configurations and some of the Hadoop counters are confusing.

Question: what's the difference between "Total time spent by all maps in occupied slots" and "Total time spent by all map tasks"? (Same question for reduce.) For brevity, let's call these counters mapO, mapT, redO and redT. Here's what I've seen in three different configurations (each with a varying number of core/slave nodes):

1) For AWS/EMR jobs (Hadoop 2.4.0-amzn-3), the ratio of mapO / mapT is always 6.0 and the ratio of redO / redT is always 12.0.

2) For manually installed Hadoop (Hadoop 2.4.0.2.1.5.0-695) using instance storage, the ratio of mapO / mapT is always 1.0 but the ratio of redO / redT is sometimes 1.0 and sometimes 2.0.

3) For manually installed Hadoop using EBS storage, the ratio of mapO / mapT is always 1.0 and the ratio of redO / redT is always 2.0.

I'm assuming other configurations would have different ratios, but what do these counters/timers actually measure?

I bought Tom White's excellent "Hadoop" book (3rd Edition) but there is no mention of the mapO or redO counters in particular or "occupied slots" in general.

I've also run lots of Google searches and viewed dozens of pages on hadoop.apache.com. I also have (and run) Hadoop on my MacBook; I searched the code for these counters and couldn't find them (I'm sure they're there but??).

As noted in a related (and unanswered) question, it is surprising and weird that even a basic description of these basic counters is not readily available.

Recommended answer

In the code, "Total time spent by all maps in occupied slots (ms)" is represented by the enum constant SLOTS_MILLIS_MAPS (SLOTS_MILLIS_REDUCES for reduces) in JobCounter.java. Those constants are deprecated. Their values are computed by multiplying the task duration by the ratio of the memory (in MB) used for the task to the minimum memory (in MB) needed for one YARN slot.

So, if your map task used 4 MB and the minimum slot size is 1 MB, then your task consumed 4 * duration of slot time that could have been used for other tasks. That would explain why you see different ratios for different setups. I don't find that metric particularly useful (especially since it isn't clear what it even means without diving into the code).
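To make the arithmetic concrete, here is a minimal Python sketch of the calculation described above. It is not Hadoop's code: the function name `slot_millis` and all memory values are hypothetical, and rounding the slot count up to a whole number is an assumption on my part. The memory sizes in the loop are invented only to show how ratios such as 1.0, 2.0, 6.0 and 12.0 could arise; they are not the actual settings of any of the clusters in the question.

```python
import math


def slot_millis(task_duration_ms: int, task_memory_mb: int, min_slot_mb: int) -> int:
    """Approximate 'time spent in occupied slots' for a single task.

    Hypothetical helper: multiplies the task duration by the number of
    minimum-size slots the task's memory would occupy.
    """
    if min_slot_mb <= 0:
        return 0
    # Assumption: a partially filled slot counts as a whole slot.
    slots_occupied = math.ceil(task_memory_mb / min_slot_mb)
    return slots_occupied * task_duration_ms


# Illustrative (made-up) configurations: (label, task memory MB, minimum slot MB).
examples = [
    ("answer's example", 4, 1),
    ("ratio 1.0", 1024, 1024),
    ("ratio 2.0", 2048, 1024),
    ("ratio 6.0", 1536, 256),
    ("ratio 12.0", 3072, 256),
]

duration_ms = 10_000
for label, memory_mb, min_mb in examples:
    occupied = slot_millis(duration_ms, memory_mb, min_mb)
    print(f"{label}: occupied/task ratio = {occupied / duration_ms}")
```

Under these assumptions the ratio depends only on the configured memory sizes, not on the job itself, which would be consistent with seeing a constant ratio per cluster configuration.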

08-28 06:08