Which part/class of MapReduce holds the logic that stops reduce tasks from starting?
Problem description

In Hadoop MapReduce, no reducer starts before all mappers are finished. Can someone please explain to me in which part/class/line of code this logic is implemented? I am talking about Hadoop MapReduce version 1 (NOT YARN). I have searched the MapReduce framework, but there are so many classes, and I don't understand much about the method calls and their ordering.

In other words, I need (at first for test purposes) to let the reducers start reducing even if there are still working mappers. I know that this way I will get incorrect results for the job, but for now this is the start of some work on changing parts of the framework. So where should I start to look and make changes?

Solution

This is done in the shuffle phase. For Hadoop 1.x, take a look at org.apache.hadoop.mapred.ReduceTask.ReduceCopier, which implements ShuffleConsumerPlugin. You may also want to read the "Breaking the MapReduce Stage Barrier" research paper by Verma et al.

EDIT:

After reading @chris-white's answer, I realized that my answer needed some extra explanation. In the MapReduce model, you need to wait for all mappers to finish, since the keys need to be grouped and sorted before reduce can run; in addition, you may have some speculative mappers running, and you do not yet know which of the duplicate mappers will finish first. However, as the "Breaking the MapReduce Stage Barrier" paper indicates, for some applications it may make sense not to wait for all of the mappers' output. If you want to implement this sort of behavior (most likely for research purposes), then you should take a look at the classes I mentioned above.
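To make the grouping argument concrete, here is a small word-count sketch. It is not Hadoop code and does not touch ReduceTask or ReduceCopier; the class and method names (BarrierSketch, map, reduce) are made up for illustration, and the three input splits are invented. It only shows what a reducer would compute if it ran on the map outputs fetched so far instead of waiting for all of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: NOT Hadoop code. All names here are hypothetical.
final class BarrierSketch {

    // One mapper's output for a word-count job: a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Groups by key (kept sorted by the TreeMap) and sums the values,
    // using only the map outputs that have been fetched so far.
    static Map<String, Integer> reduce(List<List<Map.Entry<String, Integer>>> fetched) {
        Map<String, Integer> totals = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> mapOutput : fetched) {
            for (Map.Entry<String, Integer> kv : mapOutput) {
                totals.merge(kv.getKey(), kv.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Three invented input splits, one per mapper.
        List<String> splits = List.of("a b a", "b c", "a c c");
        List<List<Map.Entry<String, Integer>>> all = new ArrayList<>();
        for (String split : splits) {
            all.add(map(split));
        }

        // With the barrier: reduce sees the output of every mapper.
        System.out.println("all mappers done: " + reduce(all));            // {a=3, b=2, c=3}

        // Without the barrier: the third mapper is still running.
        System.out.println("2 of 3 mappers:   " + reduce(all.subList(0, 2))); // {a=2, b=2, c=1}
    }
}
```

The second result is missing every value the unfinished mapper would have contributed for keys a and c, which is the kind of incorrect result the question anticipates. For an aggregation like this one a reducer could in principle emit partial sums and correct them later, which is roughly the situation where relaxing the barrier can make sense.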
