Number of map and reduce tasks in pseudo-distributed mode

Question

I am new to Hadoop. I have successfully configured a Hadoop setup in pseudo-distributed mode. Now I would like to know the logic behind choosing the number of map and reduce tasks. What should the choice be based on?

Thanks

Answer

There is no general rule for how the number of mappers and reducers should be set.

Number of mappers: You cannot explicitly set the number of mappers to a specific value (there are parameters for this, but they do not take effect). The number is determined by the number of input splits Hadoop creates for your input. You can influence it by setting the mapred.min.split.size parameter. For more, read the InputSplit section here. If a large number of small files is generating many mappers and you want fewer of them, you will need to combine data from more than one file. Read this: How to combine input files to get to a single mapper and control number of mappers.
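To make the relationship between split size and mapper count concrete, here is a rough Python model of the arithmetic. It is a sketch, not Hadoop code: the function names are made up for illustration, and it assumes the rule split size = max(min size, min(max size, block size)) used by FileInputFormat in the newer API. It also assumes splittable files and ignores details such as the small slack Hadoop allows on the last split of a file.

```python
import math

MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Assumed rule: the block size, clamped between the minimum and maximum split size.
    return max(min_size, min(max_size, block_size))


def estimate_splits(file_sizes, block_size, min_size=1, max_size=2**63 - 1):
    """Estimate the number of input splits (roughly, map tasks) for a set of files.

    Splits never span files, so each file contributes at least one split.
    """
    split_size = compute_split_size(block_size, min_size, max_size)
    return sum(max(1, math.ceil(size / split_size)) for size in file_sizes)


# One 1 GB file with 64 MB blocks: 16 splits.
print(estimate_splits([1024 * MB], 64 * MB))
# Raising the minimum split size to 256 MB: 4 splits.
print(estimate_splits([1024 * MB], 64 * MB, min_size=256 * MB))
# 1000 files of 1 MB each: 1000 splits, whatever the minimum split size is.
print(estimate_splits([1 * MB] * 1000, 64 * MB, min_size=256 * MB))
```

The last case shows why a larger minimum split size does not help with many small files: each file still gets its own split, so the inputs have to be combined.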

To quote from the wiki page:

Number of reducers: You can set the number of reducers explicitly by setting the mapred.reduce.tasks parameter. There are guidelines for choosing this number, but the default number of reducers is usually good enough. Sometimes a single report file is required; in that case, you may want to set the number of reducers to 1.
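The link between the reducer count and the number of output files can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not Hadoop code: `partition` and `run_reduce_phase` are hypothetical names, `zlib.crc32` stands in for the key's hash code, and the `part-NNNNN` file names are only illustrative. The partitioning follows the usual hash-modulo scheme, and each reducer is assumed to write exactly one output file.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # Hash-modulo partitioning; crc32 is a deterministic stand-in for the key's hash code.
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers


def run_reduce_phase(pairs, num_reducers):
    """Route (key, value) pairs to reducers and sum the values per key.

    Returns {output_file_name: {key: total}}, one entry per reducer,
    including reducers that received no keys.
    """
    outputs = {f"part-{i:05d}": defaultdict(int) for i in range(num_reducers)}
    for key, value in pairs:
        outputs[f"part-{partition(key, num_reducers):05d}"][key] += value
    return {name: dict(totals) for name, totals in outputs.items()}


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]

# With one reducer, all results land in a single file.
print(run_reduce_phase(pairs, 1))
# With three reducers there are three files, and each key appears in exactly one of them.
print(run_reduce_phase(pairs, 3))
```

With more than one reducer the results are spread over several files, which is why a job that must produce one report file is typically run with a single reducer.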

Again to quote from wiki:
