This article describes how to increase the number of Hive mappers in Hadoop 2. It should be a useful reference for anyone facing the same problem; follow along below to learn more!

Problem Description

I created an HBase table from Hive and I'm trying to do a simple aggregation on it. This is my Hive query:

from my_hbase_table
select col1, count(1)
group by col1;

The MapReduce job spawns only 2 mappers and I'd like to increase that. With a plain MapReduce job I would configure YARN and mapper memory to increase the number of mappers. I tried the following in Hive, but it did not work:

set yarn.nodemanager.resource.cpu-vcores=16;
set yarn.nodemanager.resource.memory-mb=32768;
set mapreduce.map.cpu.vcores=1;
set mapreduce.map.memory.mb=2048;

NOTE:


  • My test cluster has only 2 nodes

  • The HBase table has more than 5 million records

  • Hive logs show HiveInputFormat and a number of splits=2


Solution

Splitting files below the default size is not an efficient solution. Splitting is really meant for dealing with large datasets. The default value is already a small size, so it is not worth splitting further.



I would recommend the following configuration before your query. You can adjust it based on your input data.

set hive.merge.mapfiles=false;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapred.map.tasks=XX;
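
For example, a full session for the query above might look like this (8 map tasks is purely an illustrative value; pick one that suits your data volume and cluster):

-- illustrative target of 8 map tasks; tune to your input data
set hive.merge.mapfiles=false;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapred.map.tasks=8;

from my_hbase_table
select col1, count(1)
group by col1;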

If you also want to set the number of reducers, you can use the configuration below:

set mapred.reduce.tasks = XX;

Note that on Hadoop 2 (YARN), mapred.map.tasks and mapred.reduce.tasks are deprecated and have been replaced by other variables:

mapred.map.tasks     -->  mapreduce.job.maps
mapred.reduce.tasks  -->  mapreduce.job.reduces
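
So on a YARN cluster the equivalent of the settings above would be (values again illustrative, not a recommendation):

set mapreduce.job.maps = 8;
set mapreduce.job.reduces = 4;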

Please refer to the useful links below related to this:

http://answers.mapr.com/questions/5336/limit-mappers-and-reducers-for-specific-job.html

Fail to Increase Hive Mapper Tasks?
How mappers get assigned



The number of mappers is determined by the number of splits, which is determined by the InputFormat used in the MapReduce job.
With a typical InputFormat, it is directly proportional to the number of files and the file sizes.



Suppose your HDFS block size is configured to 64 MB (the default) and you have a single 100 MB file: it will occupy 2 blocks (ceil(100 / 64) = 2), and then 2 mappers will be assigned based on those blocks.



But suppose you have 2 files of 30 MB each: then each file occupies one block, a mapper is assigned per block, and you again end up with 2 mappers.
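
To summarize the two cases (64 MB block size):

one 100 MB file  ->  ceil(100 / 64) = 2 splits  ->  2 mappers
two 30 MB files  ->  1 block per file           ->  2 mappers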



When you are working with a large number of small files, Hive uses CombineHiveInputFormat by default.
In MapReduce terms, this ultimately translates to using CombineFileInputFormat, which creates virtual splits
over multiple files, grouped by common node or rack when possible. The size of the combined split is determined by

mapred.max.split.size
or
mapreduce.input.fileinputformat.split.maxsize (in YARN/MR2);

So if you want fewer splits (fewer mappers), you need to set this parameter higher; conversely, set it lower to get more splits (more mappers).
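
A minimal sketch for the question's goal of getting more mappers, with an assumed, illustrative 16 MB cap: FileInputFormat-style splitting picks a split size of roughly max(minSize, min(maxSize, blockSize)), so capping the maximum below the block size carves the input into more, smaller splits.

-- illustrative values: cap splits at 16 MB to get more mappers
set mapreduce.input.fileinputformat.split.maxsize=16777216;
set mapreduce.input.fileinputformat.split.minsize=1;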

This link can be useful to understand it better: What is the default size that each Hadoop mapper will read?



Also, the number of mappers and reducers always depends on the available mapper and reducer slots of your cluster.



That concludes this article on increasing the number of Hive mappers in Hadoop 2. We hope the answer recommended here is helpful, and thank you for your support!
