I am always confused about how many mappers and reducers will get created for a particular task in Hive. For example, if the block size is 128 MB and there are 365 files, each mapping to a date in the year (file size = 1 MB each), and the table is partitioned on the date column, how many mappers and reducers will run while loading the data?
Mappers:
The number of mappers depends on various factors, such as how the data is distributed among the nodes, the input format, the execution engine, and configuration parameters. See also here: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
MR uses CombineInputFormat, while Tez uses grouped splits.
Tez:
set tez.grouping.min-size=16777216; -- 16 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split
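Applied to the scenario in the question (365 files of 1 MB each, about 365 MB in total), a rough, hedged estimate under the default grouping settings above:
-- 365 files x 1 MB each = ~365 MB of input in total
-- grouped splits are at least tez.grouping.min-size = 16 MB, so
-- ceil(365 / 16) = 23 grouped splits at most, i.e. roughly 23 mappers;
-- Tez may form fewer, larger groups depending on waves and data locality.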
MapReduce:
set mapreduce.input.fileinputformat.split.minsize=16777216; -- 16 MB min split
set mapreduce.input.fileinputformat.split.maxsize=1073741824; -- 1 GB max split
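On the MR side the combining is done by Hive's CombineHiveInputFormat (its usual default input format); a minimal sketch, assuming the split sizes above:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- with a 1 GB max split size, the 365 x 1 MB files could in principle be
-- combined into a single split, but combining does not cross node/rack
-- boundaries, so the actual mapper count depends on block placement.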
Also, mappers run on the data nodes where the data is located, which is why manually controlling the number of mappers is not an easy task, and it is not always possible to combine the input.
Reducers:
Controlling the number of reducers is much easier. The number of reducers is determined according to:
mapred.reduce.tasks
- The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out the number of reducers.
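For example (the values here are illustrative):
set mapred.reduce.tasks=-1; -- Hive default: derive the reducer count automatically
set mapred.reduce.tasks=10; -- or force exactly 10 reducers for the job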
hive.exec.reducers.bytes.per.reducer
- The size of input data processed per reducer. The default prior to Hive 0.14.0 is 1 GB; in Hive 0.14.0 and later it is 256 MB.
Also hive.exec.reducers.max
- The maximum number of reducers that will be used. If mapred.reduce.tasks is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.
So, if you want to increase reducer parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer.
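Putting it together for the question's ~365 MB of input, a hedged sketch using the formula Hive applies when mapred.reduce.tasks = -1, reducers = min(hive.exec.reducers.max, ceil(total input size / hive.exec.reducers.bytes.per.reducer)):
set hive.exec.reducers.max=1009;                   -- upper cap (illustrative value)
set hive.exec.reducers.bytes.per.reducer=67108864; -- 64 MB per reducer
-- ceil(365 MB / 64 MB) = 6, which is well below the cap,
-- so the query would run with roughly 6 reducers.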