Problem Description
In Hive I often run queries like:
select columnA, sum(columnB) from ... group by ...
I read some MapReduce examples in which one reducer can only produce one key. It seems the number of reducers depends entirely on the number of keys in columnA.
Therefore, why does Hive let you set the number of reducers manually?
If there are 10 distinct values in columnA and I set the number of reducers to 2, what will happen? Will each reducer be reused 5 times?
If there are 10 distinct values in columnA and I set the number of reducers to 20, what will happen? Will Hive only generate 10 reducers?
Normally you should not set the exact number of reducers manually. Use hive.exec.reducers.bytes.per.reducer instead:
--The number of reduce tasks is determined at compile time
--The default size is 1 GB, so if the estimated input size is 10 GB, 10 reducers will be used
set hive.exec.reducers.bytes.per.reducer=67108864;
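The compile-time estimate described above can be sketched roughly as follows. This is a simplified illustration in Python, not Hive's actual code; the real logic also accounts for statistics and other settings, and the cap of 1009 is assumed here as the default of hive.exec.reducers.max:

```python
import math

def estimate_reducers(input_size_bytes, bytes_per_reducer, max_reducers):
    """Rough sketch of Hive's compile-time reducer estimate:
    ceil(input size / bytes per reducer), capped at the configured maximum."""
    reducers = math.ceil(input_size_bytes / bytes_per_reducer)
    return max(1, min(reducers, max_reducers))

# 10 GB of estimated input with the default 1 GB per reducer -> 10 reducers
print(estimate_reducers(10 * 1024**3, 1024**3, 1009))       # 10

# Lowering bytes.per.reducer to 64 MB raises the estimate accordingly
print(estimate_reducers(10 * 1024**3, 64 * 1024**2, 1009))  # 160
```

This is why lowering bytes.per.reducer increases parallelism: the same input size divides into more, smaller reduce tasks.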
If you want to limit the cluster usage of a job's reducers, you can set this property: hive.exec.reducers.max
If you are running on Tez, at execution time Hive can dynamically set the number of reducers if this property is set:
set hive.tez.auto.reducer.parallelism=true;
In this case the number of reducers initially started may be bigger, because it is estimated based on size; at runtime the extra reducers can be removed.
One reducer can process many keys; how many depends on the data size and on the bytes.per.reducer and reducer-limit configuration settings. For a query like yours, the same keys will always go to the same reducer: each reducer container runs in isolation, so all rows having a particular key must be passed to a single reducer for it to be able to compute the aggregate for that key.
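The key-to-reducer assignment can be illustrated with a toy hash partitioner. This is a sketch in Python, not Hive's implementation, but MapReduce's default partitioning works on the same principle (hash of the key modulo the number of reducers):

```python
from collections import defaultdict

def partition(keys, num_reducers):
    """Toy hash partitioner: a given key always lands on the same reducer."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[hash(key) % num_reducers].append(key)
    return buckets

keys = [f"key{i}" for i in range(10)]  # 10 distinct values of columnA

# 2 reducers: each reducer processes several keys (roughly 5 each)
print({r: len(ks) for r, ks in partition(keys, 2).items()})

# 20 reducers: at most 10 partitions are non-empty; the rest have no work
print(len(partition(keys, 20)))
```

So with 2 reducers each one handles multiple keys, and with 20 reducers at least 10 of them receive no rows at all, which is exactly the "nothing to process" case described below.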
Extra reducers can be forced (mapreduce.job.reduces=N) or started automatically based on a wrong estimate (because of stale statistics). If they are not removed at run time, they will do nothing and finish quickly, because there is nothing for them to process. But such reducers will still be scheduled and have containers allocated, so it is better not to force extra reducers and to keep statistics fresh for better estimates.
This concludes the article on what will happen if the number of Hive reducers differs from the number of keys. We hope the answer above is helpful.