Problem Description
Several places say the default number of reducers in a Hadoop job is 1. You can use the mapred.reduce.tasks setting to manually set the number of reducers.
When I run a Hive job (on Amazon EMR, AMI 2.3.3), it has some number of reducers greater than one. Looking at the job settings, something has set mapred.reduce.tasks, I presume Hive. How does it choose that number?
Note: here are some messages printed while running a Hive job that should be a clue:
...
Number of reduce tasks not specified. Estimated from input data size: 500
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
...
Recommended Answer
The default of 1 is probably for a vanilla Hadoop install. Hive overrides it.
In open-source Hive (and likely EMR):
# reducers = (# bytes of input to mappers)
/ (hive.exec.reducers.bytes.per.reducer)
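As a rough illustration (the input size below is an assumption chosen to match the log above, not a figure from the question): with the default hive.exec.reducers.bytes.per.reducer of roughly 1 GB, a query whose mappers read about 500 GB of input would be estimated at 500 reducers, which is consistent with the "Estimated from input data size: 500" message. Lowering the per-reducer target raises the estimate, for example:

set hive.exec.reducers.bytes.per.reducer=536870912;  -- 512 MB per reducer; the same input would now estimate roughly 1000 reducers, subject to the hive.exec.reducers.max cap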
This post says the default hive.exec.reducers.bytes.per.reducer is 1G.
You can limit the number of reducers produced by this heuristic using hive.exec.reducers.max.
If you know exactly the number of reducers you want, you can set mapred.reduce.tasks, and this will override all heuristics. (By default this is set to -1, indicating Hive should use its heuristics.)
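A minimal sketch of the three knobs from the messages above (the numeric values are arbitrary examples):

set hive.exec.reducers.max=100;   -- cap whatever the heuristic estimates at 100 reducers
set mapred.reduce.tasks=32;       -- force exactly 32 reducers, bypassing the heuristic
set mapred.reduce.tasks=-1;       -- go back to letting Hive estimate from the input size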
In some cases - say 'select count(1) from T' - Hive will set the number of reducers to 1, irrespective of the size of the input data. These are called 'full aggregates', and if the only thing the query does is full aggregates, then the compiler knows that the data from the mappers will be reduced to a trivial amount and there's no point in running multiple reducers.
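For contrast, a small sketch (T is the table from the example above; col is a hypothetical column):

select count(1) from T;                     -- full aggregate: compiled with a single reducer regardless of input size
select col, count(1) from T group by col;   -- grouped aggregate: the reducer count again comes from the size heuristic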