This article looks at the expected use of open file descriptors in Hadoop 0.21.0. It should be a useful reference for anyone facing the same problem; interested readers can follow along below.
Problem Description
Given Hadoop 0.21.0, what assumptions does the framework make regarding the number of open file descriptors relative to each individual map and reduce operation? Specifically, what suboperations cause Hadoop to open a new file descriptor during job execution or spill to disk?
(This is deliberately ignoring use of MultipleOutputs, as it very clearly screws with the guarantees provided by the system.)
My rationale here is simple: I'd like to ensure each job I write for Hadoop guarantees a finite number of required file descriptors for each mapper or reducer. Hadoop cheerfully abstracts this away from the programmer, which would normally be A Good Thing, if not for the other shoe dropping during server management.
I'd originally asked this question on Server Fault from the cluster management side of things. Since I'm also responsible for programming, this question is equally pertinent here.
Solution
Here's a post that offers some insight into the problem:
This implies that, for normal behavior, the number of mappers is exactly equivalent to the number of open file descriptors. MultipleOutputs obviously skews this number by the number of mappers multiplied by the number of available partitions. Reducers then proceed as normal, generating one file (and thus, one file descriptor) per reduce operation.
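To make that multiplication concrete, here is a minimal sketch of a hypothetical mapper (class, field, and partition names are my own, not from the question) using the new-API MultipleOutputs: every distinct base output path a task writes to gets its own writer, and with it another open file descriptor, on top of the task's regular map output.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Hypothetical mapper: each record is routed to an output file named after a
// field in the record, so a single map task holds one open writer (one file
// descriptor) per distinct partition value it has encountered.
public class PartitionedWriteMapper extends Mapper<LongWritable, Text, Text, Text> {

  private MultipleOutputs<Text, Text> out;

  @Override
  protected void setup(Context context) {
    out = new MultipleOutputs<Text, Text>(context);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t", 2);
    String partition = fields[0];                      // e.g. a date or country code
    String payload = fields.length > 1 ? fields[1] : "";
    // Writing to a new base path opens a fresh output file for this task.
    out.write(new Text(partition), new Text(payload), partition + "/part");
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    out.close();                                       // closes every per-partition writer
  }
}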
The problem then becomes: during a spill operation, most of these files are being held open by each mapper as output is cheerfully marshalled by split. Hence the available file descriptors problem.
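The number of spill files a mapper creates, and how many of them sit open while they are merged, can be nudged from job configuration. The sketch below is a hedged illustration: the property names assume the mapreduce.task.io.sort.* and mapreduce.map.sort.spill.percent keys that replaced the older io.sort.* names around this release, and the values are illustrative rather than recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch of map-side sort/spill tuning. Fewer, larger spills mean fewer spill
// files held open when the mapper merges them into its final output.
public class SpillTuningSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Larger in-memory sort buffer: map output spills to disk less often.
    conf.setInt("mapreduce.task.io.sort.mb", 256);

    // Start spilling only when the buffer is 90% full.
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);

    // Upper bound on how many spill files are merged, and therefore open, at once.
    conf.setInt("mapreduce.task.io.sort.factor", 50);

    Job job = new Job(conf, "fd-conscious job");
    // ... configure mapper, reducer, input and output paths as usual,
    // then submit with job.waitForCompletion(true).
  }
}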
Thus, the currently assumed maximum file descriptor limit should be:

Map phase: number of mappers * total possible partitions
Reduce phase: number of reduce operations * total possible partitions
And that, as we say, is that.
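As a quick sanity check against a node's ulimit -n, here is a back-of-the-envelope calculation based on that bound; every count in it is invented purely for illustration and is not taken from the original question or answer.

// Toy arithmetic for the bound above; all counts are assumed example values.
public class FdBoundEstimate {
  public static void main(String[] args) {
    int concurrentMappers = 8;    // map tasks running at once on one node (assumed)
    int concurrentReducers = 4;   // reduce tasks running at once on one node (assumed)
    int totalPartitions = 500;    // possible MultipleOutputs partitions (assumed)

    long mapPhase = (long) concurrentMappers * totalPartitions;     // 4000 descriptors
    long reducePhase = (long) concurrentReducers * totalPartitions; // 2000 descriptors

    long typicalUlimit = 1024;    // a common default for `ulimit -n`
    System.out.printf("map phase: %d, reduce phase: %d, typical ulimit -n: %d%n",
        mapPhase, reducePhase, typicalUlimit);
    // Even these modest numbers overshoot the default limit, which is exactly
    // why the per-partition output pattern needs either a raised descriptor
    // limit or fewer partitions per task.
  }
}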
That is all for this article on the expected use of open file descriptors in Hadoop 0.21.0. We hope the recommended answer is helpful to you, and thank you for your support!