This post covers Hadoop file splitting with CompositeInputFormat for an inner join.

Problem description

I am using CompositeInputFormat to provide input to a Hadoop job.

The number of splits generated equals the total number of files given as input to CompositeInputFormat (for the join).

The job completely ignores the block size and max split size while taking input from CompositeInputFormat. This results in long-running map tasks and slows the system down, because the input files are larger than the block size.

Is anyone aware of a way to manage the number of splits for CompositeInputFormat?
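For context, a job wired up this way typically looks roughly like the sketch below. The class names, paths, and map logic are illustrative assumptions (not the asker's actual code); it assumes the new `mapreduce` API and two inputs readable by KeyValueTextInputFormat that are already sorted and partitioned identically on the join key.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
import org.apache.hadoop.mapreduce.lib.join.TupleWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical driver for a map-side inner join via CompositeInputFormat.
public class JoinDriver {

  // The composite reader hands each mapper the join key plus a TupleWritable
  // holding one value per joined input.
  public static class JoinMapper extends Mapper<Text, TupleWritable, Text, Text> {
    @Override
    protected void map(Text key, TupleWritable value, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(key, new Text(value.get(0) + "\t" + value.get(1)));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-side inner join");
    job.setJarByClass(JoinDriver.class);

    // "inner" join expression over two inputs that must already be sorted
    // and partitioned identically on the join key.
    job.getConfiguration().set(CompositeInputFormat.JOIN_EXPR,
        CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
            new Path(args[0]), new Path(args[1])));

    job.setInputFormatClass(CompositeInputFormat.class);
    job.setMapperClass(JoinMapper.class);
    job.setNumReduceTasks(0);                 // map-only join
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```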

Recommended answer

Unfortunately, CompositeInputFormat has to ignore the block/split size. With CompositeInputFormat, the input files need to be sorted and partitioned identically, so Hadoop has no way to determine where to split a file while preserving that property.

The only way to get around this is to manually split and partition the files into smaller pieces. You can do this by passing the data through a MapReduce job (probably just an identity mapper and identity reducer) with a larger number of reducers. Just be sure to run both of your data sets through with the same number of reducers.
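A minimal sketch of such a repartitioning job is shown below, assuming text inputs whose join key is the first tab-separated field (so KeyValueTextInputFormat can keep it as the record key). The class name, argument layout, and reducer count are placeholders; run the same job once per data set with the same number of reducers so the default HashPartitioner sends matching keys to matching part files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver that re-partitions one data set into N sorted,
// hash-partitioned parts; run it once per data set with the same N.
public class Repartition {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "repartition for map-side join");
    job.setJarByClass(Repartition.class);

    // Key/value text input so the join key stays the record key end to end.
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // Identity mapper and reducer: the shuffle does the real work,
    // hash-partitioning and sorting the records by join key.
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // Use the same reducer count for every data set, e.g. args[2] = "32".
    job.setNumReduceTasks(Integer.parseInt(args[2]));

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

More reducers mean more, smaller part files, so the subsequent CompositeInputFormat job gets more (and shorter) map tasks, which is the only lever available since the block size is ignored.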
