Problem description
I have many small input files, and I want to combine them using some input format like CombineFileInputFormat to launch fewer mapper tasks. I know I can use the Java API to do this, but I don't know whether there's a streaming jar library that supports this while I'm using Hadoop streaming.
Recommended answer
Hadoop streaming uses TextInputFormat by default, but any other input format can be used, including CombineFileInputFormat. You can change the input format from the command line using the -inputformat option. Be sure to use the old API and implement org.apache.hadoop.mapred.lib.CombineFileInputFormat; the new API isn't supported yet.
# Note: the generic -D options must come before the streaming-specific
# options such as -inputformat, otherwise the command will fail.
$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/hadoop-streaming.jar \
    -Dmapred.max.split.size=524288000 \
    -Dstream.map.input.ignoreKey=true \
    -inputformat foo.bar.MyCombineFileInputFormat \
    ...
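For reference, below is a minimal sketch of what a class like foo.bar.MyCombineFileInputFormat could look like against the old (org.apache.hadoop.mapred) API, assuming the small files are plain text. The class and package names simply match the command above; the nested reader name is arbitrary.

package foo.bar;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Packs many small text files into a few combined splits (old mapred API).
public class MyCombineFileInputFormat extends CombineFileInputFormat<LongWritable, Text> {

  @Override
  @SuppressWarnings({"unchecked", "rawtypes"})
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) throws IOException {
    // CombineFileRecordReader walks over the files packed into the split,
    // creating one delegate reader per file.
    return new CombineFileRecordReader<LongWritable, Text>(
        conf, (CombineFileSplit) split, reporter,
        (Class) CombinedLineRecordReader.class);
  }

  // Reads one file of a CombineFileSplit with a plain LineRecordReader.
  public static class CombinedLineRecordReader implements RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate;

    // This constructor signature is required by CombineFileRecordReader.
    public CombinedLineRecordReader(CombineFileSplit split, Configuration conf,
                                    Reporter reporter, Integer index) throws IOException {
      FileSplit fileSplit = new FileSplit(
          split.getPath(index), split.getOffset(index),
          split.getLength(index), split.getLocations());
      delegate = new LineRecordReader(conf, fileSplit);
    }

    @Override public boolean next(LongWritable key, Text value) throws IOException {
      return delegate.next(key, value);
    }
    @Override public LongWritable createKey() { return delegate.createKey(); }
    @Override public Text createValue() { return delegate.createValue(); }
    @Override public long getPos() throws IOException { return delegate.getPos(); }
    @Override public float getProgress() throws IOException { return delegate.getProgress(); }
    @Override public void close() throws IOException { delegate.close(); }
  }
}

Compile this into a jar and make it visible to the streaming job, for example by shipping it with the generic -libjars option, so the class named by -inputformat can be loaded.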