Problem description
I have many small input files, and I want to combine them using some input format like CombineFileInputFormat to launch fewer mapper tasks. I know I can use the Java API to do this, but I don't know whether there's a streaming jar library that supports this while I'm using Hadoop streaming.
Recommended answer
Hadoop streaming uses TextInputFormat by default, but any other input format can be used, including CombineFileInputFormat. You can change the input format from the command line using the -inputformat option. Be sure to use the old API and implement org.apache.hadoop.mapred.lib.CombineFileInputFormat; the new API isn't supported yet.
# Note: the generic -D options must come before the streaming-specific
# options such as -inputformat, otherwise the command will fail.
$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/hadoop-streaming.jar \
    -Dmapred.max.split.size=524288000 \
    -Dstream.map.input.ignoreKey=true \
    -inputformat foo.bar.MyCombineFileInputFormat \
    ...
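For reference, below is a minimal sketch of what a class like foo.bar.MyCombineFileInputFormat could look like against the old (org.apache.hadoop.mapred) API, assuming the small files are plain text. The class and package names simply match the command above; the nested reader name is arbitrary.

package foo.bar;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Packs many small text files into a few combined splits (old mapred API).
public class MyCombineFileInputFormat extends CombineFileInputFormat<LongWritable, Text> {

  @Override
  @SuppressWarnings({"unchecked", "rawtypes"})
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) throws IOException {
    // CombineFileRecordReader walks over the files packed into the split,
    // creating one delegate reader per file.
    return new CombineFileRecordReader<LongWritable, Text>(
        conf, (CombineFileSplit) split, reporter,
        (Class) CombinedLineRecordReader.class);
  }

  // Reads one file of a CombineFileSplit with a plain LineRecordReader.
  public static class CombinedLineRecordReader implements RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate;

    // This constructor signature is required by CombineFileRecordReader.
    public CombinedLineRecordReader(CombineFileSplit split, Configuration conf,
                                    Reporter reporter, Integer index) throws IOException {
      FileSplit fileSplit = new FileSplit(
          split.getPath(index), split.getOffset(index),
          split.getLength(index), split.getLocations());
      delegate = new LineRecordReader(conf, fileSplit);
    }

    @Override public boolean next(LongWritable key, Text value) throws IOException {
      return delegate.next(key, value);
    }
    @Override public LongWritable createKey() { return delegate.createKey(); }
    @Override public Text createValue() { return delegate.createValue(); }
    @Override public long getPos() throws IOException { return delegate.getPos(); }
    @Override public float getProgress() throws IOException { return delegate.getProgress(); }
    @Override public void close() throws IOException { delegate.close(); }
  }
}

Compile this into a jar and make it visible to the streaming job, for example by shipping it with the generic -libjars option, so the class named by -inputformat can be loaded.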