Question
I'm new to Hadoop. I am trying to use MultipleOutputFormat with Hadoop 2.2.0, but it seems to work only with the deprecated JobConf, which in turn uses the deprecated Mapper and Reducer (org.apache.hadoop.mapred.Reducer) and so on. Any ideas on how to achieve multiple-output functionality with the new org.apache.hadoop.mapreduce.Job?
Recommended answer
As @JudgeMental noted, you should use MultipleOutputs with the new API (mapreduce), because MultipleOutputFormat only supports the old API (mapred). MultipleOutputs actually provides more features than MultipleOutputFormat:
- With MultipleOutputs, each output can have its own OutputFormat, whereas with MultipleOutputFormat every output has to use the same OutputFormat.
- With MultipleOutputFormat, you have more control over the naming scheme and output directory structure than with MultipleOutputs.
- You can use MultipleOutputs in both the map and reduce functions of the same job, something you cannot do with MultipleOutputFormat.
- With MultipleOutputs, you can have different key and value types for different outputs.
So neither one fully replaces the other: MultipleOutputs has more features, but it is less flexible regarding naming.
To learn how to use MultipleOutputs, take a look at the MultipleOutputs documentation, which contains a complete example. In short, here is what you would put in the driver class:
// Defines additional single text based output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, LongWritable.class, Text.class);
// Defines additional sequence-file based output 'seq' for the job
MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class, LongWritable.class, Text.class);
In your Mapper or Reducer, initialize MultipleOutputs in the setup method with mos = new MultipleOutputs<>(context); where mos is a field of the class rather than a local variable, so that it stays reachable from the other methods. You can then use it in the map and reduce functions, for example mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a"). Don't forget to close it in the cleanup method with mos.close()!
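To make the naming point concrete: the last argument of the write call ("seq_a" here) is only a base name. As far as I know, Hadoop's file output formats then append the task type (m for map, r for reduce) and a five-digit, zero-padded partition number to it. The sketch below is plain Java that mirrors that convention so you can predict the resulting file names. It is not Hadoop code: the class NamedOutputFileNames and the method outputFileName are hypothetical names made up for this illustration, and any file extension added by a compression codec is ignored.

```java
import java.util.Locale;

// Illustration only: mimics the "<base>-<m|r>-<partition>" naming convention.
// NamedOutputFileNames and outputFileName are hypothetical, not part of Hadoop.
final class NamedOutputFileNames {

    // Builds the file name a task would produce for a given base output path.
    static String outputFileName(String baseOutputPath, boolean isMapTask, int partition) {
        char taskType = isMapTask ? 'm' : 'r';
        // %05d pads the partition number with zeros to at least five digits.
        return String.format(Locale.ROOT, "%s-%c-%05d", baseOutputPath, taskType, partition);
    }

    public static void main(String[] args) {
        // Reducer for partition 0 writing with base output path "seq_a".
        System.out.println(outputFileName("seq_a", false, 0));
        // Mapper for partition 12 writing with base output path "text".
        System.out.println(outputFileName("text", true, 12));
    }
}
```

So with MultipleOutputs you choose the prefix (which may contain a / to place files in a subdirectory of the job output directory), but the -m-/-r- suffix stays outside your control. That is the reduced naming flexibility mentioned above.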