Hadoop MultipleOutputFormat support for org.apache.hadoop.mapreduce.Job


Question

I'm new to Hadoop. I am trying to use MultipleOutputFormat with Hadoop 2.2.0, but it seems to work only with the deprecated 'JobConf', which in turn uses the deprecated Mapper and Reducer (org.apache.hadoop.mapred.Reducer), etc. Any ideas on how to achieve multiple-output functionality with the new 'org.apache.hadoop.mapreduce.Job'?

Recommended answer

As @JudgeMental noted, you should use MultipleOutputs with the new API (mapreduce), because MultipleOutputFormat only supports the old API (mapred). MultipleOutputs also gives you more features than MultipleOutputFormat:

With MultipleOutputs, each output can have its own OutputFormat, whereas with MultipleOutputFormat every output has to use the same OutputFormat.
With MultipleOutputFormat, you have more control over the naming scheme and output directory structure than with MultipleOutputs.
You can use MultipleOutputs in both the map and reduce functions of the same job, which you cannot do with MultipleOutputFormat.
With MultipleOutputs, different outputs can have different key and value types.

So neither fully replaces the other: although MultipleOutputs has more features, it is less flexible when it comes to naming.
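To make the naming point concrete, here is a minimal sketch of the file names MultipleOutputs typically ends up with. It assumes the usual FileOutputFormat pattern of base name, then `m` or `r` for a map or reduce task, then a five-digit partition number; an output format may append an extension. The class name, method name, and the base name `report` are hypothetical and are not part of the Hadoop API.

```java
import java.util.Locale;

public class OutputNameSketch {

    // Assumed pattern: <baseName>-<m|r>-<five-digit partition number>.
    // This mirrors the conventional naming and does not call Hadoop itself.
    public static String outputFileName(String baseName, boolean mapTask, int partition) {
        return String.format(Locale.ROOT, "%s-%s-%05d", baseName, mapTask ? "m" : "r", partition);
    }

    public static void main(String[] args) {
        // Reduce task for partition 0, then map task for partition 12.
        System.out.println(outputFileName("report", false, 0));
        System.out.println(outputFileName("report", true, 12));
    }
}
```

Under that assumption you choose only the base name; the task-type and partition suffix is added for you, which is the reduced flexibility mentioned above.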

To learn how to use MultipleOutputs, take a look at the MultipleOutputs documentation, which contains a complete example. In short, here is what you would put in the driver class:

// Defines additional single text based output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, LongWritable.class, Text.class);

// Defines additional sequence-file based output 'seq' for the job
MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class, LongWritable.class, Text.class);

In your Mapper or Reducer, initialize MultipleOutputs in the setup method with MultipleOutputs mos = new MultipleOutputs(context); you can then use it in the map and reduce functions, for example mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a"). Don't forget to close it in the cleanup method with mos.close()!
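Putting the pieces together, a reducer using the setup/write/cleanup pattern might look roughly like the sketch below. It assumes the Hadoop 2.x client libraries are on the classpath. The class names `MultipleOutputsSketch` and `SketchReducer`, the helper `addOutputs`, and the choice of LongWritable/Text as input and output types are illustrative assumptions; only the named outputs "text" and "seq" and the base path "seq_a" come from the answer.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MultipleOutputsSketch {

    // Driver side: register the two named outputs on the new-API Job.
    public static void addOutputs(Job job) {
        MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
                LongWritable.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class,
                LongWritable.class, Text.class);
    }

    // Hypothetical reducer; the key/value types are assumptions for illustration.
    public static class SketchReducer
            extends Reducer<LongWritable, Text, LongWritable, Text> {

        private MultipleOutputs<LongWritable, Text> mos;

        @Override
        protected void setup(Context context) {
            // Create one MultipleOutputs instance per task.
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // Write to the named output "text" using its default file name.
                mos.write("text", key, value);
                // Write to the named output "seq" under the base path "seq_a".
                mos.write("seq", key, value, "seq_a");
                // The regular job output is still available through the context.
                context.write(key, value);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Closing flushes and closes the writers for all named outputs.
            mos.close();
        }
    }
}
```

In a driver you would call MultipleOutputsSketch.addOutputs(job) and job.setReducerClass(MultipleOutputsSketch.SketchReducer.class) alongside the usual input and output path setup. The same pattern applies in a Mapper.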
