Hadoop: How to output different format types in the same job?

Problem description

I want to output both gzip and lzo formats from a single job.

I used MultipleOutputs and added two named outputs like this:

MultipleOutputs.addNamedOutput(job, "LzoOutput", GBKTextOutputFormat.class, Text.class, Text.class);

GBKTextOutputFormat.setOutputCompressorClass(job, LzoCodec.class);

MultipleOutputs.addNamedOutput(job, "GzOutput", TextOutputFormat.class, Text.class, Text.class);

TextOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

(GBKTextOutputFormat is a class I wrote myself; it extends FileOutputFormat.)

They are used in the reducer like this:

multipleOutputs.write("LzoOutput", NullWritable.get(), value, "/user/hadoop/lzo/"+key.toString());

multipleOutputs.write("GzOutput", NullWritable.get(), value, "/user/hadoop/gzip/"+key.toString());

The result: I get output in both paths, but both outputs are in gzip format.

Can someone help me? Thanks!

==========================================================================

Update:

I just looked at the source code of setOutputCompressorClass in FileOutputFormat. It calls conf.setClass("mapred.output.compression.codec", codecClass, CompressionCodec.class);

So it seems that mapred.output.compression.codec in the configuration is overwritten every time setOutputCompressorClass is called.

The compression format actually used is therefore the one set last. Does that mean we cannot set two different compression formats in the same job? Or have I overlooked something?
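The overwrite described above can be sketched without Hadoop. This is a minimal illustration, not Hadoop's code: a plain Map stands in for the job Configuration, the codecs are represented by their names only, and the class SharedCodecKeyDemo and its methods are hypothetical. Only the property key comes from the source excerpt quoted in the question.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedCodecKeyDemo {
    // Property key quoted in the question from FileOutputFormat's source.
    static final String CODEC_KEY = "mapred.output.compression.codec";

    // Hypothetical stand-in for the static setter: it writes one job-wide key
    // and knows nothing about which named output it was "meant" for.
    static void setOutputCompressorClass(Map<String, String> conf, String codecClassName) {
        conf.put(CODEC_KEY, codecClassName);
    }

    // Repeats the question's call order: LzoCodec first, GzipCodec second.
    static String configuredCodec() {
        Map<String, String> conf = new HashMap<>();
        setOutputCompressorClass(conf, "LzoCodec");
        setOutputCompressorClass(conf, "GzipCodec"); // replaces the previous value
        return conf.get(CODEC_KEY);
    }

    public static void main(String[] args) {
        System.out.println(configuredCodec()); // prints "GzipCodec"
    }
}
```

Because both calls write the same key, only the last value survives, which matches the observation that both paths end up gzip-compressed.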

Recommended answer

So maybe, as a workaround, try setting the correct codec directly in the configuration just before the write call to each output. Configuration itself has no setOutputCompressorClass method, so set the property that FileOutputFormat.setOutputCompressorClass writes (use LzoCodec.class before writing to the lzo output):

context.getConfiguration().setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

It does look like MultipleOutputs does not handle output format configuration parameters well, apart from key class, value class, and output path, so we may have to write a bit of code to offset that oversight.
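One possible shape for that extra code is a per-output setting that a custom output format could consult before falling back to the job-wide codec. The sketch below only illustrates the lookup logic with a plain Map in place of Configuration. The class PerOutputCodecDemo, its methods, and the "example.namedoutput.codec." prefix are hypothetical and are not Hadoop properties or APIs.

```java
import java.util.HashMap;
import java.util.Map;

public class PerOutputCodecDemo {
    // Job-wide key quoted in the question.
    static final String JOB_WIDE_KEY = "mapred.output.compression.codec";
    // Hypothetical prefix for per-named-output settings; not a real Hadoop property.
    static final String PER_OUTPUT_PREFIX = "example.namedoutput.codec.";

    // Stores the codec under a key that includes the named output,
    // so different outputs no longer overwrite each other.
    static void setCodecFor(Map<String, String> conf, String namedOutput, String codecClassName) {
        conf.put(PER_OUTPUT_PREFIX + namedOutput, codecClassName);
    }

    // Prefers the per-output value and falls back to the job-wide one.
    static String codecFor(Map<String, String> conf, String namedOutput) {
        String perOutput = conf.get(PER_OUTPUT_PREFIX + namedOutput);
        return perOutput != null ? perOutput : conf.get(JOB_WIDE_KEY);
    }

    // Builds a configuration resembling the question's setup.
    static Map<String, String> exampleConf() {
        Map<String, String> conf = new HashMap<>();
        conf.put(JOB_WIDE_KEY, "GzipCodec");
        setCodecFor(conf, "LzoOutput", "LzoCodec");
        setCodecFor(conf, "GzOutput", "GzipCodec");
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = exampleConf();
        System.out.println(codecFor(conf, "LzoOutput"));
        System.out.println(codecFor(conf, "GzOutput"));
    }
}
```

In a real job, the equivalent lookup would have to live in the output format itself (for example in the custom GBKTextOutputFormat), since that is where the compression codec is chosen when the record writer is created.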


08-24 03:16