Problem description
I'm new to Hadoop and am trying out the WordCount program.
To write to multiple output files, I'm using MultipleOutputs. This link helped me set it up: http://hadoop.apache.org/common/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
In my driver class I have:
    MultipleOutputs.addNamedOutput(conf, "even",
            org.apache.hadoop.mapred.TextOutputFormat.class, Text.class,
            IntWritable.class);
    MultipleOutputs.addNamedOutput(conf, "odd",
            org.apache.hadoop.mapred.TextOutputFormat.class, Text.class,
            IntWritable.class);
and my reduce class became:
    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {

        private MultipleOutputs mos;

        public void configure(JobConf job) {
            // Create the named-output helper once per reduce task.
            mos = new MultipleOutputs(job);
        }

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Sum all counts for this word.
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            // Route the total to the "even" or "odd" named output.
            if (sum % 2 == 0) {
                mos.getCollector("even", reporter).collect(key, new IntWritable(sum));
            } else {
                mos.getCollector("odd", reporter).collect(key, new IntWritable(sum));
            }
            //output.collect(key, new IntWritable(sum));
        }

        @Override
        public void close() throws IOException {
            // Flush and close the named outputs.
            mos.close();
        }
    }
Things worked, but I get a lot of files (one odd and one even file per reduce task).
Question: how can I get just two output files (odd and even), so that every odd result from every reduce task is written to the one odd file, and likewise for even?
Each reducer writes its records through its own OutputFormat instance, which is why you get one set of odd and even files per reducer. This is by design, so that the reducers can write in parallel.
If you want just a single odd file and a single even file, you'll need to set mapred.reduce.tasks to 1 (for example, by calling conf.setNumReduceTasks(1) in the driver). Performance will suffer, though, because all the mappers will then feed into a single reducer.
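To make the file count concrete, here is a rough plain-Java model (no Hadoop dependency) of how keys spread over reduce tasks. The class and method names are hypothetical. The partition formula mirrors the one in Hadoop's default HashPartitioner, but it is applied here to String.hashCode() rather than Text.hashCode(), and the "<namedOutput>-r-<partition>" file naming is an assumption about what the old API produces, so treat the output as illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model, not part of Hadoop: predicts which named-output
// files a job like the one above could create.
final class NamedOutputFileModel {

    // Same formula as Hadoop's default HashPartitioner, applied to a String key.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Collects the file names for the given word counts, assuming the
    // "<namedOutput>-r-<partition>" naming pattern (an assumption).
    static Set<String> outputFiles(Map<String, Integer> counts, int numReduceTasks) {
        Set<String> files = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            String namedOutput = (entry.getValue() % 2 == 0) ? "even" : "odd";
            int p = partition(entry.getKey(), numReduceTasks);
            files.add(String.format(Locale.ROOT, "%s-r-%05d", namedOutput, p));
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("apple", 3);
        counts.put("pear", 4);
        counts.put("plum", 7);
        counts.put("fig", 2);
        // With several reduce tasks, up to 2 files per task can appear.
        System.out.println("4 reducers: " + outputFiles(counts, 4));
        // With one reduce task, every key lands in partition 0.
        System.out.println("1 reducer:  " + outputFiles(counts, 1));
    }
}
```

With one reduce task every key falls into partition 0, so at most one "even" and one "odd" file can exist; with N tasks the upper bound is 2N.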
Another option is to change the process that reads these files so that it accepts multiple input files, or to write a separate process that merges the files together.
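A separate merge step could look roughly like the sketch below. It is plain Java and assumes the job output has already been copied to a local directory and that the files follow a "<namedOutput>-r-<partition>" naming pattern. Both are assumptions, and the class name NamedOutputMerger is made up for illustration. For files that still live on HDFS you would read them through Hadoop's FileSystem API instead.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: concatenates all part files of one named output.
final class NamedOutputMerger {

    // Merges dir/<namedOutput>-r-* into target, in sorted file-name order.
    // Returns the number of part files that were merged.
    static int merge(Path dir, String namedOutput, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                Files.newDirectoryStream(dir, namedOutput + "-r-*")) {
            for (Path part : stream) {
                parts.add(part);
            }
        }
        Collections.sort(parts); // directory order is unspecified, so sort explicitly
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // append the bytes of this part file
            }
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: NamedOutputMerger <local-output-dir>");
            return;
        }
        Path dir = Paths.get(args[0]);
        for (String name : new String[] {"even", "odd"}) {
            int merged = merge(dir, name, dir.resolve(name + ".txt"));
            System.out.println(name + ": merged " + merged + " file(s)");
        }
    }
}
```

This keeps the job itself parallel and pays the cost of producing a single file only once, after the reducers have finished.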