Splitting Reducer output in Hadoop

Question

The output files produced by my Reduce operation are huge (1 GB after gzipping). I want to break the output into smaller files of about 200 MB. Is there a property or Java class that splits reduce output by size or by number of lines? I cannot increase the number of reducers because that has a negative impact on the performance of the Hadoop job.

Solution

I'm curious as to why you cannot just use more reducers, but I will take you at your word.

One option is to use MultipleOutputs and write several files from each reducer. For example, say each reducer's output file is 1 GB and you want 256 MB files instead. That means writing 4 files per reducer rather than one.
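The file count in that example is a ceiling division of the reducer's output size by the target file size. Below is a minimal sketch of that arithmetic. The class and method names are hypothetical (they are not part of Hadoop), and the sizes are the ones mentioned in the question and the example.

```java
// Hypothetical helper, not part of Hadoop: works out how many output
// files each reducer should write for a given target file size.
class OutputsPerReducer {

    // Ceiling division, so a partial final file still counts as one file.
    static int filesNeeded(long reducerOutputBytes, long targetFileBytes) {
        if (reducerOutputBytes < 0 || targetFileBytes <= 0) {
            throw new IllegalArgumentException("output size must be >= 0 and target size > 0");
        }
        long files = (reducerOutputBytes + targetFileBytes - 1) / targetFileBytes;
        // Always write at least one file, even for an empty reducer.
        return (int) Math.max(1, files);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 1 GB of reducer output split into 256 MB files: prints 4.
        System.out.println(filesNeeded(1024 * mb, 256 * mb));
        // The 200 MB target from the question: prints 6.
        System.out.println(filesNeeded(1024 * mb, 200 * mb));
    }
}
```

The result is the kind of value you would pass in as outputs.per.reducer. Because records are spread across files by hash, the files should come out roughly equal in size, not exactly at the target.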

In your job driver, do this:

JobConf conf = ...;

// You should probably pass this in as parameter rather than hardcoding 4.
conf.setInt("outputs.per.reducer", 4);

// This sets up the infrastructure to write multiple files per reducer.
MultipleOutputs.addMultiNamedOutput(conf, "multi", YourOutputFormat.class, YourKey.class, YourValue.class);

In your reducer, do this:

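// Fields set up once per reducer task in configure().
private int numFiles;
private MultipleOutputs multipleOutputs;

@Override
public void configure(JobConf conf) {
  numFiles = conf.getInt("outputs.per.reducer", 1);
  multipleOutputs = new MultipleOutputs(conf);

  // other init stuff
  ...
}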

@Override
public void reduce(YourKey key,
                   Iterator<YourValue> valuesIter,
                   OutputCollector<OutKey, OutVal> ignoreThis,
                   Reporter reporter) throws IOException {
    // Do your business logic just as you're doing currently.
    OutKey outputKey = ...;
    OutVal outputVal = ...;

    // Now this is where it gets interesting. Hash the value to find
    // which output file the data should be written to. Don't use the
    // key: with the default hash partitioner, every key that reaches this
    // reducer has the same hash modulo the number of reducers, so if the
    // number of reducers is a multiple of numFiles, all the data will be
    // written to one file.
    int fileIndex = (outputVal.hashCode() & Integer.MAX_VALUE) % numFiles;

    // Now use multiple outputs to actually write the data.
    // This will create output files named: multi_0-r-00000, multi_1-r-00000,
    // multi_2-r-00000, multi_3-r-00000 for reducer 0. For reducer 1, the files
    // will be multi_0-r-00001, multi_1-r-00001, multi_2-r-00001, multi_3-r-00001.
    multipleOutputs.getCollector("multi", Integer.toString(fileIndex), reporter)
      .collect(outputKey, outputVal);
}

@Override
public void close() throws IOException {
   // You must do this, otherwise the extra output files are not closed properly!
   multipleOutputs.close();
}
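The advice in the reduce method to hash the value and not the key can be checked without a cluster. The sketch below is a stand-alone simulation, not Hadoop code: it assumes keys are routed to reducers by the default hash partitioning, (hash & Integer.MAX_VALUE) % numReducers, uses integers as stand-ins for key hashes, and uses seeded random numbers as stand-ins for value hashes. The class and method names are hypothetical.

```java
import java.util.Random;

// Stand-alone simulation, no Hadoop needed. Assumes keys reach reducers via
// the default hash partitioning: (hash & Integer.MAX_VALUE) % numReducers.
class HashSplitDemo {

    // Same non-negative modulo used for both partitioning and file choice.
    static int bucket(int hash, int buckets) {
        return (hash & Integer.MAX_VALUE) % buckets;
    }

    // Counts how many of the records routed to one reducer land in each file,
    // choosing the file from either the key hash or the value hash.
    static int[] fileCounts(int numReducers, int reducerId, int numFiles, boolean useKeyHash) {
        int[] counts = new int[numFiles];
        Random random = new Random(42); // fixed seed keeps the run repeatable
        for (int keyHash = 0; keyHash < 100_000; keyHash++) {
            int valueHash = random.nextInt(); // stand-in for outputVal.hashCode()
            if (bucket(keyHash, numReducers) != reducerId) {
                continue; // this record belongs to another reducer
            }
            counts[bucket(useKeyHash ? keyHash : valueHash, numFiles)]++;
        }
        return counts;
    }

    // Number of files that received at least one record.
    static int nonEmptyFiles(int[] counts) {
        int used = 0;
        for (int c : counts) {
            if (c > 0) {
                used++;
            }
        }
        return used;
    }

    public static void main(String[] args) {
        // 8 reducers, 4 files per reducer: 8 is a multiple of 4.
        int[] byKey = fileCounts(8, 1, 4, true);
        int[] byValue = fileCounts(8, 1, 4, false);
        System.out.println("files used when hashing the key:   " + nonEmptyFiles(byKey));
        System.out.println("files used when hashing the value: " + nonEmptyFiles(byValue));
    }
}
```

With 8 reducers and 4 files, hashing the key sends every record of reducer 1 to file 1, while hashing the value spreads the records over all 4 files. Note that hashing the value only helps if the values actually vary: records with identical values always land in the same file.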

This pseudocode was written with the old MapReduce API (org.apache.hadoop.mapred) in mind. The new API (org.apache.hadoop.mapreduce) has an equivalent MultipleOutputs class, so either way you should be all set.
