I created a map method that reads the map output of the word count example [1]. This example does not use the IdentityMapper.class that MapReduce provides, but this is the only way I found to build a working IdentityMapper for word count. The only problem is that this mapper takes far more time than I would like, and I am starting to think I may be doing something redundant. Any help improving the WordCountIdentityMapper code?

[1] The identity mapper

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountIdentityMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context
    ) throws IOException, InterruptedException {
        // Each input line is a "word count" pair written by the word count job's map phase
        StringTokenizer itr = new StringTokenizer(value.toString());
        word.set(itr.nextToken());
        Integer val = Integer.valueOf(itr.nextToken());
        context.write(word, new IntWritable(val));
    }

    public void run(Context context) throws IOException, InterruptedException {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    }
}

[2] The Map class that generates the map output
public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context
    ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());

        // Emit (word, 1) for every token on the input line
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }

    public void run(Context context) throws IOException, InterruptedException {
        try {
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}

Thanks,

Best answer

The solution was to replace the StringTokenizer approach with indexOf(). It works much better, and I got noticeably better performance.
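A minimal sketch of what that swap could look like, assuming the first job's map output lines are tab-separated "word<TAB>count" pairs (the TextOutputFormat default); the separator handling and malformed-line check here are illustrative assumptions, not the exact code from the answer:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountIdentityMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();
    private final IntWritable count = new IntWritable();

    public void map(LongWritable key, Text value, Context context
    ) throws IOException, InterruptedException {
        String line = value.toString();
        // TextOutputFormat separates key and value with a tab by default;
        // adjust the separator if the first job wrote its output differently.
        int sep = line.indexOf('\t');
        if (sep < 0) {
            return; // skip lines without a separator
        }
        word.set(line.substring(0, sep));
        count.set(Integer.parseInt(line.substring(sep + 1).trim()));
        context.write(word, count);
    }
}

Reusing a single IntWritable per mapper instead of allocating a new one for every record is an additional small saving, separate from the indexOf() change the answer describes.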
