







I have two files with different data formats in HDFS. What would the job setup look like if I needed to reduce across both data files?

E.g. imagine the common word count problem, where one file uses a space as the word delimiter and the other uses an underscore. In my approach I need different mappers for the various file formats, which then feed into a common reducer.

How do I do that? Or is there a better solution than mine?


Check out the MultipleInputs class, which solves this exact problem. It's pretty neat: for each input path, you pass in the InputFormat and, optionally, the Mapper class.
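Here is a minimal sketch of what that job setup could look like for the word count case in the question, assuming the newer `org.apache.hadoop.mapreduce` API and Hadoop on the classpath. The class names `MixedDelimiterWordCount`, `DelimitedMapper`, `SpaceMapper` and `UnderscoreMapper` are made up for illustration, and the three command-line arguments (two input paths, one output path) are an assumption too.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class MixedDelimiterWordCount {

    // Hypothetical base mapper: splits each line on delimiters() and emits (word, 1).
    public abstract static class DelimitedMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        protected abstract String delimiters();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString(), delimiters());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // One mapper per file format; both emit the same (Text, IntWritable) pairs.
    public static class SpaceMapper extends DelimitedMapper {
        @Override protected String delimiters() { return " "; }
    }

    public static class UnderscoreMapper extends DelimitedMapper {
        @Override protected String delimiters() { return "_"; }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mixed delimiter word count");
        job.setJarByClass(MixedDelimiterWordCount.class);

        // Each input path gets its own InputFormat and Mapper.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, SpaceMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, UnderscoreMapper.class);

        // A single reducer sums the counts coming from both mappers.
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The only requirement is that all mappers emit the same key and value types, since they feed one reducer.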

If you are looking for code examples on Google, search for "Reduce-side join", which is where this method is typically used.

On the other hand, sometimes I find it easier to just use a hack. For example, if you have one set of files that is space delimited and another that is underscore delimited, load both with the same mapper and TextInputFormat, and tokenize each line on both possible delimiters. Count the number of tokens in each of the two result sets. In the word count example, pick the one with more tokens.
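The per-line logic of that hack could be sketched roughly like this in plain Java, without any Hadoop types. `DelimiterGuess`, `tokenize` and `bestTokens` are hypothetical names; inside a real mapper you would call something like `bestTokens(value.toString())` and emit each token with a count of 1. On a tie, this sketch arbitrarily keeps the space-delimited result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class DelimiterGuess {

    // Split a line on a single delimiter character, skipping empty tokens.
    static List<String> tokenize(String line, String delimiter) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delimiter);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Tokenize on both candidate delimiters and keep whichever yields more tokens.
    static List<String> bestTokens(String line) {
        List<String> bySpace = tokenize(line, " ");
        List<String> byUnderscore = tokenize(line, "_");
        return byUnderscore.size() > bySpace.size() ? byUnderscore : bySpace;
    }
}
```

Note the guess is made per line, so a space-delimited line that happens to contain several underscores inside its words could be misread; that is the price of the shortcut.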

This also works if both sets of files use the same delimiter but have a different number of standard columns. You can tokenize on the comma and then see how many tokens there are: if there are, say, 5 tokens, the record is from data set A; if there are 7, it is from data set B.
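As a rough illustration of the column-count check, again in plain Java with hypothetical names (`ColumnCountRouter`, `Source`, `classify`), and using the 5 and 7 column counts from the example above:

```java
class ColumnCountRouter {

    enum Source { DATASET_A, DATASET_B, UNKNOWN }

    // Decide which data set a comma-delimited record belongs to by counting its columns.
    static Source classify(String line) {
        // The -1 limit keeps trailing empty columns, so "a,b,c,d," still counts as 5.
        int columns = line.split(",", -1).length;
        switch (columns) {
            case 5:  return Source.DATASET_A;
            case 7:  return Source.DATASET_B;
            default: return Source.UNKNOWN;
        }
    }
}
```

In a mapper you would branch on the result and parse the record accordingly; what to do with `UNKNOWN` records (skip them, count them, fail) is up to you.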


05-29 04:46