This article looks at how a Hadoop job can reduce across inputs in multiple formats. Hopefully it is a useful reference for anyone facing the same problem; read on below!

Problem Description

I have two files with different data formats in HDFS. What would the job setup look like if I needed to reduce across both data files?

e.g. imagine the common word count problem, where in one file you have space as the word delimiter and in the other file the underscore. In my approach I need different mappers for the various file formats, which then feed into a common reducer.

How can I do that? Or is there a better solution than mine?

Solution

Check out the MultipleInputs class, which solves this exact problem. It's pretty neat: you pass in the InputFormat and, optionally, the Mapper class for each input path.
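For concreteness, here is a minimal driver sketch. The MultipleInputs calls are the point; the two mapper classes (SpaceDelimitedMapper, UnderscoreDelimitedMapper) and the shared WordCountReducer are hypothetical names standing in for your own word-count classes, both mappers assumed to emit <Text, IntWritable> pairs:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiFormatWordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multi-format word count");
        job.setJarByClass(MultiFormatWordCount.class);

        // One mapper per input path; both feed the same reducer.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, SpaceDelimitedMapper.class);    // hypothetical
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, UnderscoreDelimitedMapper.class); // hypothetical

        job.setReducerClass(WordCountReducer.class); // hypothetical shared reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```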

If you are looking for code examples on Google, search for "Reduce-side join", which is where this method is typically used.


On the other hand, sometimes I find it easier to just use a hack. For example, if you have one set of files that is space-delimited and another set that is underscore-delimited, load both with the same mapper and TextInputFormat, and tokenize on both possible delimiters. Then count the number of tokens in the two resulting splits; in the word count example, pick the split with more tokens.
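A sketch of that hack as a single mapper, assuming the word-count setting above (the class name is made up; the "keep the split with more tokens" rule is the heuristic just described):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DualDelimiterMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Tokenize on both candidate delimiters...
        String[] bySpace = line.split(" ");
        String[] byUnderscore = line.split("_");
        // ...and keep whichever split produced more tokens.
        String[] tokens = bySpace.length >= byUnderscore.length
                ? bySpace : byUnderscore;
        for (String token : tokens) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```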

This also works if both files use the same delimiter but have a different number of standard columns. You can tokenize on commas and then count the tokens: if a line has, say, 5 tokens, it came from data set A; if it has 7 tokens, it came from data set B.
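Inside a map() method, that discrimination could look like the fragment below. The field counts (5 and 7) are just the example numbers from above, and tagging each record with its source and keying on the first column are illustrative assumptions, not part of the original answer:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ColumnCountMapper
        extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split on commas; the -1 limit keeps trailing empty columns,
        // so the count stays reliable even when fields are blank.
        String[] fields = value.toString().split(",", -1);
        if (fields.length == 5) {
            // Record came from data set A; tag it so the reducer can tell.
            context.write(new Text(fields[0]), new Text("A\t" + value));
        } else if (fields.length == 7) {
            // Record came from data set B.
            context.write(new Text(fields[0]), new Text("B\t" + value));
        }
    }
}
```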

That concludes this article on reducing across multiple input formats in Hadoop. We hope the answers above are helpful, and thank you for your continued support!
