Question
Is it possible to have multiple inputs with multiple different mappers in Hadoop MapReduce? Each mapper class works on a different set of inputs, but they all emit key-value pairs consumed by the same reducer. Note that I'm not talking about chaining mappers here; I mean running different mappers in parallel, not sequentially.
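Answer
This is called a join (more precisely, a reduce-side join, since records from the different inputs meet at the reducer).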
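Use the mappers and reducers in the mapred.* packages (older, but still supported). When this answer was written, the newer mapreduce.* packages allowed only one mapper input; later Hadoop releases added an equivalent org.apache.hadoop.mapreduce.lib.input.MultipleInputs class that takes a Job instead of a JobConf. With the mapred packages, you use the MultipleInputs class to define the join: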
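import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

// Each input path gets its own InputFormat and Mapper class.
MultipleInputs.addInputPath(jobConf,
                            new Path(countsSource),
                            SequenceFileInputFormat.class,
                            CountMapper.class);
MultipleInputs.addInputPath(jobConf,
                            new Path(dictionarySource),
                            SomeOtherInputFormat.class,
                            TranslateMapper.class);

jobConf.setJarByClass(ReportJob.class);
// A single reducer class consumes the output of both mappers.
jobConf.setReducerClass(WriteTextReducer.class);

// Both mappers must emit the same key and value types.
jobConf.setMapOutputKeyClass(Text.class);
jobConf.setMapOutputValueClass(WordInfo.class);

jobConf.setOutputKeyClass(Text.class);
jobConf.setOutputValueClass(Text.class);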
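To make the data flow concrete, here is a minimal plain-Java sketch of what such a job does conceptually. It does not use Hadoop at all. The class name ReduceSideJoinSketch, the Emitted type (a rough stand-in for something like WordInfo), the two line formats, and the "?" placeholder are all assumptions made for illustration, not part of the original answer. Two different "mappers" parse two different inputs, tag each value with its source, and a single "reducer" sees all values for a key together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free simulation of a reduce-side join.
final class ReduceSideJoinSketch {

    // One map-output record: a key plus a value tagged with its source.
    static final class Emitted {
        final String key;
        final String source;
        final String value;

        Emitted(String key, String source, String value) {
            this.key = key;
            this.source = source;
            this.value = value;
        }
    }

    // Stand-in for CountMapper; assumes lines of the form "word<TAB>count".
    static Emitted mapCount(String line) {
        String[] fields = line.split("\t", 2);
        return new Emitted(fields[0], "count", fields[1]);
    }

    // Stand-in for TranslateMapper; assumes lines of the form "word=translation".
    static Emitted mapTranslation(String line) {
        String[] fields = line.split("=", 2);
        return new Emitted(fields[0], "translation", fields[1]);
    }

    // Stand-in for the single reducer: combines both sources for one key.
    // A side with no record for the key is reported as "?".
    static String reduce(String key, List<Emitted> values) {
        String count = "?";
        String translation = "?";
        for (Emitted e : values) {
            if (e.source.equals("count")) {
                count = e.value;
            } else {
                translation = e.value;
            }
        }
        return key + "\t" + translation + "\t" + count;
    }

    // Runs both mappers, groups their output by key (the "shuffle"),
    // then calls the reducer once per key in sorted key order.
    static List<String> run(List<String> countLines, List<String> dictionaryLines) {
        Map<String, List<Emitted>> grouped = new TreeMap<>();
        for (String line : countLines) {
            Emitted e = mapCount(line);
            grouped.computeIfAbsent(e.key, k -> new ArrayList<>()).add(e);
        }
        for (String line : dictionaryLines) {
            Emitted e = mapTranslation(line);
            grouped.computeIfAbsent(e.key, k -> new ArrayList<>()).add(e);
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<Emitted>> entry : grouped.entrySet()) {
            output.add(reduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> output = run(
                Arrays.asList("cat\t3", "dog\t5"),
                Arrays.asList("cat=chat", "dog=chien"));
        for (String line : output) {
            System.out.println(line);
        }
    }
}
```

The source tag is the important design point: because both mappers must emit the same value type, the reducer needs some way to tell which input a value came from, and carrying a tag inside the value is one common way to do that.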