I need to split my Map Reduce jar file into two jobs in order to get two different output files, one from the reducer of each job. I mean that the first job has to produce an output file that will be the input for the second job in the chain.
I read something about ChainMapper and ChainReducer in Hadoop version 0.20 (currently I am using 0.18): could those be a good fit for my needs?
Can anybody suggest some links to examples that use those methods? Or is there perhaps another way to achieve this?
Thank you,
Luca
There are many ways you can do it.
Cascading jobs
Create the JobConf object "job1" for the first job and set all the parameters, with "input" as the input directory and "temp" as the output directory. Execute this job:
JobClient.runJob(job1)
Immediately after it, create the JobConf object "job2" for the second job and set all the parameters, with "temp" as the input directory and "output" as the output directory. Execute this job:
JobClient.runJob(job2)
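Concretely, a minimal sketch of this approach with the old mapred API might look as follows; the stock IdentityMapper and IdentityReducer are used only as stand-ins for your real mapper and reducer classes, and the path names just mirror the ones above:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class CascadedJobs {
  public static void main(String[] args) throws Exception {
    // First job: reads "input", writes intermediate results to "temp"
    JobConf job1 = new JobConf(CascadedJobs.class);
    job1.setJobName("job1");
    job1.setMapperClass(IdentityMapper.class);    // stand-in for your mapper
    job1.setReducerClass(IdentityReducer.class);  // stand-in for your reducer
    job1.setOutputKeyClass(LongWritable.class);
    job1.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job1, new Path("input"));
    FileOutputFormat.setOutputPath(job1, new Path("temp"));
    JobClient.runJob(job1);  // blocks until job1 has finished

    // Second job: consumes job1's output from "temp", writes to "output"
    JobConf job2 = new JobConf(CascadedJobs.class);
    job2.setJobName("job2");
    job2.setMapperClass(IdentityMapper.class);
    job2.setReducerClass(IdentityReducer.class);
    job2.setOutputKeyClass(LongWritable.class);
    job2.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job2, new Path("temp"));
    FileOutputFormat.setOutputPath(job2, new Path("output"));
    JobClient.runJob(job2);  // runs only after job1 has completed
  }
}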
Two JobConf objects
Create two JobConf objects and set all the parameters in them just like in (1), the cascading approach above, except that you don't call JobClient.runJob.
Then create two Job objects (org.apache.hadoop.mapred.jobcontrol.Job) with the jobconfs as parameters:
Job job1 = new Job(jobconf1);
Job job2 = new Job(jobconf2);
Using a JobControl object, you specify the job dependencies and then run the jobs:
JobControl jbcntrl = new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
job2.addDependingJob(job1);
jbcntrl.run();
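One caveat: in the old API, JobControl.run() keeps polling until stop() is called, so a common pattern is to drive it from its own thread and poll allFinished(). A minimal sketch, assuming jobconf1 and jobconf2 are fully configured JobConf objects as above:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class DependentJobs {
  public static void runChain(JobConf jobconf1, JobConf jobconf2)
      throws Exception {
    // Wrap each JobConf in a jobcontrol.Job so dependencies can be declared
    Job job1 = new Job(jobconf1);
    Job job2 = new Job(jobconf2);
    job2.addDependingJob(job1);  // job2 starts only after job1 succeeds

    JobControl jbcntrl = new JobControl("jbcntrl");
    jbcntrl.addJob(job1);
    jbcntrl.addJob(job2);

    // JobControl implements Runnable; run it in a background thread
    Thread runner = new Thread(jbcntrl);
    runner.start();
    while (!jbcntrl.allFinished()) {
      Thread.sleep(500);  // poll until both jobs are done
    }
    jbcntrl.stop();
  }
}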
ChainMapper and ChainReducer
If you need a structure somewhat like Map+ | Reduce | Map*, you can use the ChainMapper and ChainReducer classes that come with Hadoop version 0.19 and onwards. Note that in this case, you can use only one reducer but any number of mappers before or after it.
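A minimal sketch of such a chain with the old mapred API (again using IdentityMapper and IdentityReducer purely as stand-ins for real implementations) could look like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ChainExample.class);
    conf.setJobName("chain");
    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

    // Map+ : two mappers run back to back within the map phase
    ChainMapper.addMapper(conf, IdentityMapper.class,
        LongWritable.class, Text.class, LongWritable.class, Text.class,
        true, new JobConf(false));
    ChainMapper.addMapper(conf, IdentityMapper.class,
        LongWritable.class, Text.class, LongWritable.class, Text.class,
        true, new JobConf(false));

    // Reduce : exactly one reducer is allowed in the chain
    ChainReducer.setReducer(conf, IdentityReducer.class,
        LongWritable.class, Text.class, LongWritable.class, Text.class,
        true, new JobConf(false));

    // Map* : any number of mappers may follow the reducer
    ChainReducer.addMapper(conf, IdentityMapper.class,
        LongWritable.class, Text.class, LongWritable.class, Text.class,
        true, new JobConf(false));

    JobClient.runJob(conf);  // the whole chain runs as a single MR job
  }
}

Note that the whole chain still executes as a single MapReduce job with a single output, so if you need two separate output files (one per reducer), the cascading or JobControl approaches above are the better fit.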