在 map 侧加入后,我在Reducer中获得的数据是
key------ book
values
6
eraser=>book 2
pen=>book 4
pencil=>book 5
我基本上想做的是
eraser=>book = 2/6
pen=>book = 4/6
pencil=>book = 5/6
我最初做的像
public void reduce(Text key,Iterable<Text> values , Context context) throws IOException, InterruptedException{
System.out.println("key------ "+key);
System.out.println("Values");
for(Text value : values){
System.out.println("\t"+value.toString());
String v = value.toString();
double BsupportCnt = 0;
double UsupportCnt = 0;
double res = 0;
if(!v.contains("=>")){
BsupportCnt = Double.parseDouble(v);
}
else{
String parts[] = v.split(" ");
UsupportCnt = Double.parseDouble(parts[1]);
}
// calculate here
res = UsupportCnt/BsupportCnt;
}
如果传入数据如上所述,则可以正常工作
但是如果从映射器传入的数据是
key------ book
values
eraser=>book 2
pen=>book 4
pencil=>book 5
6
这行不通
否则,我需要将所有
=>
存储在一个列表中(如果传入数据是大数据,则该列表可能会占用堆空间),一旦我得到一个数字,就应该进行计算。更新
就像Vefthym要求对值进行二次排序之前,它到达 reducer 。
我用
htuple
来做同样的事情。我推荐this link
在mapper1中发出
eraser=>book 2
作为值所以
public class AprioriItemMapper1 extends Mapper<Text, Text, Text, Tuple>{
public void map(Text key,Text value,Context context) throws IOException, InterruptedException{
//Configurations and other stuffs
//allWords is an ArrayList
if(allWords.size()<=2)
{
Tuple outputKey = new Tuple();
String LHS1 = allWords.get(1);
String RHS1 = allWords.get(0)+"=>"+allWords.get(1)+" "+value.toString();
outputKey.set(TupleFields.ALPHA, RHS1);
context.write(new Text(LHS1), outputKey);
}
//other stuffs
Mapper2发出
numbers
作为值public class AprioriItemMapper2 extends Mapper<Text, Text, Text, Tuple>{
Text valEmit = new Text();
public void map(Text key,Text value,Context context) throws IOException, InterruptedException{
//Configuration and other stuffs
if(cnt != supCnt && cnt < supCnt){
System.out.println("emit");
Tuple outputKey = new Tuple();
outputKey.set(TupleFields.NUMBER, value);
System.out.println("v---"+value);
System.out.println("outputKey.toString()---"+outputKey.toString());
context.write(key, outputKey);
}
我只是试图打印键和值的Reducer
但这捕获了错误
Mapper 2:
line book
Support Count: 2
count--- 1
emit
v---6
outputKey.toString()---[0]='6,
14/08/07 13:54:19 INFO mapred.LocalJobRunner: Map task executor complete.
14/08/07 13:54:19 WARN mapred.LocalJobRunner: job_local626380383_0003
java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.htuple.Tuple
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.htuple.Tuple
at org.htuple.TupleMapReducePartitioner.getPartition(TupleMapReducePartitioner.java:28)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:601)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106)
at edu.am.bigdata.apriori.AprioriItemMapper1.map(AprioriItemMapper1.java:49)
at edu.am.bigdata.apriori.AprioriItemMapper1.map(AprioriItemMapper1.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.hadoop.mapreduce.lib.input.DelegatingMapper.run(DelegatingMapper.java:51)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
错误位于
context.write(new Text(LHS1), outputKey);
的AprioriItemMapper1.java:49
但是上面的打印细节来自Mapper 2任何更好的方式做到这一点
请提出建议。
最佳答案
我建议使用二次排序,这将保证第一个值(按字典顺序排序)是一个数字,假设不存在以数字开头的单词。
如果这行不通,那么,在您提到的可伸缩性限制的情况下,我会将化简器的值存储在HashMap<String,Double>
缓冲区中,其中键是“=>”的左侧部分,而值是其数字值。
您可以存储值,直到获得分母BsupportCnt
的值。然后,您可以发出具有正确分数的所有缓冲区内容,以及所有剩余值,当它们一一对应时,而无需再次使用该缓冲区(因为您现在知道分母)。像这样:
public void reduce(Text key,Iterable<Text> values , Context context) throws IOException, InterruptedException{
Map<String,Double> buffer = new HashMap<>();
double BsupportCnt = 0;
double UsupportCnt;
double res;
for(Text value : values){
String v = value.toString();
if(!v.contains("=>")){
BsupportCnt = Double.parseDouble(v);
} else {
String parts[] = v.split(" ");
UsupportCnt = Double.parseDouble(parts[1]);
if (BsupportCnt != 0) { //no need to add things to the buffer any more
res = UsupportCnt/BsupportCnt;
context.write(new Text(v), new DoubleWritable(res));
} else {
buffer.put(parts[0], UsupportCnt);
}
}
}
//now emit the buffer's contents
for (Map<String,Double>.Entry entry : buffer) {
context.write(new Text(entry.getKey()), new DoubleWritable(entry.getValue()/BsupportCnt));
}
}
通过仅将“=>”的左侧部分存储为HashMap的键,您可以获得更多空间,因为右侧部分始终是化简器的输入键。
关于java - 如何在hadoop中管理联接-MultipleInputPath,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/25160703/