java - How to manage joins in Hadoop - MultipleInputPath

After a map-side join, the data I get in the reducer is

key------ book
values
    6
    eraser=>book 2
    pen=>book 4
    pencil=>book 5

What I basically want to compute is

eraser=>book = 2/6
pen=>book = 4/6
pencil=>book = 5/6

My initial attempt looked like this:

public void reduce(Text key,Iterable<Text> values , Context context) throws IOException, InterruptedException{

        System.out.println("key------ "+key);
        System.out.println("Values");
        // declared outside the loop so the denominator survives from one value to the next
        double BsupportCnt = 0;
        double UsupportCnt = 0;
        double res = 0;
        for(Text value : values){
            System.out.println("\t"+value.toString());
            String v = value.toString();
            if(!v.contains("=>")){
                BsupportCnt = Double.parseDouble(v);
            }
            else{
                String parts[] = v.split(" ");
                UsupportCnt = Double.parseDouble(parts[1]);
            }
//          calculate here
            res = UsupportCnt/BsupportCnt;

        }
}

This works fine if the incoming data arrives in the order shown above.

But if the data coming from the mappers is

key------ book
values
    eraser=>book 2
    pen=>book 4
    pencil=>book 5
    6

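this does not work.
Otherwise, I would need to store all the => values in a list (which could exhaust heap space if the incoming data is large) and do the calculation once I get the number.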

Update
As Vefthym suggested, I tried a secondary sort on the values before they reach the reducer.
I used htuple to do this.
I referred to this link

In mapper1 I emit eraser=>book 2 as the value,
so

public class AprioriItemMapper1 extends Mapper<Text, Text, Text, Tuple>{
    public void map(Text key,Text value,Context context) throws IOException, InterruptedException{
        //Configurations and other stuffs
        //allWords is an ArrayList
        if(allWords.size()<=2)
        {
            Tuple outputKey = new Tuple();
            String LHS1 = allWords.get(1);
            String RHS1 = allWords.get(0)+"=>"+allWords.get(1)+" "+value.toString();
            outputKey.set(TupleFields.ALPHA, RHS1);
            context.write(new Text(LHS1), outputKey);
        }
        //other stuff

Mapper2 emits the numbers as values

public class AprioriItemMapper2 extends Mapper<Text, Text, Text, Tuple>{
    Text valEmit = new Text();
    public void map(Text key,Text value,Context context) throws IOException, InterruptedException{
        //Configuration and other stuffs
        if(cnt != supCnt && cnt < supCnt){
            System.out.println("emit");
            Tuple outputKey = new Tuple();
            outputKey.set(TupleFields.NUMBER, value);

            System.out.println("v---"+value);
            System.out.println("outputKey.toString()---"+outputKey.toString());
            context.write(key, outputKey);
        }

In the reducer I am just trying to print the keys and values,

but this throws an error

Mapper 2:
line book
Support Count: 2
count--- 1
emit
v---6
outputKey.toString()---[0]='6,
14/08/07 13:54:19 INFO mapred.LocalJobRunner: Map task executor complete.
14/08/07 13:54:19 WARN mapred.LocalJobRunner: job_local626380383_0003
java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.htuple.Tuple
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.htuple.Tuple
    at org.htuple.TupleMapReducePartitioner.getPartition(TupleMapReducePartitioner.java:28)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:601)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106)
    at edu.am.bigdata.apriori.AprioriItemMapper1.map(AprioriItemMapper1.java:49)
    at edu.am.bigdata.apriori.AprioriItemMapper1.map(AprioriItemMapper1.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
    at org.apache.hadoop.mapreduce.lib.input.DelegatingMapper.run(DelegatingMapper.java:51)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)

The error is raised at context.write(new Text(LHS1), outputKey); in AprioriItemMapper1.java:49, but the printed details above come from Mapper 2.

Is there a better way to do this?
Please advise.

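Best answer

I suggest using a secondary sort, which guarantees that the first value (in lexicographic order) is the number, assuming that no word starts with a digit.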
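To see why a sorted value list removes the need for any buffering, the idea can be sketched outside Hadoop. The following is a plain-Java simulation, not a real secondary sort: the class name SecondarySortSketch and the method confidences are hypothetical, and the in-memory sort only stands in for the ordering that the framework would apply to a composite key. It relies on the same assumption as above, that no item name starts with a digit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the effect of a secondary sort; no Hadoop classes are used.
class SecondarySortSketch {

    // Sorts the values lexicographically, then divides every rule count by the
    // first value, which is the bare support count under the stated assumption.
    static Map<String, Double> confidences(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        Collections.sort(sorted); // digits sort before letters, so "6" comes first

        double denominator = Double.parseDouble(sorted.get(0));
        Map<String, Double> result = new LinkedHashMap<>();
        for (String v : sorted.subList(1, sorted.size())) {
            String[] parts = v.split(" "); // e.g. ["pen=>book", "4"]
            result.put(parts[0], Double.parseDouble(parts[1]) / denominator);
        }
        return result;
    }

    public static void main(String[] args) {
        // The problematic arrival order from the question: the count comes last.
        List<String> values = Arrays.asList(
                "eraser=>book 2", "pen=>book 4", "pencil=>book 5", "6");
        System.out.println(confidences(values));
    }
}
```

In a real job the values would not be sorted in reducer memory; the sort would happen during the shuffle, so the reducer could stream through the values in a single pass.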

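If that does not work, then, given the scalability limitation you mentioned, I would store the reducer's values in a HashMap<String,Double> buffer, where the key is the left-hand side of "=>" and the value is its numeric count.
You only need to buffer values until you get the denominator BsupportCnt. After that, you can emit all the buffered contents with the correct fraction, and emit every remaining value as it arrives, one by one, without using the buffer again (since you now know the denominator). Something like this: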

// needs java.util.HashMap, java.util.Map and org.apache.hadoop.io.DoubleWritable
public void reduce(Text key,Iterable<Text> values , Context context) throws IOException, InterruptedException{
    // rules seen before the denominator; the key is the left-hand side of "=>"
    Map<String,Double> buffer = new HashMap<>();
    double BsupportCnt = 0;
    double UsupportCnt;
    double res;
    for(Text value : values){
        String v = value.toString();

        if(!v.contains("=>")){
            BsupportCnt = Double.parseDouble(v);
        } else {
            String parts[] = v.split(" ");
            UsupportCnt = Double.parseDouble(parts[1]);

            if (BsupportCnt != 0) { //no need to add things to the buffer any more
               res = UsupportCnt/BsupportCnt;
               // emit the rule without its count, same format as the buffered rules
               context.write(new Text(parts[0]), new DoubleWritable(res));
            } else {
               // keep only the left-hand side; the right-hand side is the reducer key
               buffer.put(parts[0].split("=>")[0], UsupportCnt);
            }
        }

    }


    //now emit the buffer's contents, rebuilding each rule from the reducer key
    for (Map.Entry<String,Double> entry : buffer.entrySet()) {
        context.write(new Text(entry.getKey()+"=>"+key), new DoubleWritable(entry.getValue()/BsupportCnt));
    }
}

By storing only the left-hand side of "=>" as the HashMap key, you save some space, since the right-hand side is always the reducer's input key.
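As a quick check that the buffering logic gives the same result whatever the arrival order, it can be exercised outside Hadoop. This is only a sketch under stated assumptions: BufferedRatioSketch and its reduce method are hypothetical names, plain strings stand in for Text values, and a returned map stands in for the calls to context.write.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the reducer; no Hadoop classes are used.
class BufferedRatioSketch {

    // Returns rule -> count/denominator, buffering only the rules that
    // arrive before the bare support count.
    static Map<String, Double> reduce(String key, List<String> values) {
        Map<String, Double> out = new TreeMap<>();
        Map<String, Double> buffer = new HashMap<>(); // left-hand side -> count
        double denominator = 0;

        for (String v : values) {
            if (!v.contains("=>")) {
                denominator = Double.parseDouble(v);
            } else {
                String[] parts = v.split(" ");
                double count = Double.parseDouble(parts[1]);
                if (denominator != 0) {
                    out.put(parts[0], count / denominator); // emit immediately
                } else {
                    buffer.put(parts[0].split("=>")[0], count); // wait for the denominator
                }
            }
        }

        // Flush the buffered rules, rebuilding the right-hand side from the key.
        for (Map.Entry<String, Double> e : buffer.entrySet()) {
            out.put(e.getKey() + "=>" + key, e.getValue() / denominator);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> countFirst = Arrays.asList("6", "eraser=>book 2", "pen=>book 4", "pencil=>book 5");
        List<String> countLast = Arrays.asList("eraser=>book 2", "pen=>book 4", "pencil=>book 5", "6");
        System.out.println(reduce("book", countFirst).equals(reduce("book", countLast)));
    }
}
```

The buffer only grows while the denominator is still unknown, so the worst case (count arriving last) holds every rule for that key in memory, and the best case (count arriving first) holds none.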

Regarding java - How to manage joins in Hadoop - MultipleInputPath, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/25160703/