I am a new Pig user.

I have an existing schema that I want to reshape. My source data has 6 columns and looks like this:

Name        Type    Date        Region    Op    Value
-----------------------------------------------------
john        ab      20130106    D         X     20
john        ab      20130106    D         C     19
john        ab      20130106    D         T     8
john        ab      20130106    E         C     854
john        ab      20130106    E         T     67
john        ab      20130106    E         X     98

And so on. Each Op value is always one of C, T, or X.

Essentially, I want to split the data into 7 columns, like this:
Name        Type    Date        Region    OpX    OpC   OpT
----------------------------------------------------------
john        ab      20130106    D         20     19    8
john        ab      20130106    E         98     854   67

Basically, the Op column is split into 3 columns, one per Op value, and each of those columns should hold the matching value from the Value column.

How can I do this in Pig?

Best Answer

One way to achieve the desired result:

IN = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
       date:int, region:chararray, op:chararray, value:int);
A = order IN by op asc;  -- sort so the values arrive in C, T, X order within each group
B = group A by (name, type, date, region);
C = foreach B {
  -- join the grouped values into "C,T,X" and split them back into a 3-field tuple
  bs = STRSPLIT(BagToString(A.value, ','),',',3);
  generate flatten(group) as (name, type, date, region),
    bs.$2 as OpX:chararray, bs.$0 as OpC:chararray, bs.$1 as OpT:chararray;
}

describe C;
C: {name: chararray,type: chararray,date: int,region: chararray,OpX:
    chararray,OpC: chararray,OpT: chararray}

dump C;
(john,ab,20130106,D,20,19,8)
(john,ab,20130106,E,98,854,67)
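
STRSPLIT yields chararray fields, so the Op columns of C come out as chararrays. If they are needed as ints, a follow-up foreach can cast them before storing; a minimal sketch, assuming the relation C from above (the output path is hypothetical):

D = foreach C generate name, type, date, region,
      (int)OpX as OpX, (int)OpC as OpC, (int)OpT as OpT;
store D into 'pivoted_output' using PigStorage(',');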

Update:

If you want to skip the order by (which adds an extra reduce phase to the computation), you can instead prefix each value with its op inside the tuple v, and then use a custom UDF to rearrange the tuple fields into the desired OpX, OpC, OpT order:
register 'myjar.jar';
A = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
      date:int, region:chararray, op:chararray, value:int);
B = group A by (name, type, date, region);
C = foreach B {
  -- prefix each value with its op letter, e.g. "X98", so the op is carried along
  v = foreach A generate CONCAT(op, (chararray)value);
  bs = STRSPLIT(BagToString(v, ','),',',3);
  -- the UDF sorts the prefixed values and returns them in X, C, T order, prefix stripped
  generate flatten(group) as (name, type, date, region),
    flatten(TupleArrange(bs)) as (OpX:chararray, OpC:chararray, OpT:chararray);
}
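
Note that calling TupleArrange by its bare class name only works if the class sits in the default package. If the UDF is compiled into a package, it has to be referenced by its fully-qualified name, typically via define; a minimal sketch (the package name com.example.pig is hypothetical):

register 'myjar.jar';
define TupleArrange com.example.pig.TupleArrange();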

where TupleArrange in myjar.jar looks like this:
import java.io.IOException;
import java.util.Arrays;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class TupleArrange extends EvalFunc<Tuple> {

    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        try {
            Tuple result = tupleFactory.newTuple(3);
            Tuple inputTuple = (Tuple) input.get(0);
            // the three input fields look like "C854", "T67", "X98" (op letter + value)
            String[] tupleArr = new String[] {
                    (String) inputTuple.get(0),
                    (String) inputTuple.get(1),
                    (String) inputTuple.get(2)
            };
            Arrays.sort(tupleArr); // ascending: C.., T.., X..
            // emit in X, C, T order and strip the leading op letter
            result.set(0, tupleArr[2].substring(1));
            result.set(1, tupleArr[0].substring(1));
            result.set(2, tupleArr[1].substring(1));
            return result;
        }
        catch (Exception e) {
            throw new RuntimeException("TupleArrange error", e);
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        return input;
    }
}
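
Returning the input schema from outputSchema works here, since the script re-aliases the fields anyway, but the output can also be described explicitly as three chararray fields. A minimal sketch of that variant, assuming additional imports of org.apache.pig.data.DataType and org.apache.pig.impl.logicalLayer.FrontendException:

    @Override
    public Schema outputSchema(Schema input) {
        try {
            // describe the returned tuple as three chararray fields
            Schema tupleSchema = new Schema();
            tupleSchema.add(new Schema.FieldSchema("OpX", DataType.CHARARRAY));
            tupleSchema.add(new Schema.FieldSchema("OpC", DataType.CHARARRAY));
            tupleSchema.add(new Schema.FieldSchema("OpT", DataType.CHARARRAY));
            return new Schema(new Schema.FieldSchema("arranged", tupleSchema, DataType.TUPLE));
        } catch (FrontendException e) {
            throw new RuntimeException(e);
        }
    }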

A similar question about hadoop - Pig: changing the schema to the desired type can be found on Stack Overflow: https://stackoverflow.com/questions/15324747/
