I am new to Pig.
I have an existing schema that I want to modify. My source data has 6 columns, as follows:
Name  Type  Date      Region  Op  Value
-----------------------------------------------------
john  ab    20130106  D       X   20
john  ab    20130106  D       C   19
john  ab    20130106  D       T   8
john  ab    20130106  E       C   854
john  ab    20130106  E       T   67
john  ab    20130106  E       X   98
And so on. Each Op value is always C, T, or X. I basically want to split the data into 7 columns, like this:
Name  Type  Date      Region  OpX  OpC  OpT
----------------------------------------------------------
john  ab    20130106  D       20   19   8
john  ab    20130106  E       98   854  67
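In other words, split the Op column into 3 columns, one per Op value. Each of these columns should contain the corresponding value from the Value column. How can I do this in Pig?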
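Best answer
One way to achieve the desired result, assuming every (name, type, date, region) group contains exactly one row for each of the three Op values: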
-- Load the comma-separated source rows.
raw = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
      date:int, region:chararray, op:chararray, value:int);
-- Sort by op so that each group's values arrive in C, T, X order.
A = order raw by op asc;
B = group A by (name, type, date, region);
C = foreach B {
    -- Join the group's values into one string, then split it into a 3-field tuple.
    bs = STRSPLIT(BagToString(A.value, ','), ',', 3);
    generate flatten(group) as (name, type, date, region),
             bs.$2 as OpX:chararray, bs.$0 as OpC:chararray, bs.$1 as OpT:chararray;
}
describe C;
C: {name: chararray,type: chararray,date: int,region: chararray,OpX: chararray,OpC: chararray,OpT: chararray}
dump C;
(john,ab,20130106,D,20,19,8)
(john,ab,20130106,E,98,854,67)
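The script above is a pivot: rows are grouped by (name, type, date, region), and each group's values are placed into fixed OpX, OpC, OpT positions. To check the expected output without a Pig runtime, the same logic can be sketched in plain Java. This is only an illustration, not part of the answer: the class PivotSketch and its pivot method are hypothetical names, and rows are assumed to be already split into six string fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Pig-free illustration of the pivot performed by the script.
final class PivotSketch {
    // Each input row is assumed to be {name, type, date, region, op, value}.
    static List<String> pivot(List<String[]> rows) {
        // Group key -> (op -> value); LinkedHashMap keeps first-seen group order.
        Map<String, Map<String, String>> groups = new LinkedHashMap<>();
        for (String[] r : rows) {
            String key = String.join(",", r[0], r[1], r[2], r[3]);
            groups.computeIfAbsent(key, k -> new LinkedHashMap<>()).put(r[4], r[5]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : groups.entrySet()) {
            Map<String, String> byOp = e.getValue();
            // Emit values in the fixed order X, C, T; a missing op shows up as "null".
            out.add(e.getKey() + "," + byOp.get("X") + "," + byOp.get("C") + "," + byOp.get("T"));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"john", "ab", "20130106", "D", "X", "20"});
        rows.add(new String[] {"john", "ab", "20130106", "D", "C", "19"});
        rows.add(new String[] {"john", "ab", "20130106", "D", "T", "8"});
        rows.add(new String[] {"john", "ab", "20130106", "E", "C", "854"});
        rows.add(new String[] {"john", "ab", "20130106", "E", "T", "67"});
        rows.add(new String[] {"john", "ab", "20130106", "E", "X", "98"});
        // Prints:
        // john,ab,20130106,D,20,19,8
        // john,ab,20130106,E,98,854,67
        pivot(rows).forEach(System.out::println);
    }
}
```

Unlike the positional split in the Pig script, this sketch looks each value up by its op, so a group with a missing op yields "null" in that column instead of shifting the other values.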
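Update:
If you want to skip the order by, which adds an extra reduce phase to the computation, you can prefix each value with its corresponding op (v in the script below). Then use a custom UDF to sort the tuple fields into the desired OpX, OpC, OpT order:
register 'myjar.jar';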
A = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
    date:int, region:chararray, op:chararray, value:int);
B = group A by (name, type, date, region);
C = foreach B {
    -- Prefix each value with its op, e.g. X20, C19, T8.
    v = foreach A generate CONCAT(op, (chararray)value);
    bs = STRSPLIT(BagToString(v, ','), ',', 3);
    -- TupleArrange sorts the prefixed fields and strips the prefix.
    generate flatten(group) as (name, type, date, region),
             flatten(TupleArrange(bs)) as (OpX:chararray, OpC:chararray, OpT:chararray);
}
where TupleArrange in myjar.jar looks something like this:
import java.io.IOException;
import java.util.Arrays;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class TupleArrange extends EvalFunc<Tuple> {
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        try {
            Tuple result = tupleFactory.newTuple(3);
            // The first argument is the tuple of op-prefixed values, e.g. (X20,C19,T8).
            Tuple inputTuple = (Tuple) input.get(0);
            String[] tupleArr = new String[] {
                (String) inputTuple.get(0),
                (String) inputTuple.get(1),
                (String) inputTuple.get(2)
            };
            Arrays.sort(tupleArr); // ascending: C..., T..., X...
            // Strip the one-character op prefix and emit in OpX, OpC, OpT order.
            result.set(0, tupleArr[2].substring(1));
            result.set(1, tupleArr[0].substring(1));
            result.set(2, tupleArr[1].substring(1));
            return result;
        }
        catch (Exception e) {
            throw new RuntimeException("TupleArrange error", e);
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        return input;
    }
}
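The index mapping inside exec is easy to get wrong, so it can help to check it outside Pig. Below is a minimal sketch of the same sort-and-strip step on plain strings. ArrangeDemo and arrange are hypothetical names used only for this illustration, and the sketch assumes exactly three inputs, each starting with a one-character op prefix.

```java
import java.util.Arrays;

// Hypothetical, Pig-free illustration of the reordering done in TupleArrange.exec.
final class ArrangeDemo {
    // Takes three op-prefixed values in any order and returns {OpX, OpC, OpT}.
    static String[] arrange(String[] prefixed) {
        String[] sorted = prefixed.clone();
        Arrays.sort(sorted); // ascending by prefix: C..., T..., X...
        return new String[] {
            sorted[2].substring(1), // X value
            sorted[0].substring(1), // C value
            sorted[1].substring(1)  // T value
        };
    }

    public static void main(String[] args) {
        String[] out = arrange(new String[] {"X20", "C19", "T8"});
        System.out.println(String.join(",", out)); // prints 20,19,8
    }
}
```

Because the sort is lexicographic on the whole string, the result depends only on the leading op character as long as the three prefixes are distinct. A group with a missing or repeated op would put values in the wrong columns.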
Regarding "hadoop - Pig changing schema to required type", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/15324747/