本文介绍了Apache的猪:在另一个过滤一个元组?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想通过拆分出两个元(或任何它被称为猪)来运行一个猪脚本,以关闭的标准在 COL2 和操作<$ C后$ C> COL2 ,进入另一列,比较两个操纵元组,做一个额外的排除。

 注册/home/user1/piggybank.jar;登录= LOAD'../user2/hadoop_file.txt'AS(COL1,COL2);--log =限制日志1000000;
isnt_filtered =过滤日志BY(不COL2 =='一些价值');
isnt_generated = FOREACH isnt_filtered GENERATE COL2,COL1,随机的()* 1000000随机,com.some.valueManipulation AS isnt_manipulated(COL1);is_filtered =过滤日志BY(COL2 =='一些价值');
is_generated = FOREACH is_filtered GENERATE com.some.calculation(COL1)AS is_manipulated;
is_distinct = DISTINCT is_generated;

分割和操作是容易的部分。这是它变得复杂。 。

  merge_filtered = FOREACH is_generated {FILTER isnt_generated BY(不isnt_manipulated == is_generated.is_manipulated)};

如果我能想出​​这条线(S),其余将下降到位。

  merge_ordered = ORDER BY merge_filtered随机,COL2,COL1;
merge_limited = LIMIT merge_ordered 400000;STORE merge_limited到'文件';

这里的I / O的例子:

  COL1 COL2操纵
这QWERTYW¯¯
是QWERTYř
一个QWERTYÿ
例如QWERTYê
QWERTY的Ť
例如QWERTY键盘Q
数据QWERTYW¯¯
心不是
Ë
ÿ
COL1 COL2
这QWERTY
是QWERTY
QWERTY的
例如QWERTY键盘
数据QWERTY


解决方案

我仍然不知道什么很需要,但我相信你可以复制你的输入和输出具有以下(未经测试):

 数据= LOAD'输入'AS(COL1:chararray,COL2:chararray);
排除= LOAD'排除'AS(不包括:chararray);M = FOREACH数据来生成COL1,COL2,YourUDF(COL2)AS操纵;
测试=协同组米乘操作,排除除外; - 在这里,您可以根据您是否要排除或包括选择的IsEmpty或NOT的IsEmpty
最终= FOREACH(FILTER测试BY的IsEmpty(排除))产生FLATTEN(米);

通过由分组键每个关系协同组,你组中的所有元组。如果元组从袋排除是空的,这意味着分组关键是不是在排除列表present,让你保持距离 M 使用该密钥。相反,如果分组键是present 排除,那个包就不会是空的,并从 M 使用该密钥将被过滤掉。

I want to run a Pig script by splitting out two tuples (or whatever it's called in Pig), based off of criteria in col2, and after manipulating col2, into another column, compare the two manipulated tuples and do an additional exclude.

REGISTER /home/user1/piggybank.jar;

log = LOAD '../user2/hadoop_file.txt' AS (col1, col2);

--log = LIMIT log 1000000;
isnt_filtered = FILTER log BY (NOT col2 == 'Some value');
isnt_generated = FOREACH isnt_filtered GENERATE col2, col1, RANDOM() * 1000000 AS random, com.some.valueManipulation(col1) AS isnt_manipulated;

is_filtered = FILTER log BY (col2 == 'Some value');
is_generated = FOREACH is_filtered GENERATE com.some.calculation(col1) AS is_manipulated;
is_distinct = DISTINCT is_generated;

Splitting and manipulating is the easy part. This is where it gets complicated. . .

merge_filtered = FOREACH is_generated {FILTER isnt_generated BY (NOT isnt_manipulated == is_generated.is_manipulated)};

If I can figure out this line(s), the rest would fall in place.

merge_ordered = ORDER merge_filtered BY random, col2, col1;
merge_limited = LIMIT merge_ordered 400000;

STORE merge_limited into 'file';

Here's an example of the I/O:

col1                col2            manipulated
This                qWerty          W
Is                  qweRty          R
An                  qwertY          Y
Example             qwErty          E
Of                  qwerTy          T
Example             Qwerty          Q
Data                qWerty          W


isnt
E
Y


col1                col2
This                qWerty
Is                  qweRty
Of                  qwerTy
Example             Qwerty
Data                qWerty
解决方案

I'm still not sure quite what you need, but I believe you can reproduce your input and output with the following (untested):

data = LOAD 'input' AS (col1:chararray, col2:chararray);
exclude = LOAD 'exclude' AS (excl:chararray);

m = FOREACH data GENERATE col1, col2, YourUDF(col2) AS manipulated;
test = COGROUP m BY manipulated, exclude BY excl;

-- Here you can choose IsEmpty or NOT IsEmpty according to whether you want to exclude or include
final = FOREACH (FILTER test BY IsEmpty(exclude)) GENERATE FLATTEN(m);

With the COGROUP, you group all tuples in each relation by the grouping key. If the bag of tuples from exclude is empty, it means that the grouping key was not present in the exclude list, so you keep tuples from m with that key. Conversely, if the grouping key was present in exclude, that bag will not be empty and the tuples from m with that key will be filtered out.

这篇关于Apache的猪:在另一个过滤一个元组?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-13 18:20