I want to split a relation (or whatever they are called in Pig) into two based on a condition on col2, then, after manipulating col2 into another column, compare the two processed relations and exclude the tuples of one from the other.
REGISTER /home/user1/piggybank.jar;
log = LOAD '../user2/hadoop_file.txt' AS (col1, col2);
--log = LIMIT log 1000000;
isnt_filtered = FILTER log BY (NOT col2 == 'Some value');
isnt_generated = FOREACH isnt_filtered GENERATE col2, col1, RANDOM() * 1000000 AS random, com.some.valueManipulation(col1) AS isnt_manipulated;
is_filtered = FILTER log BY (col2 == 'Some value');
is_generated = FOREACH is_filtered GENERATE com.some.calculation(col1) AS is_manipulated;
is_distinct = DISTINCT is_generated;
Splitting and manipulating are the easy part. Here is where it gets complicated . . .
merge_filtered = FOREACH is_generated {FILTER isnt_generated BY (NOT isnt_manipulated == is_generated.is_manipulated)};
If I can figure out this line, the rest will fall into place.
merge_ordered = ORDER merge_filtered BY random, col2, col1;
merge_limited = LIMIT merge_ordered 400000;
STORE merge_limited INTO 'file';
Here is a sample of the I/O:
col1 col2 manipulated
This qWerty W
Is qweRty R
An qwertY Y
Example qwErty E
Of qwerTy T
Example Qwerty Q
Data qWerty W
isnt
E
Y
col1 col2
This qWerty
Is qweRty
Of qwerTy
Example Qwerty
Data qWerty
Best answer
I'm still not sure exactly what you need, but I believe you can reproduce your input and output with the following (untested):
data = LOAD 'input' AS (col1:chararray, col2:chararray);
exclude = LOAD 'exclude' AS (excl:chararray);
m = FOREACH data GENERATE col1, col2, YourUDF(col2) AS manipulated;
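-- group both relations on the same key: each key carries one bag of tuples from m and one from exclude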
test = COGROUP m BY manipulated, exclude BY excl;
-- Here you can choose IsEmpty or NOT IsEmpty according to whether you want to exclude or include
final = FOREACH (FILTER test BY IsEmpty(exclude)) GENERATE FLATTEN(m);
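-- after FLATTEN, the resulting fields are m::col1, m::col2 and m::manipulated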
With COGROUP, all of the tuples from each relation are grouped together by the grouping key. If the bag of tuples coming from exclude is empty, the grouping key does not appear in the exclusion list, so the tuples from m with that key are kept. Conversely, if the grouping key does appear in exclude, that bag will not be empty, and the tuples from m with that key are filtered out.
A similar question, "hadoop - Apache Pig: Filter one tuple by another tuple?", can be found on Stack Overflow: https://stackoverflow.com/questions/13424947/
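For completeness, here is a minimal, untested sketch of how the COGROUP pattern might plug back into the original script; the relation names, UDFs, and file paths are the placeholders from the question:

REGISTER /home/user1/piggybank.jar;
log = LOAD '../user2/hadoop_file.txt' AS (col1:chararray, col2:chararray);

isnt_filtered  = FILTER log BY NOT (col2 == 'Some value');
isnt_generated = FOREACH isnt_filtered GENERATE col2, col1, RANDOM() * 1000000 AS random,
                                                com.some.valueManipulation(col1) AS isnt_manipulated;

is_filtered  = FILTER log BY col2 == 'Some value';
is_generated = FOREACH is_filtered GENERATE com.some.calculation(col1) AS is_manipulated;
is_distinct  = DISTINCT is_generated;

-- COGROUP acts as an anti-join here: keep only the isnt_generated tuples whose
-- manipulated value never appears in is_distinct
grouped        = COGROUP isnt_generated BY isnt_manipulated, is_distinct BY is_manipulated;
merge_filtered = FOREACH (FILTER grouped BY IsEmpty(is_distinct)) GENERATE FLATTEN(isnt_generated);

merge_ordered = ORDER merge_filtered BY random, col2, col1;
merge_limited = LIMIT merge_ordered 400000;
STORE merge_limited INTO 'file';

The same exclusion could also be expressed as a LEFT OUTER JOIN on the manipulated value followed by a filter on NULL, but COGROUP keeps both sides visible as bags, which makes the IsEmpty test read naturally.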