Input dataset:
field1,field2,field3,field4,field5
101,a1,a11,a111,a1111
102,a1,a11,a111,a1111
103,a1,a11,a111,a1111
201,b1,b11,b111,b1111
202,b1,b11,b111,b1111
The following query returns the distinct records in Pig:
details = load 'emp.csv' using PigStorage(',') AS (field1:chararray,field2:chararray,field3:chararray,field4:chararray,field5:chararray);
distinct_details = DISTINCT details;
I have a use case where I need the distinct records based on field2, field3, and field4 only.
Expected output:
101,a1,a11,a111,a1111
202,b1,b11,b111,b1111
Best answer
You can do this with a nested foreach:
details = load 'emp.csv' using PigStorage(',') AS (field1:chararray,field2:chararray,field3:chararray,field4:chararray,field5:chararray);
distinct_details = foreach (GROUP details by (field2, field3, field4)) {
    -- keep only the non-key fields from each group's bag
    temp_rel = details.(field1, field5);
    -- take a single record per group
    temp_limit = LIMIT temp_rel 1;
    generate FLATTEN(temp_limit) as (field1, field5), FLATTEN(group) as (field2, field3, field4);
}
DUMP distinct_details;
This gives the following output:
(103,a1111,a1,a11,a111)
(202,b1111,b1,b11,b111)
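Which record survives in each group is not fixed: Pig's LIMIT makes no guarantee about which tuples it returns unless the bag has been ordered first, which is why 103 rather than 101 shows up for the first group here. If a deterministic pick matters, one option is to add an ORDER inside the nested block before the LIMIT. The following is only a sketch under that assumption; the aliases `first_details` and `sorted_rel` are hypothetical names, and because field1 is loaded as chararray the ordering is lexicographic rather than numeric.

```pig
details = load 'emp.csv' using PigStorage(',') AS (field1:chararray,field2:chararray,field3:chararray,field4:chararray,field5:chararray);
first_details = foreach (GROUP details by (field2, field3, field4)) {
    temp_rel = details.(field1, field5);
    -- sort the bag so that LIMIT keeps the smallest field1 (string comparison)
    sorted_rel = ORDER temp_rel BY field1 ASC;
    temp_limit = LIMIT sorted_rel 1;
    generate FLATTEN(temp_limit) as (field1, field5), FLATTEN(group) as (field2, field3, field4);
}
DUMP first_details;
```

Switching ASC to DESC would keep the largest field1 of each group instead.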
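You can then use a further foreach on distinct_details to put the fields back in their original order.
Source: a similar question on Stack Overflow about distinct records based on multiple fields in Pig (hadoop): https://stackoverflow.com/questions/44641785/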