Problem description
I'm using Pig 0.10.0 and want to merge bags in a FOREACH. Say I have the following visitors alias:
    (a, b, {1, 2, 3, 4})
    (a, d, {1, 3, 6})
    (a, e, {7})
    (z, b, {1, 2, 3})

I want to group the tuples on the first field and merge the bags with set semantics to get the following tuples:

    ({1, 2, 3, 4, 6, 7}, a, 6)
    ({1, 2, 3}, z, 3)

The first field is the union of the bags with set semantics. The second field is the group field. The third field is the number of items in the bag.
I tried several variations around the following code (replaced SetUnion by Group/Distinct etc.) but always failed to achieve the wanted behavior:
    DEFINE SetUnion datafu.pig.bags.sets.SetUnion();
    grouped = GROUP visitors by (FirstField);
    merged = FOREACH grouped {
        VU = SetUnion(visitors.ThirdField);
        GENERATE VU as Vu, group as FirstField, COUNT(VU) as Cnt;
    }
    dump merged;

Can you explain where I'm wrong and how to implement the desired behavior?
Solution
I finally managed to achieve the wanted behavior. A self-contained example of my solution follows:
Data file:
    a b 1
    a b 2
    a b 3
    a b 4
    a d 1
    a b 3
    a b 6
    a e 7
    z b 1
    z b 2
    z b 3

Code:
    -- Prepare data
    in = LOAD 'data' USING PigStorage() AS (One:chararray, Two:chararray, Id:long);
    grp = GROUP in by (One, Two);
    cnt = FOREACH grp {
        ids = DISTINCT in.Id;
        GENERATE ids as Ids, group.One as One, group.Two as Two, COUNT(ids) as Count;
    }

    -- Interesting code follows
    grp2 = GROUP cnt by One;
    cnt2 = FOREACH grp2 {
        ids = FOREACH cnt.Ids generate FLATTEN($0);
        GENERATE ids as Ids, group as One, COUNT(ids) as Count;
    }

    describe cnt2;
    dump grp2;
    dump cnt2;

Describe:
    Cnt: {Ids: {(Id: long)},One: chararray,Two: chararray,Count: long}

grp2:

    (a,{({(1),(2),(3),(4),(6)},a,b,5),({(1)},a,d,1),({(7)},a,e,1)})
    (z,{({(1),(2),(3)},z,b,3)})

cnt2:

    ({(1),(2),(3),(4),(6),(1),(7)},a,7)
    ({(1),(2),(3)},z,3)

Since the code uses a FOREACH nested in a FOREACH, it requires Pig 0.10.0 or later.
I will leave the question unresolved for a few days, since a cleaner solution probably exists.
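One possible cleaner variant, offered as a sketch rather than a tested answer: the cnt2 output above still lists (1) twice for group a and reports a count of 7, whereas the question asks for set semantics and a count of 6. Flattening the per-(One, Two) bags back into rows, regrouping on One, and applying DISTINCT inside the nested block should remove the duplicates. The relation names flat, grp3 and merged are made up for illustration; cnt is assumed to be the relation built in the script above.

```pig
-- Sketch only: assumes the `cnt` relation defined in the script above.
-- `flat`, `grp3` and `merged` are illustrative names, not from the original post.

-- One row per (One, Id) pair: FLATTEN un-nests each Ids bag.
flat = FOREACH cnt GENERATE One, FLATTEN(Ids) AS Id;

-- Regroup on the first field only.
grp3 = GROUP flat BY One;

-- DISTINCT inside the nested block gives the merged bag set semantics,
-- so COUNT sees each id once per group.
merged = FOREACH grp3 {
    ids = DISTINCT flat.Id;
    GENERATE ids AS Ids, group AS One, COUNT(ids) AS Count;
}

dump merged;
```

For the sample data this should give six distinct ids and a count of 6 for a, and three ids with a count of 3 for z. It relies on a nested DISTINCT instead of a FOREACH nested in a FOREACH.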