Pig 10.0 - Grouping tuples and merging bags in a foreach


Problem description

I'm using Pig 10.0 and want to merge bags in a foreach. Let's say I have the following visitors alias:

(a, b, {1, 2, 3, 4}),
(a, d, {1, 3, 6}),
(a, e, {7}),
(z, b, {1, 2, 3})

I want to group the tuples on the first field and merge the bags with set semantics to get the following tuples:

({1, 2, 3, 4, 6, 7}, a, 6)
({1, 2, 3}, z, 3)

The first field is the union of the bags with set semantics. The second field of the tuple is the group field. The third field is the number of items in the bag.

I tried several variations of the following code (replacing SetUnion with Group/Distinct, etc.) but never achieved the desired behavior:

DEFINE SetUnion datafu.pig.bags.sets.SetUnion();

grouped = GROUP visitors by (FirstField);
merged = FOREACH grouped {
    VU = SetUnion(visitors.ThirdField);
    GENERATE
        VU as Vu,
        group as FirstField,
        COUNT(VU) as Cnt;
}
dump merged;

Can you explain where I'm wrong and how to implement the desired behavior?
Solution

I finally managed to achieve the desired behavior. A self-contained example of my solution follows.

Data file (fields are tab-separated, the default delimiter of PigStorage()):

a	b	1
a	b	2
a	b	3
a	b	4
a	d	1
a	b	3
a	b	6
a	e	7
z	b	1
z	b	2
z	b	3

Code:

-- Prepare data
in = LOAD 'data' USING PigStorage()
     AS (One:chararray, Two:chararray, Id:long);

grp = GROUP in by (One, Two);
cnt = FOREACH grp {
    ids = DISTINCT in.Id;
    GENERATE
        ids as Ids,
        group.One as One,
        group.Two as Two,
        COUNT(ids) as Count;
}

-- Interesting code follows
grp2 = GROUP cnt by One;
cnt2 = FOREACH grp2 {
    ids = FOREACH cnt.Ids generate FLATTEN($0);
    GENERATE
        ids as Ids,
        group as One,
        COUNT(ids) as Count;
}

describe cnt2;
dump grp2;
dump cnt2;

Describe:
Cnt: {Ids: {(Id: long)},One: chararray,Two: chararray,Count: long}
grp2:
(a,{({(1),(2),(3),(4),(6)},a,b,5),({(1)},a,d,1),({(7)},a,e,1)})
(z,{({(1),(2),(3)},z,b,3)})
cnt2:
({(1),(2),(3),(4),(6),(1),(7)},a,7)
({(1),(2),(3)},z,3)

Since the code uses a FOREACH nested in a FOREACH it requires Pig > 10.0.
I will leave the question unresolved for a few days, since a cleaner solution probably exists.
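
The cnt2 output above still lists (1) twice for group a and reports a count of 7, whereas the question asked for set semantics and a count of 6. One possible simpler variant, offered as a sketch rather than as the author's method, is to group the raw rows on the first field only and apply DISTINCT inside the nested block. The alias names rows, by_one and merged are illustrative, and the input is assumed to be the same tab-separated data file as above.

```pig
-- Sketch: group on the first field only, then deduplicate the ids per group.
-- Alias names (rows, by_one, merged) are illustrative, not from the original post.
rows = LOAD 'data' USING PigStorage()
       AS (One:chararray, Two:chararray, Id:long);

by_one = GROUP rows BY One;
merged = FOREACH by_one {
    ids = DISTINCT rows.Id;   -- set semantics: each id kept once per group
    GENERATE
        ids        AS Ids,
        group      AS One,
        COUNT(ids) AS Count;
};

DUMP merged;
```

With the sample data, group a should end up with the six distinct ids 1, 2, 3, 4, 6 and 7, and group z with 1, 2 and 3; the order of tuples inside a bag is not guaranteed. When the input already holds bags, as in the visitors alias of the question, the same idea would need a FLATTEN of the bag field before grouping.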
