I have grouped and aggregated my data, and it now looks like this:

user    value      count
----    --------  ------
Alice   third      5
Alice   first      11
Alice   second     10
Alice   fourth     2
...
Bob     second     20
Bob     third      18
Bob     first      21
Bob     fourth     8
...

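For each user (Alice and Bob), I want to retrieve their top n values (say, n = 2), ranked by "count" in descending order.
So the output I want looks like this: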
Alice first 11
Alice second 10
Bob first 21
Bob second 20

How can I do this?

Best answer

One way to do it is:

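-- Load the input with an explicit schema (PigStorage, the default loader, splits fields on tabs).
records = LOAD '/user/nubes/ncdc/micro-tab/top.txt' AS (user:chararray,value:chararray,counter:int);
grpd = GROUP records BY user;

-- For each user, sort that user's records by counter (descending) and keep the first 2.
top2 = foreach grpd {
        sorted = order records by counter desc;
        top    = limit sorted 2;
        generate group, flatten(top);
};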

The input is:
Alice   third   5
Alice   first   11
Alice   second  10
Alice   fourth  2
Bob     second  20
Bob     third   18
Bob     first   21
Bob     fourth  8

The output is:
(Alice,Alice,first,11)
(Alice,Alice,second,10)
(Bob,Bob,first,21)
(Bob,Bob,second,20)
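
Note that the user name appears twice in each tuple: once as the group key and once as the user field carried inside the flattened bag. If you want the three-field output from the question, a minimal sketch is to generate only the flattened bag. The relative path 'top.txt' and the alias top_per_user are placeholders I made up for illustration; the schema is the one used above.

```pig
-- Hypothetical input path; same schema as in the answer above.
records = LOAD 'top.txt' AS (user:chararray, value:chararray, counter:int);
grpd = GROUP records BY user;

top_per_user = FOREACH grpd {
    sorted = ORDER records BY counter DESC;
    top    = LIMIT sorted 2;
    -- Emit only the flattened bag: every tuple in it already carries
    -- the user field, so adding the group key would duplicate it.
    GENERATE FLATTEN(top);
};

DUMP top_per_user;
```

Each resulting tuple should then have the shape (user, value, counter). The order in which the users come out is not something I would rely on; add an ORDER BY on the final relation if that matters.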

This answer to "hadoop - Pig: get the top n values per group" comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/17656012/
