Selecting a random tuple from a bag

Question

Is it possible to (efficiently) select a random tuple from a bag in Pig? I can just take the first result of a bag (as it is unordered), but in my case I need a proper random selection. One (inefficient) solution is to count the number of tuples in the bag, pick a random number within that range, loop through the bag, and stop when the number of iterations matches the random number. Does anyone know of a faster or better way to do this?

Recommended answer

You could use RANDOM(), ORDER and LIMIT in a nested FOREACH statement to select the one element with the smallest random number:

inpt = load 'group.txt' as (id:int, c1:bytearray, c2:bytearray);
groups = group inpt by id;
randoms = foreach groups {
    rnds = foreach inpt generate *, RANDOM() as rnd; -- assign random number to each row in the bag
    ordered_rnds = order rnds by rnd;
    one_tuple = limit ordered_rnds 1; -- select tuple with the smallest random number
    generate group as id, one_tuple;
};

dump randoms;

Input:

1   a   r
1   a   t
1   b   r
1   b   4
1   e   4
1   h   4
1   k   t
2   k   k
2   j   j
3   a   r
3   e   l
3   j   l
4   a   r
4   b   t
4   b   g
4   h   b
4   j   d
5   h   k

Output:

(1,{(1,b,r,0.05172709255901231)})
(2,{(2,k,k,0.14351660053632986)})
(3,{(3,e,l,0.0854104195792681)})
(4,{(4,h,b,8.906013598960483E-4)})
(5,{(5,h,k,0.6219490873384448)})

If you run "dump randoms;" multiple times, you should get different results on each run.

Writing a UDF might give you better performance, as you would not need to do a secondary sort on the random number within the bag.
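The answer does not show such a UDF, so the following is only a sketch of the selection logic one might put inside it. It is written in plain Python, on the assumption that the function would be wrapped as a Jython UDF. The name pick_random_tuple and the rng parameter are hypothetical. It uses reservoir sampling with a reservoir of size one, which is a swapped-in technique rather than anything the answer prescribes: it picks uniformly in a single pass, with no sort and no up-front count of the bag.

```python
import random


def pick_random_tuple(bag, rng=random):
    """Return one tuple chosen uniformly at random from an iterable bag.

    Reservoir sampling with a reservoir of size 1: the i-th tuple seen
    (counting from 1) replaces the current choice with probability 1/i.
    Returns None for an empty bag.
    """
    chosen = None
    seen = 0
    for tup in bag:
        seen += 1
        # random() is in [0.0, 1.0), so the first tuple is always taken.
        if rng.random() < 1.0 / seen:
            chosen = tup
    return chosen


if __name__ == "__main__":
    # Rows for id 2 and id 3 from the sample input above.
    print(pick_random_tuple([(2, "k", "k"), (2, "j", "j")]))
    print(pick_random_tuple([(3, "a", "r"), (3, "e", "l"), (3, "j", "l")]))
```

To call this from Pig, the function would presumably need an output schema declaration and a REGISTER ... USING jython statement in the script; those details are not covered here and should be checked against the Pig UDF documentation for the version in use.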


08-23 14:36