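I want to write a Pig script that finds the number of unique user IDs that visited a particular web page.
Table definition: a = (userid:chararray, otherid:chararray, webpage:chararray)
This is what I wrote, but it does not work: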
a = (userid:chararray, otherid:chararray, webpage:chararray)
group_by_page = GROUP a by webpage ;
count_d = FOREACH group_by_page GENERATE group, count(distinct(a.userid));
Best answer
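You need to use DISTINCT inside a nested FOREACH; it is a relational operator, not a UDF, so it cannot be called like a function inside GENERATE. This should get you where you need to go: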
a = LOAD 'input' AS (userid:chararray, otherid:chararray, webpage:chararray);
group_by_page = GROUP a by webpage;
count_d = FOREACH group_by_page {
    uniq = DISTINCT a.userid;
    GENERATE group, COUNT(uniq);
};
See here for more information on nested FOREACH.
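To sanity-check what the nested FOREACH should return, the same logic can be mirrored outside Pig. The following is only a rough Python sketch, not Pig code: the sample rows and the function name `unique_visitors` are made up for illustration, and each step is annotated with the Pig statement it is meant to correspond to.

```python
from collections import defaultdict

# Hypothetical sample rows in the same (userid, otherid, webpage) layout.
rows = [
    ("u1", "x", "/home"),
    ("u1", "y", "/home"),   # repeat visit by u1, should be counted once
    ("u2", "z", "/home"),
    ("u1", "w", "/about"),
]


def unique_visitors(rows):
    """Return a dict mapping each webpage to its number of distinct userids."""
    users_by_page = defaultdict(set)            # roughly: GROUP a BY webpage
    for userid, _otherid, webpage in rows:
        users_by_page[webpage].add(userid)      # roughly: uniq = DISTINCT a.userid
    # roughly: GENERATE group, COUNT(uniq)
    return {page: len(users) for page, users in users_by_page.items()}


print(unique_visitors(rows))  # {'/home': 2, '/about': 1}
```

Running the Pig script on a small input like this and comparing the per-page counts is a quick way to confirm that duplicates are removed before counting.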
Regarding "hadoop - finding unique visitors on a web page", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21917548/