This article shows how to group by multiple columns in Pig and then retrieve the matching rows of the original dataset.
Problem description
2, cornflakes, Regular, General Mills, 12
3, cornflakes, Mixed Nuts, Post, 14
4, chocolate syrup, Regular, Hersheys, 5
5, chocolate syrup, No High Fructose, Hersheys, 8
6, chocolate syrup, Regular, Ghirardeli, 6
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7
Script
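data_grp = GROUP data BY (item, type);
-- Built-in function names are case sensitive in Pig, so COUNT must be upper case.
data_cnt = FOREACH data_grp GENERATE FLATTEN(group) AS (item, type), COUNT(data) AS total;
-- Keep the (item, type) groups with more than one row, matching the desired output below.
filter_data = FILTER data_cnt BY total > 1;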
I now need the original data with the filter applied. My desired output is:
4, chocolate syrup, Regular, Hersheys, 5
6, chocolate syrup, Regular, Ghirardeli, 6
Recommended answer
filter_data will give you (chocolate syrup, Regular). Join filter_data with the original dataset on item and type to get the desired result.
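-- Count the rows in each (item, type) group.
data_grp = GROUP data BY (item, type);
data_cnt = FOREACH data_grp GENERATE FLATTEN(group) AS (item, type), COUNT(data) AS total;
-- Keep only the groups that contain more than one row.
filter_data = FILTER data_cnt BY total > 1;
-- Join back to the original rows on (item, type).
o_data = JOIN data BY (item, type), filter_data BY ($0, $1);
-- Project the five original columns and drop the joined ones.
final_data = FOREACH o_data GENERATE $0..$4;
DUMP final_data;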
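A join is not the only way to get the original rows back. As a minimal sketch, assuming the input is a comma-separated file, the grouped relation can be filtered directly and its bag flattened, since each group's bag already holds the original rows. The file name `groceries.csv` and the column names `id`, `brand`, and `price` are hypothetical; only `item` and `type` come from the script above.

```pig
-- Hypothetical input path and schema; only item and type appear in the original script.
data = LOAD 'groceries.csv' USING PigStorage(',') AS (id, item, type, brand, price);

-- Each group carries a bag named data holding the original rows for that (item, type).
data_grp = GROUP data BY (item, type);

-- Keep only the groups whose bag has more than one row.
multi_grp = FILTER data_grp BY COUNT(data) > 1;

-- Flatten the bag to emit the original rows, with no join required.
final_data = FOREACH multi_grp GENERATE FLATTEN(data);
DUMP final_data;
```

For the sample rows shown, only the (chocolate syrup, Regular) group contains more than one row, so this should return rows 4 and 6. Row order in the output is not guaranteed.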