Problem description
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 125431369 2912715 1545018 807558
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 113876217 2811217 1470387 787174
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 100911567 2656678 1353655 682890
"Marvel Studios' Avengers: Infinity War Official Trailer" 89930713 2606665 53011 347982
"Marvel Studios' Avengers: Infinity War Official Trailer" 87450245 2584675 52176 341571
"Marvel Studios' Avengers: Infinity War Official Trailer" 84281319 2555414 51008 339708
"Marvel Studios' Avengers: Infinity War Official Trailer" 80360459 2513103 49170 335920
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 75969469 2251826 1127811 827755
"Marvel Studios' Avengers: Infinity War Official Trailer" 74789251 2444960 46172 330710
"Marvel Studios' Avengers: Infinity War Official Trailer" 66637636 2331359 41154 316185
"Marvel Studios' Avengers: Infinity War Official Trailer" 56367282 2157741 34078 303178
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 52611730 1891822 884963 702784
"To Our Daughter" 51243149 0 0 0
"To Our Daughter" 48635732 0 0 0
The data above has five columns: the first is "title", and the others are views, likes, dislikes, and comment_count.
How can I filter this data to remove the repeated rows? I want to drop rows that share the same "title" and keep only the row with the highest views.
Recommended answer
If you want to retain all fields of the record that has the maximum likes for each title, you would have to do something like this:
dataAll = LOAD 'path' USING PigStorage('\t') AS (title:chararray, views:long, likes:long, dislikes:long, comment_count:long);
-- Group the data by title so that all records belonging to a title fall into a bag in the same record
dataGrouped = GROUP dataAll BY title;
-- Using a nested FOREACH, order the contents of each bag by likes and pick the top record
dataDeduped = FOREACH dataGrouped {
    sortedByLikes = ORDER dataAll BY likes DESC;
    maxLikesRecord = LIMIT sortedByLikes 1;
    GENERATE FLATTEN(maxLikesRecord);
}
STORE dataDeduped INTO 'outputPath' USING PigStorage('\t');
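The question asks for the row with the highest views, while the script above picks the row with the highest likes. A minimal sketch of the views-based variant follows, assuming the same tab-separated layout and the same placeholder paths ('path' and 'outputPath'); the alias names sortedByViews and maxViewsRecord are only illustrative.

```pig
-- Assumption: input is tab-separated with the five columns described in the question.
-- 'path' and 'outputPath' are placeholders, as in the script above.
dataAll = LOAD 'path' USING PigStorage('\t') AS (title:chararray, views:long, likes:long, dislikes:long, comment_count:long);

-- One bag of records per distinct title
dataGrouped = GROUP dataAll BY title;

-- Within each bag, sort by views (highest first) and keep only the first record
dataDeduped = FOREACH dataGrouped {
    sortedByViews = ORDER dataAll BY views DESC;
    maxViewsRecord = LIMIT sortedByViews 1;
    GENERATE FLATTEN(maxViewsRecord);
}

STORE dataDeduped INTO 'outputPath' USING PigStorage('\t');
```

On the sample data this should leave one row per title, for example the 125431369-view row for the YouTube Rewind video. If two rows for a title tie on views, which one LIMIT keeps is not something to rely on.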
A nested FOREACH is very useful in situations like this. You can read more about it here: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html (search for "nested foreach" on that page).