hive的启动:
1、启动hadoop
2、开启 metastore 在开启 hiveserver2服务
nohup hive --service metastore >> log.out 2>&1 &
nohup hive --service hiveserver2 >> log.out 2>&1 &
查看进程是否起起来:
tandemac:bin tanzhengqiang$ jps -ml | grep Hive
数据结构
1、视频表
video id | 视频唯一id | 11位字符串 |
uploader | 视频上传者 | 上传视频的用户名String |
age | 视频年龄 | 视频在平台上的整数天 |
category | 视频类别 | 上传视频指定的视频分类 |
length | 视频长度 | 整形数字标识的视频长度 |
views | 观看次数 | 视频被浏览的次数 |
rate | 视频评分 | 满分5分 |
ratings | 流量 | 视频的流量,整型数字 |
conments | 评论数 | 一个视频的整数评论数 |
related ids | 相关视频id | 相关视频的id,最多20个 |
2、用户表
uploader | 上传者用户名 | string |
videos | 上传视频数 | int |
friends | 朋友数量 | int |
建表:
创建原始表:gulivideo_ori,gulivideo_user_ori
创建目标表:gulivideo_orc,gulivideo_user_orc
gulivideo_ori:
create table gulivideo_ori( videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int, relatedId array<string> ) row format delimited fields terminated by "\t" -- 字段与字段之间的数据按/t分割 collection items terminated by "&" -- 数组中的数据是按&分割 stored as textfile;
gulivideo_user_ori:
create table gulivideo_user_ori( uploader string, videos int, friends int ) row format delimited fields terminated by "\t" stored as textfile;
gulivideo_orc:
create table gulivideo_orc( videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int, relatedId array<string> ) clustered by(uploader) into 8 buckets -- 按照字段uploader分成8个桶 row format delimited fields terminated by "\t" collection items terminated by "&" stored as orc;
gulivideo_user_orc:
create table gulivideo_user_orc( uploader string, videos int, friends int ) row format delimited fields terminated by "\t" stored as orc;
导数:
gulivideo_ori:
load data inpath '/guliData/output/video/2008/0222' into table gulivideo_ori;
gulivideo_user_ori:
load data inpath "/guliData/input/user/2008/0903" into table gulivideo_user_ori;
gulivideo_orc:
insert into table gulivideo_orc select * from gulivideo_ori;
gulivideo_user_orc:
insert into table gulivideo_user_orc select * from gulivideo_user_ori;
统计分析
统计视频观看数Top10
思路:使用order by
按照 views 字段做一个全局排序
即可,同时我们设置只显示前10条。为了便于显示,我们显示的字段不包含每个视频对应的关联视频字段
。
最终代码:
select videoId, uploader, age, category, length, views, rate, ratings, comments from gulivideo_orc order by views desc limit 10;
统计视频类别热度Top10
思路:炸开数组【视频类别】字段,然后按照类别分组,最后按照热度(视频个数)排序。 1) 即统计每个类别有多少个视频,显示出包含视频最多的前10个类别。 2) 我们需要按照类别 group by 聚合,然后count组内的videoId个数即可。 3) 因为当前表结构为:一个视频对应一个或多个类别。所以如果要 group by 类别,需要先将类别进行列转行(展开),然后再进行count即可。 4) 最后按照热度排序,显示前10条。
最终代码: