hive的启动:

1、启动hadoop
2、开启 metastore 在开启 hiveserver2服务
nohup hive --service metastore >> log.out 2>&1 &
nohup hive --service hiveserver2 >> log.out 2>&1 &
查看进程是否起起来:
tandemac:bin tanzhengqiang$ jps -ml | grep Hive

数据结构

1、视频表

video id

视频唯一id

11位字符串

uploader

视频上传者

上传视频的用户名String

age

视频年龄

视频在平台上的整数天

category

视频类别

上传视频指定的视频分类

length

视频长度

整形数字标识的视频长度

views

观看次数

视频被浏览的次数

rate

视频评分

满分5分

ratings

流量

视频的流量,整型数字

conments

评论数

一个视频的整数评论数

related ids

相关视频id

相关视频的id,最多20个

2、用户表

uploader

上传者用户名

string

videos

上传视频数

int

friends

朋友数量

int

 建表:

创建原始表:gulivideo_ori,gulivideo_user_ori
创建目标表:gulivideo_orc,gulivideo_user_orc
gulivideo_ori:

create table gulivideo_ori(
  videoId string,
  uploader string,
  age int,
  category array<string>,
  length int,
  views int,
  rate float,
  ratings int,
  comments int,
  relatedId array<string>
)
row format delimited
fields terminated by "\t" -- 字段与字段之间的数据按/t分割
collection items terminated by "&" -- 数组中的数据是按&分割
stored as textfile;

gulivideo_user_ori:

create table gulivideo_user_ori(
  uploader string,
  videos int,
  friends int
)
row format delimited
fields terminated by "\t"
stored as textfile;

gulivideo_orc:

create table gulivideo_orc(
  videoId string,
  uploader string,
  age int,
  category array<string>,
  length int,
  views int,
  rate float,
  ratings int,
  comments int,
  relatedId array<string>
)
clustered by(uploader) into 8 buckets -- 按照字段uploader分成8个桶
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as orc;

gulivideo_user_orc:

create table gulivideo_user_orc(
  uploader string,
  videos int,
  friends int
)
row format delimited
fields terminated by "\t"
stored as orc;

导数:

gulivideo_ori:

load data inpath '/guliData/output/video/2008/0222' into table gulivideo_ori;

gulivideo_user_ori:

load data inpath "/guliData/input/user/2008/0903" into table gulivideo_user_ori;

gulivideo_orc:

insert into table gulivideo_orc select * from gulivideo_ori;

gulivideo_user_orc:

insert into table gulivideo_user_orc select * from gulivideo_user_ori;

统计分析

统计视频观看数Top10

思路:使用order by按照 views 字段做一个全局排序即可,同时我们设置只显示前10条。为了便于显示,我们显示的字段不包含每个视频对应的关联视频字段

最终代码:

select
  videoId,
  uploader,
  age,
  category,
  length,
  views,
  rate,
  ratings,
  comments
from
  gulivideo_orc
order by
  views desc
limit
  10;

统计视频类别热度Top10

思路:炸开数组【视频类别】字段,然后按照类别分组,最后按照热度(视频个数)排序。   1) 即统计每个类别有多少个视频,显示出包含视频最多的前10个类别。   2) 我们需要按照类别 group by 聚合,然后count组内的videoId个数即可。   3) 因为当前表结构为:一个视频对应一个或多个类别。所以如果要 group by 类别,需要先将类别进行列转行(展开),然后再进行count即可。   4) 最后按照热度排序,显示前10条。

最终代码:

02-10 04:22