上一篇博客中hive中加载的数据都是比较规整的(Hive的基本操作:https://blog.csdn.net/Chris_MZJ/article/details/83713882),字段与

字段之间都是分割好的,每一个字段都不是脏数据,并且每一个字段都是有意义的但是在真实场景中不见得这个尽人意。比如hive要读取以下格式的tomcat的运行日志:

192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-upper.png HTTP/1.1" 304 -

我们可以看到,日志中的数据是不规则的,不能凭某一个分隔符就能将数据分割成字段,为此,可以通过使用正则表达式来读取数据,创建一个logtbl表:

 CREATE TABLE logtbl (
    host STRING,
    identity STRING,
    t_user STRING,
    time STRING,
    request STRING,
    referer STRING,
    agent STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)"
  )
  STORED AS TEXTFILE;

input.regex" = "([^ ]) ([^ ]) ([^ ]) \[(.)\] "(.)" (-|[0-9]) (-|[0-9]*)"就是通过正则表达式来读取以上的不规则数据,当读取其他数据的时候,只需要将正则表达式替换即可。

org.apache.hadoop.hive.serde2.RegexSerDe是引用的jar包。

创建logerr文件:

[root@client ~]# vim logerr

192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-upper.png HTTP/1.1" 304 -

将logerr加载到logtbl中并查询:

hive> load data local inpath "/root/logerr" into table logtbl;
Loading data to table test.logtbl
Table test.logtbl stats: [numFiles=1, totalSize=1739]
OK
Time taken: 2.085 seconds
hive> select * from logtbl;
OK
logtbl.host	logtbl.identity	logtbl.t_user	logtbl.time	logtbl.request	logtbl.referer	logtbl.agent
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-upper.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-nav.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /asf-logo.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-button.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-middle.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.css HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /asf-logo.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-middle.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-button.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-nav.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-upper.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.css HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.css HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-button.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-upper.png HTTP/1.1	304	-
Time taken: 0.21 seconds, Fetched: 22 row(s)
hive>

通过正则表达式,读取到了logerr中的不规则数据。

例如,创建包含下面这条数据的文件:

[root@client ~]# vim err

192.168.57.4 - - 123

然后将err文件加载到上面已经创建好的logtbl表中:

hive> load data local inpath "/root/err" into table logtbl;
Loading data to table test.logtbl
Table test.logtbl stats: [numFiles=2, totalSize=1760]
OK
Time taken: 0.465 seconds
hive>

这一步没有问题,因为load就是将数据拷贝到工作目录区中,接下来查询表中的记录:

hive> select * from logtbl;
OK
logtbl.host	logtbl.identity	logtbl.t_user	logtbl.time	logtbl.request	logtbl.referer	logtbl.agent
NULL	NULL	NULL	NULL	NULL	NULL	NULL
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-upper.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-nav.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /asf-logo.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-button.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-middle.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.css HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /asf-logo.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-middle.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-button.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-nav.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-upper.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.css HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.css HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-button.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-upper.png HTTP/1.1	304	-
Time taken: 0.126 seconds, Fetched: 23 row(s)
hive>

可以看到第一条记录全是NULL,那是因为hive因为根据正则表达式的模板来读数据,读不懂192.168.57.4 - - 123这条数据,所以字段全是NULL。

由此可以得出:hive根据正则表达式读数据的时候,是读时检查数据格式对不对,而不是写时检查。

11-22 20:37