Problem description
I have some web server logs that I'd like to query with Hive. The directory structure, in HDFS, looks like this:
/data/access/web1/2014/09
/data/access/web1/2014/09/access-20140901.log
[... etc ...]
/data/access/web1/2014/10
/data/access/web1/2014/10/access-20141001.log
[... etc ...]
/data/access/web2/2014/09
/data/access/web2/2014/09/access-20140901.log
[... etc ...]
/data/access/web2/2014/10
/data/access/web2/2014/10/access-20141001.log
[... etc ...]
I'm able to create an external table:
CREATE EXTERNAL TABLE access(
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")
LOCATION '/data/access/';
... though Hive doesn't descend into the subfolders unless I run the following commands before running the Hive query:
set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
I've seen other posts set these properties at the table-level (e.g. Issue creating Hive External table using tblproperties):
TBLPROPERTIES ("hive.input.dir.recursive" = "TRUE",
"hive.mapred.supports.subdirectories" = "TRUE",
"hive.supports.subdirectories" = "TRUE",
"mapred.input.dir.recursive" = "TRUE");
Unfortunately, this didn't work for me: the table doesn't return any records when I query it. I understand it's possible to set these properties in hive-site.xml, but I'd rather not make any changes that might impact other users if I don't need to.
Q) Is there a way to create a table that descends into the subdirectories without using partitions, making site-wide changes, or running those 4 commands every time?
Recommended answer
Using Hive in HDInsight, I set the following properties before I create my external table in the Hive query and it works for me.
SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;
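As a rough sketch of how this fits together, the script below sets the two properties for the current session and then queries the access table from the question. The SELECT statement is only a hypothetical example and is not part of the original answer. SET statements issued this way are scoped to the session, so they do not modify hive-site.xml or affect other users.

```sql
-- Session-scoped settings: they apply only to the current Hive session.
SET hive.mapred.supports.subdirectories=true;
SET mapred.input.dir.recursive=true;

-- Hypothetical query against the access table defined in the question;
-- with the settings above, files under the web1/ and web2/ subdirectories
-- should be read as well.
SELECT status, COUNT(*) AS hits
FROM access
GROUP BY status;
```

To avoid retyping the SET lines, one option (an assumption that the Hive CLI is being used, not something the answer tested) is to save them in a file and pass it with hive -i, or to place them in $HOME/.hiverc, which the Hive CLI runs at startup for that user only.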