Creating an external table in Hive/Athena that combines two columns of a CSV file


Problem description

I have a CSV file with three columns: A, B, and C.

I would like to create an external table on top of it with two columns: A and (B + C).

Is it possible to do this during table creation itself, or should I create a view once the table exists?

Recommended answer

This is not possible with CSVSerDe or LazySimpleSerDe, but it is possible with RegexSerDe. Each column in the table definition should have a corresponding capturing group () in input.regex.

For example, if the file is comma separated, the table can be defined like this:

CREATE EXTERNAL TABLE mytable(
  colA string COMMENT '',
  colBC string COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- one capturing group per table column, in column order
  'input.regex' = '^"?(.*?)"?,(.*)$'
)
LOCATION ...

input.regex means:

^ - beginning-of-string anchor
"? - optional quote
(.*?) - capturing group for the 1st column: any character, any number of times, non-greedy
"? - optional quote
, - comma
(.*) - capturing group for the 2nd column (this captures everything after the first comma up to the end)
$ - end-of-string anchor
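To see what each capturing group picks up on its own, one option is regexp_extract, which takes a group index as its third argument. The sample row '"A",B,C' and the column aliases below are illustrative assumptions, not data from the question; this is a sketch for checking the pattern, not part of the table definition.

```sql
-- Sketch: inspect each capturing group separately on a made-up sample row.
-- Group 1 feeds the 1st table column, group 2 feeds the 2nd.
select regexp_extract('"A",B,C', '^"?(.*?)"?,(.*)$', 1) as colA,   -- A
       regexp_extract('"A",B,C', '^"?(.*?)"?,(.*)$', 2) as colBC;  -- B,C
```

One limitation worth keeping in mind: because the first group is non-greedy and stops at the first comma it can, a quoted first field that itself contains a comma would be split at that inner comma.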

You can write a regex that captures your data correctly. To debug the regex without creating a table, use regexp_replace:

select regexp_replace('"A",B,C', --data string example
                      '^"?(.*?)"?,(.*)$', --regex with 2 capturing groups for 2 columns
                     '$1 | $2'); --pipe delimited columns should be returned
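For comparison, the view-based alternative raised in the question could look roughly like the sketch below. The names mytable_raw and mytable_v and the LOCATION path are hypothetical placeholders, and the sketch assumes a plain comma-delimited file with no quoted fields, since the default delimited format does not strip quotes.

```sql
-- Sketch under stated assumptions: hypothetical names, placeholder path.
-- Step 1: expose the CSV as-is with three string columns.
CREATE EXTERNAL TABLE mytable_raw (
  colA string,
  colB string,
  colC string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/path/to/csv/';  -- placeholder, replace with the real location

-- Step 2: combine B and C in a view instead of in the SerDe.
CREATE VIEW mytable_v AS
SELECT colA,
       concat(colB, ',', colC) AS colBC  -- same "B,C" text the regex table would return
FROM mytable_raw;
```

If (B + C) is meant as a numeric sum rather than concatenation, the view expression could instead cast both columns to a numeric type and add them. That is something the RegexSerDe approach cannot do, because it only splits text.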
