Problem description
I have a CSV file with three columns: A, B, and C.
I would like to create an external table on top of it with two columns: A and (B + C).
Is it possible to do this during table creation itself, or should I create a view once the table is created?
Recommended answer
This is not possible with the CSV SerDe (OpenCSVSerde) or LazySimpleSerDe, but it is possible with RegexSerDe. Each column in the table definition should have a corresponding capturing group () in input.regex.
For example, if the file is comma-separated, the table can be defined like this:
CREATE EXTERNAL TABLE mytable(
  colA string COMMENT '',
  colBC string COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- one capturing group per table column
  'input.regex'= '^"?(.*?)"?,(.*)$'
)
LOCATION ...
The parts of input.regex mean:
- ^ : beginning-of-string anchor
- "? : optional quote
- (.*?) : capturing group for the 1st column (any character, any number of times, non-greedy)
- "? : optional quote
- , : comma
- (.*) : capturing group for the 2nd column (this captures everything after the first comma up to the end)
- $ : end-of-string anchor
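The breakdown above can also be checked outside Hive. The following is a minimal sketch, not part of the original answer: it uses Python's re module as a stand-in for the Java regex engine that Hive uses, on the assumption that the two behave the same for a pattern this simple. The helper name split_row and the sample rows are hypothetical.

```python
import re

# Same pattern as 'input.regex' in the table definition.
# Assumption: Python's re matches this simple pattern the same way Java's regex does.
ROW_PATTERN = re.compile(r'^"?(.*?)"?,(.*)$')


def split_row(line):
    """Return (colA, colBC) for one input line, or None if the line does not match."""
    match = ROW_PATTERN.match(line)
    return match.groups() if match else None


# Hypothetical sample rows: quoted first field, unquoted first field,
# a comma inside the quoted first field, and a row with no comma at all.
samples = ['"A",B,C', 'A,B,C', '"x,y",B,C', 'no comma here']
for line in samples:
    print(repr(line), '->', split_row(line))
```

Note the third sample: because the first group is non-greedy and stops at the first comma, a comma inside a quoted first field splits the row early. If the data can contain such values, the pattern would need to be adjusted.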
You can write a regex that captures your data correctly. To debug the regex without creating a table, use regexp_replace:
select regexp_replace('"A",B,C', --data string example
'^"?(.*?)"?,(.*)$', --regex with 2 capturing groups for 2 columns
'$1 | $2'); --pipe delimited columns should be returned