Problem description
I have an original input json snippet ('/home/user/testsample.json') -
{"key": "somehashvalue","columns": [["Event:2014-03-26 00\\:29\\:13+0200:json","{\"user\":{\"credType\":\"ADDRESS\",\"credValue\":\"01:AA:A4:G1:HH:UU\",\"cAgent\":null,\"cType\":\"ACE\"},\"timestamp\":1395786553,\"sessionId\":1395785353,\"className\":\"Event\",\"subtype\":\"CURRENTLYACTIVE\",\"vType\":\"TEST\",\"vId\":1235080,\"eType\":\"CURRENTLYACTIVE\",\"eData\":\"1\"}",1395786553381001],["Event:2014-03-26 00\\:29\\:13+0200:","",1395786553381001]]}
I tried to use a JSON SerDe to parse the above JSON into my Hive columns. However, the value 1395786553381001 is not present in a format the SerDe can map to a Hive column, i.e. it appears without a key (the SerDe only understands JSON values that follow a key and a colon).
So instead I took the Array type approach and created a table -
CREATE TABLE mytesttable (
key string,
columns array<array<string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
LOAD DATA LOCAL INPATH '/home/user/testsample.json'
OVERWRITE INTO TABLE mytesttable;
select columns[0][1] from mytesttable;
gives -
{"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"}
The above appears clean, but then I also need columns[*][2], i.e. the trailing value, in a Hive column for further transformations.
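One way to get at both pieces without rewriting the input (a sketch, assuming the mytesttable definition above and Hive's built-in get_json_object UDF): columns[0][1] is itself a JSON string, so nested fields can be pulled out of it with a JSON path, while columns[0][2] is already a plain string element of the array.
SELECT
  get_json_object(columns[0][1], '$.user.credType') AS cred_type,       -- nested field inside the JSON string
  get_json_object(columns[0][1], '$.timestamp')     AS event_ts,        -- top-level field inside the JSON string
  columns[0][2]                                      AS trailing_value  -- the bare 1395786553381001
FROM mytesttable;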
I wrote a regex hive query to cleanse the original Json present in '/home/user/testsample.json'
(assume it is present in a table tablewithinputjson)
SELECT
REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(ij.columna, '["][{]', '{'),'[}]["]', '}'), '\\\\', '') AS columna
FROM tablewithinputjson ij;
The above query returns -
{"key": "somehashvalue","columns": [["Event:2014-03-26 00:29:13+0200:json",{"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"},1395786553381001],["Event:2014-03-26 00:29:13+0200:","",1395786553381001]]}
But here again, 1395786553381001 cannot be mapped to a Hive column since it appears after a comma rather than after a colon; more specifically, this value is present without a key. (I could add "test": before 1395786553381001, but I do not want to customize the input data, since a) too much customization is something I am not comfortable with, b) it does not seem to be a good solution, and c) it would unnecessarily waste my Hadoop cluster's space and time.)
Not to confuse things any further: I am not able to come up with a Hive table format that fully parses and maps all the fields in the original JSON snippet. Any suggestions are welcome. Please let me know if it seems too confusing.
Posting an end-to-end solution. Step-by-step procedure to convert JSON to a Hive table:
step 1) install maven if not there already
>$ sudo apt-get install maven
step 2) install git if not there already, then clone the Hive-JSON-Serde repository
>sudo git clone https://github.com/rcongiu/Hive-JSON-Serde.git
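If git itself is missing, it can be installed first the same way as maven (assuming a Debian/Ubuntu system, as in step 1):
>sudo apt-get install git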
step 3) go into the $HOME/Hive-JSON-Serde folder
step 4) build the serde package
>sudo mvn -Pcdh5 clean package
step 5) The serde file will be in $HOME/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar
step 6) Add the serde as a dependency jar in hive
hive> ADD JAR $HOME/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar;
step 7) create json file in $HOME/books.json (Example)
{"value": [{"id": "1","bookname": "A","properties": {"subscription": "1year","unit": "3"}},{"id": "2","bookname":"B","properties":{"subscription": "2years","unit": "5"}}]}
step 8) create tmp1 table in hive
hive>CREATE TABLE tmp1 (
value ARRAY<struct<id:string,bookname:string,properties:struct<subscription:string,unit:string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.value' = 'value'
)
STORED AS TEXTFILE;
step 9) load the data from json to tmp1 table
>LOAD DATA LOCAL INPATH '$HOME/books.json' INTO TABLE tmp1;
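As a quick sanity check (assuming the load succeeded), the nested struct fields should already be reachable on tmp1 with the usual array/struct syntax:
hive>SELECT value[0].bookname, value[0].properties.subscription FROM tmp1;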
step 10) create a tmp2 table to do the explode operation from tmp1; this intermediate step breaks the multi-level JSON structure into multiple rows. Note: if your JSON structure is simple and single-level, skip this step
hive>create table tmp2 as
SELECT *
FROM tmp1
LATERAL VIEW explode(value) itemTable AS items;
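The exploded struct is exposed through the items alias, so each book now sits in its own row; a quick check (again, assuming the tables above) would be:
hive>SELECT items.id, items.bookname FROM tmp2;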
step 11) create hive table and load the values from tmp2 table
hive>create table books as
select items.id as id, items.bookname as name, items.properties.subscription as subscription, items.properties.unit as unit from tmp2;
step 12) drop tmp tables
hive>drop table tmp1;
hive>drop table tmp2;
step 13) test the hive table
hive>select * from books;
output:
id name subscription unit
1 A 1year 3
2 B 2years 5