业务场景中Hive解析Json常用案例

json在线工具

  • json格式转换在线工具
https://tool.lu/json/

业务场景中Hive解析Json常用案例-LMLPHP

  • format格式互转:
// 格式化可以合并整行显示
{"name":"John Doe","age":35,"email":"john@example.com"}

// 格式化可以展开显示,示例数据更清晰
{
  "name": "John Doe",
  "age": 35,
  "email": "john@example.com"
}
  • 转化常用yaml格式

工具提供了各种跨格式转化

age: 35
email: john@example.com
name: 'John Doe'

根据json串自动生成生成 Java 类等功能也非常实用

示例准备

需要Hive环境

  • test测试表
-- 在Hive中建测试表
CREATE TABLE json_test_table (
    id INT,
    json_column STRING
) STORED AS TEXTFILE;

示例 1: 提取单个字段

假设 JSON 数据包含用户信息

{
  "name": "John Doe",
  "age": 35,
  "email": "john@example.com"
}
  • 样例数据
insert into json_test_table values (1,'{"name":"John Doe","age":35,"email":"john@example.com"}');

select * from json_test_table where id = 1;
+------+---------------------------------------------------------+
| id   | json_column                                             |
+------+---------------------------------------------------------+
|    1 | {"name":"John Doe","age":35,"email":"john@example.com"} |
+------+---------------------------------------------------------+

使用 get_json_object 函数从 JSON 中提取字段:

SELECT get_json_object(json_column, '$.name') AS name FROM json_test_table where id = 1;

+-----------+
|   name    |
+-----------+
| John Doe  |
+-----------+
1 row selected (19.655 seconds)


SELECT get_json_object(json_column, '$.name') AS name,get_json_object(json_column, '$.age') AS age FROM json_test_table where id = 1;
+-----------+------+
|   name    | age  |
+-----------+------+
| John Doe  | 35   |
+-----------+------+
1 row selected (19.396 seconds)

示例 2: 提取嵌套字段

在 JSON 数据中有嵌套字段的情况下:

{
  "user": {
    "name": "Alice",
    "address": {
      "city": "New York",
      "zipcode": "10001"
    }
  }
}
  • 样例数据
insert into json_test_table values (2,'{"user":{"name":"Alice","address":{"city":"New York","zipcode":"10001"}}}');

select * from json_test_table where id = 2;
+------+---------------------------------------------------------------------------+
| id   | json_column                                                               |
+------+---------------------------------------------------------------------------+
|    2 | {"user":{"name":"Alice","address":{"city":"New York","zipcode":"10001"}}} |
+------+---------------------------------------------------------------------------+

提取嵌套字段的值:

SELECT get_json_object(json_column, '$.user.name') AS user_name,
       get_json_object(json_column, '$.user.address.city') AS city
FROM json_test_table where id = 2;

+------------+-----------+
| user_name  |   city    |
+------------+-----------+
| Alice      | New York  |
+------------+-----------+
1 row selected (19.38 seconds)

示例 3: 提取数组中的值

假设 JSON 数据包含一个数组:

{
  "tags": ["apple", "banana", "orange"]
}
  • 样例数据
insert into json_test_table values (3,'{"tags":["apple","banana","orange"]}');

select * from json_test_table where id = 3;
+------+--------------------------------------+
| id   | json_column                          |
+------+--------------------------------------+
|    3 | {"tags":["apple","banana","orange"]} |
+------+--------------------------------------+

从数组中提取值:

SELECT get_json_object(json_column, '$.tags[0]') AS first_tag,
       get_json_object(json_column, '$.tags[1]') AS second_tag
FROM json_test_table where id = 3;

+------------+-------------+
| first_tag  | second_tag  |
+------------+-------------+
| apple      | banana      |
+------------+-------------+
1 row selected (18.488 seconds)

示例 4: 提取数组列表长度

获取数组的长度:

-- 复用 示例3中样例数据
select * from json_test_table where id = 3;
+------+--------------------------------------+
| id   | json_column                          |
+------+--------------------------------------+
|    3 | {"tags":["apple","banana","orange"]} |
+------+--------------------------------------+
-- 可以看到数组tags对应的值有3个


SELECT size(split(get_json_object(json_column, '$.tags'), ',')) AS tags_length
FROM json_test_table
WHERE id = 3;

+--------------+
| tags_length  |
+--------------+
| 3            |
+--------------+
1 row selected (35.766 seconds)

示例 5:业务场景-解析公司基本信息案例

  • 假设JSON数据包含公司的基本信息,比如公司名称、注册资本等字段
{
  "company_name": "wangting_company",
  "registration": {
    "registered_capital": 1000000,
    "registered_date": "2020-01-01",
    "registered_address": "123 Main St, City"
  }
}

{"company_name":"wangting_company","registration":{"registered_capital":1000000,"registered_date":"2020-01-01","registered_address":"123 Main St, City"}}
-- 创建表
CREATE TABLE company_info (
    company_name STRING,
    registered_capital INT,
    registered_date STRING,
    registered_address STRING
) STORED AS TEXTFILE;

-- 转化插入数据
INSERT INTO company_info
SELECT
    get_json_object(json_data, '$.company_name') AS company_name,
    get_json_object(json_data, '$.registration.registered_capital') AS registered_capital,
    get_json_object(json_data, '$.registration.registered_date') AS registered_date,
    get_json_object(json_data, '$.registration.registered_address') AS registered_address
FROM
    (SELECT '{"company_name": "wangting_company", "registration": {"registered_capital": 1000000, "registered_date": "2020-01-01", "registered_address": "123 Main St, City"}}' AS json_data) t;
    
+----------------------------+----------------------------------+-------------------------------+----------------------------------+
| company_info.company_name  | company_info.registered_capital  | company_info.registered_date  | company_info.registered_address  |
+----------------------------+----------------------------------+-------------------------------+----------------------------------+
| wangting_company           | 1000000                          | 2020-01-01                    | 123 Main St, City                |
+----------------------------+----------------------------------+-------------------------------+----------------------------------+
1 row selected (0.411 seconds)

示例 6: 使用 LATERAL VIEW 解析数组

使用 LATERAL VIEWexplode 解析 JSON 数组:

-- 复用示例3样例数据
SELECT id, tag
FROM json_test_table
LATERAL VIEW explode(split(get_json_object(json_column, '$.tags'), ',')) exploded_tags AS tag
WHERE id = 3;

+-----+------------+
| id  |    tag     |
+-----+------------+
| 3   | ["apple"   |
| 3   | "banana"   |
| 3   | "orange"]  |
+-----+------------+
3 rows selected (17.318 seconds)

-- 仅展示解析数据,实际使用如需去除方括号,可以一并清洗

示例 7: 提取数组对象的值

如果数组包含对象,则提取对象的值:

{
  "users": [
    {
      "name": "Alice",
      "age": 28
    },
    {
      "name": "Bob",
      "age": 35
    }
  ]
}

{"users":[{"name":"Alice","age":28},{"name":"Bob","age":35}]}
  • 样例数据
insert into json_test_table values (7,'{"users":[{"name":"Alice","age":28},{"name":"Bob","age":35}]}');

select * from json_test_table where id = 7;
+------+---------------------------------------------------------------+
| id   | json_column                                                   |
+------+---------------------------------------------------------------+
|    7 | {"users":[{"name":"Alice","age":28},{"name":"Bob","age":35}]} |
+------+---------------------------------------------------------------+
SELECT get_json_object(json_column, '$.users[0].name') AS user1_name,
       get_json_object(json_column, '$.users[1].name') AS user2_name
FROM json_test_table where id = 7;

+-------------+-------------+
| user1_name  | user2_name  |
+-------------+-------------+
| Alice       | Bob         |
+-------------+-------------+
1 row selected (17.372 seconds)

示例 8: 处理日期

处理 JSON 中的日期字段:

{
  "event_date": "2023-11-21"
}

{"event_date":"2023-11-21"}
  • 样例数据
insert into json_test_table values (8,'{"event_date":"2023-11-21"}');

select * from json_test_table where id = 8;
+------+-----------------------------+
| id   | json_column                 |
+------+-----------------------------+
|    8 | {"event_date":"2023-11-21"} |
+------+-----------------------------+
SELECT get_json_object(json_column, '$.event_date') AS event_date
FROM json_test_table where id = 8;

+-------------+
| event_date  |
+-------------+
| 2023-11-21  |
+-------------+
1 row selected (17.436 seconds)


SELECT id, CAST(get_json_object(json_column, '$.event_date') AS DATE) AS event_date,
       YEAR(CAST(get_json_object(json_column, '$.event_date') AS DATE)) AS event_year,
       MONTH(CAST(get_json_object(json_column, '$.event_date') AS DATE)) AS event_month
FROM json_test_table
WHERE id = 8;

+-----+-------------+-------------+--------------+
| id  | event_date  | event_year  | event_month  |
+-----+-------------+-------------+--------------+
| 8   | 2023-11-21  | 2023        | 11           |
+-----+-------------+-------------+--------------+
1 row selected (17.363 seconds)

示例 9: 提取布尔值

提取 JSON 中的布尔字段:

{
  "is_active": true
}

{"is_active":true}
  • 样例数据
insert into json_test_table values (9,'{"is_active":true}');

select * from json_test_table where id = 9;
+------+--------------------+
| id   | json_column        |
+------+--------------------+
|    9 | {"is_active":true} |
+------+--------------------+
SELECT get_json_object(json_column, '$.is_active') AS is_active
FROM json_test_table where id = 9;
+------------+
| is_active  |
+------------+
| true       |
+------------+
1 row selected (18.401 seconds)


--  使用WHERE子句根据布尔值进行条件过滤。
SELECT *
FROM json_test_table
WHERE get_json_object(json_column, '$.is_active') = 'true';

+---------------------+------------------------------+
| json_test_table.id  | json_test_table.json_column  |
+---------------------+------------------------------+
| 9                   | {"is_active":true}           |
+---------------------+------------------------------+
1 row selected (17.327 seconds)

示例 10: json_tuple计算 JSON 对象数量

计算 JSON 对象中包含的键值对数量:

SELECT json_tuple(json_column, 'name', 'age') AS (name, age)
FROM json_test_table;

+-----------+-------+
|   name    |  age  |
+-----------+-------+
| John Doe  | 35    |
| NULL      | NULL  |
| NULL      | NULL  |
| NULL      | NULL  |
| NULL      | NULL  |
| NULL      | NULL  |
+-----------+-------+
6 rows selected (17.344 seconds)

-- 只有id=1符合条件 不符合条件均为NULL

示例 11: 使用 json_tuple 提取值

从 JSON 对象中使用 json_tuple 提取多个字段:

{
  "name": "John",
  "age": 30,
  "address": "New York"
}

{"name":"John","age":30,"address":"New York"}
  • 样例数据
insert into json_test_table values (11,'{"name":"John","age":30,"address":"New York"}');

select * from json_test_table where id = 11;
+------+-----------------------------------------------+
| id   | json_column                                   |
+------+-----------------------------------------------+
|   11 | {"name":"John","age":30,"address":"New York"} |
+------+-----------------------------------------------+
SELECT json_tuple(json_column, 'name', 'age', 'address') AS (name, age, address)
FROM json_test_table where id = 11;

+-------+------+-----------+
| name  | age  |  address  |
+-------+------+-----------+
| John  | 30   | New York  |
+-------+------+-----------+
1 row selected (17.414 seconds)

示例 12: 提取数字字段

从 JSON 中提取数字字段:

{
  "count": 50
}

{"count":50}
  • 样例数据
insert into json_test_table values (12,'{"count":50}');

select * from json_test_table where id = 12;
+------+--------------+
| id   | json_column  |
+------+--------------+
|   12 | {"count":50} |
+------+--------------+
SELECT get_json_object(json_column, '$.count') AS count
FROM json_test_table where id = 12;

+--------+
| count  |
+--------+
| 50     |
+--------+
1 row selected (19.401 seconds)

示例 13: 使用 get_json_object 过滤数据

根据 JSON 中的字段值进行过滤:

SELECT *
FROM json_test_table
WHERE get_json_object(json_column, '$.age') > 30;

+---------------------+----------------------------------------------------+
| json_test_table.id  |            json_test_table.json_column             |
+---------------------+----------------------------------------------------+
| 1                   | {"name":"John Doe","age":35,"email":"john@example.com"} |
+---------------------+----------------------------------------------------+
1 row selected (18.402 seconds)


SELECT *
FROM json_test_table
WHERE get_json_object(json_column, '$.age') > 10;

+---------------------+----------------------------------------------------+
| json_test_table.id  |            json_test_table.json_column             |
+---------------------+----------------------------------------------------+
| 1                   | {"name":"John Doe","age":35,"email":"john@example.com"} |
| 11                  | {"name":"John","age":30,"address":"New York"}      |
+---------------------+----------------------------------------------------+
2 rows selected (17.298 seconds)

示例 14: 处理空值

处理 JSON 数据中可能的空值:

{
  "status": null
}

{"status":null}
  • 样例数据
insert into json_test_table values (14,'{"status":null}');

select * from json_test_table where id = 14;
+------+-----------------+
| id   | json_column     |
+------+-----------------+
|   14 | {"status":null} |
+------+-----------------+
SELECT get_json_object(json_column, '$.status') AS status
FROM json_test_table where id = 14;

+---------+
| status  |
+---------+
| NULL    |
+---------+
1 row selected (17.345 seconds)


--  使用IS NULL或IS NOT NULL来检查字段是否为空
SELECT id, json_column
FROM json_test_table
WHERE id = 14 AND get_json_object(json_column, '$.status') IS NULL;

+-----+------------------+
| id  |   json_column    |
+-----+------------------+
| 14  | {"status":null}  |
+-----+------------------+
1 row selected (17.47 seconds)



-- 使用CASE语句对NULL值进行处理或者转换为其他值。
SELECT id, CASE WHEN get_json_object(json_column, '$.status') IS NULL THEN 'No Status' ELSE get_json_object(json_column, '$.status') END AS status
FROM json_test_table
WHERE id = 14;

+-----+------------+
| id  |   status   |
+-----+------------+
| 14  | No Status  |
+-----+------------+
1 row selected (17.31 seconds)

示例 15: 使用 get_json_object 判断是否存在字段

判断 JSON 是否包含特定字段:

SELECT 
  CASE WHEN get_json_object(json_column, '$.name') IS NOT NULL THEN 'Exists' ELSE 'Not Exists' END AS field_status
FROM json_test_table;
+---------------+
| field_status  |
+---------------+
| Exists        |
| Not Exists    |
| Exists        |
| Not Exists    |
| Not Exists    |
| Not Exists    |
| Not Exists    |
| Not Exists    |
| Not Exists    |
+---------------+


SELECT 
  CASE WHEN get_json_object(json_column, '$.age') IS NOT NULL THEN 'Exists' ELSE 'Not Exists' END AS field_status
FROM json_test_table;

+---------------+
| field_status  |
+---------------+
| Exists        |
| Not Exists    |
| Exists        |
| Not Exists    |
| Not Exists    |
| Not Exists    |
| Not Exists    |
| Not Exists    |
| Not Exists    |
+---------------+


SELECT 
  CASE WHEN get_json_object(json_column, '$.aaaaaaaaa') IS NOT NULL THEN 'Exists' ELSE 'Not Exists' END AS field_status
FROM json_test_table;

+---------------+
| field_status  |
+---------------+
| Not Exists    |
| Not Exists    |
| Not Exists    |
| Not Exists    |
| Not Exists    |
| Not Exists    |
| Not Exists    |
| Not Exists    |
| Not Exists    |
+---------------+
9 rows selected (17.337 seconds)

示例 16: 提取多层嵌套对象的值

从多层嵌套的 JSON 对象中提取值:

{
  "user": {
    "details": {
      "name": "Alice",
      "age": 30
    }
  }
}

{"user":{"details":{"name":"Alice","age":30}}}
  • 样例数据
insert into json_test_table values (16,'{"user":{"details":{"name":"Alice","age":30}}}');

select * from json_test_table where id = 16;
+------+------------------------------------------------+
| id   | json_column                                    |
+------+------------------------------------------------+
|   16 | {"user":{"details":{"name":"Alice","age":30}}} |
+------+------------------------------------------------+
SELECT 
  get_json_object(json_column, '$.user.details.name') AS user_name,
  get_json_object(json_column, '$.user.details.age') AS user_age
FROM json_test_table where id = 16;

+------------+-----------+
| user_name  | user_age  |
+------------+-----------+
| Alice      | 30        |
+------------+-----------+
1 row selected (16.253 seconds)

示例 17: 提取嵌套对象数组的值

从嵌套的对象数组中提取值:

{
  "employees": [
    {
      "name": "Alice",
      "department": "HR"
    },
    {
      "name": "Bob",
      "department": "Engineering"
    }
  ]
}

{"employees":[{"name":"Alice","department":"HR"},{"name":"Bob","department":"Engineering"}]}
  • 样例数据
insert into json_test_table values (17,'{"employees":[{"name":"Alice","department":"HR"},{"name":"Bob","department":"Engineering"}]}');

select * from json_test_table where id = 17;
+------+----------------------------------------------------------------------------------------------+
| id   | json_column                                                                                  |
+------+----------------------------------------------------------------------------------------------+
|   17 | {"employees":[{"name":"Alice","department":"HR"},{"name":"Bob","department":"Engineering"}]} |
+------+----------------------------------------------------------------------------------------------+
SELECT 
  get_json_object(json_column, '$.employees[0].name') AS employee1_name,
  get_json_object(json_column, '$.employees[1].name') AS employee2_name
FROM json_test_table where id = 17;

+-----------------+-----------------+
| employee1_name  | employee2_name  |
+-----------------+-----------------+
| Alice           | Bob             |
+-----------------+-----------------+
1 row selected (17.273 seconds)

示例 18: 提取数组对象的多个值

取数组对象的多个值

从数组对象中提取多个值:

{
  "items": [
    {
      "id": 1,
      "name": "Item 1"
    },
    {
      "id": 2,
      "name": "Item 2"
    }
  ]
}

{"items":[{"id":1,"name":"Item 1"},{"id":2,"name":"Item 2"}]}
  • 样例数据
insert into json_test_table values (18,'{"items":[{"id":1,"name":"Item 1"},{"id":2,"name":"Item 2"}]}');

select * from json_test_table where id = 18;
+------+---------------------------------------------------------------+
| id   | json_column                                                   |
+------+---------------------------------------------------------------+
|   18 | {"items":[{"id":1,"name":"Item 1"},{"id":2,"name":"Item 2"}]} |
+------+---------------------------------------------------------------+
SELECT 
  get_json_object(json_column, '$.items[0].id') AS item1_id,
  get_json_object(json_column, '$.items[0].name') AS item1_name,
  get_json_object(json_column, '$.items[1].id') AS item2_id,
  get_json_object(json_column, '$.items[1].name') AS item2_name
FROM json_test_table where id = 18;

+-----------+-------------+-----------+-------------+
| item1_id  | item1_name  | item2_id  | item2_name  |
+-----------+-------------+-----------+-------------+
| 1         | Item 1      | 2         | Item 2      |
+-----------+-------------+-----------+-------------+
1 row selected (17.386 seconds)

示例19: json_serde解析映射数据文件的使用介绍

  • json_serde库提供了一种在Hive中直接将JSON格式的数据解析成表的方式,可简化处理流程。

如果现在有一个json文件:

[app@ali-wangting wangt]$ cat wangt.json

{"name":"wangting01","age":21,"email":"wangting01@example.com"}
{"name":"wangting02","age":22,"email":"wangting02@example.com"}
{"name":"wangting03","age":23,"email":"wangting03@example.com"}
{"name":"wangting04","age":24,"email":"wangting04@example.com"}
{"name":"wangting05","age":25,"email":"wangting05@example.com"}
{"name":"wangting06","age":26,"email":"wangting06@example.com"}
{"name":"wangting07","age":27,"email":"wangting07@example.com"}
{"name":"wangting08","age":28,"email":"wangting08@example.com"}

现在希望建立一张表,使得表可以直接映射上数据文件

  • hive中建表
CREATE TABLE IF NOT EXISTS wangt_666_json(
  name string,
  age int,
  email string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile;


desc wangt_json_table;
+-----------+------------+--------------------+
| col_name  | data_type  |      comment       |
+-----------+------------+--------------------+
| name      | string     | from deserializer  |
| age       | int        | from deserializer  |
| email     | string     | from deserializer  |
+-----------+------------+--------------------+
  • 将json文件上传至建表对应hdfs路径
[app@ali-wangting wangt]$ hdfs dfs -ls /user/hive/warehouse/wangt.db/wangt_666_json/
[app@ali-wangting wangt]$ hdfs dfs -put wangt.json /user/hive/warehouse/wangt.db/wangt_666_json/
[app@ali-wangting wangt]$ hdfs dfs -ls /user/hive/warehouse/wangt.db/wangt_666_json/
Found 1 items
-rwxrwx--x+  2 hive hive        512 2023-12-04 16:38 /user/hive/warehouse/wangt.db/wangt_666_json/wangt.json

# 也可以通过hive的sql命令行直接load数据文件
load data local inpath '/home/wangt/wangt.json' into table wangt_666_json;
  • 查看表数据验证
select * from wangt_666_json;
+----------------------+---------------------+-------------------------+
| wangt_666_json.name  | wangt_666_json.age  |  wangt_666_json.email   |
+----------------------+---------------------+-------------------------+
| wangting01           | 21                  | wangting01@example.com  |
| wangting02           | 22                  | wangting02@example.com  |
| wangting03           | 23                  | wangting03@example.com  |
| wangting04           | 24                  | wangting04@example.com  |
| wangting05           | 25                  | wangting05@example.com  |
| wangting06           | 26                  | wangting06@example.com  |
| wangting07           | 27                  | wangting07@example.com  |
| wangting08           | 28                  | wangting08@example.com  |
+----------------------+---------------------+-------------------------+
8 rows selected (0.5 seconds)

可以看到数据文件8行记录,成功映射对应了hive中wangt_666_json表的8条数据

12-04 23:05