Problem description

I'm trying to create a DataFrame from a JSON file with nested fields and date fields that I'd like to concatenate:

root
 |-- MODEL: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- START_Time: struct (nullable = true)
 |    |-- day: string (nullable = true)
 |    |-- hour: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- month: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- year: string (nullable = true)
 |-- WEIGHT: string (nullable = true)
 |-- REGISTED: struct (nullable = true)
 |    |-- day: string (nullable = true)
 |    |-- hour: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- month: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- year: string (nullable = true)
 |-- TOTAL: string (nullable = true)
 |-- SCHEDULED: struct (nullable = true)
 |    |-- day: long (nullable = true)
 |    |-- hour: long (nullable = true)
 |    |-- minute: long (nullable = true)
 |    |-- month: long (nullable = true)
 |    |-- second: long (nullable = true)
 |    |-- year: long (nullable = true)
 |-- PACKAGE: string (nullable = true)

The objective is to get a result more like:

+---------+------------------+----------+-----------------+---------+-----------------+
|MODEL    |   START_Time     | WEIGHT   |REGISTED         |TOTAL    |SCHEDULED        |
+---------+------------------+----------+-----------------+---------+-----------------+
|.........| yy-mm-dd-hh-mm-ss| WEIGHT   |yy-mm-dd-hh-mm-ss|TOTAL    |yy-mm-dd-hh-mm-ss|

where yy-mm-dd-hh-mm-ss is the concatenation of the year, month, day, hour, minute and second fields of each struct in the JSON, for example:

 |-- example: struct (nullable = true)
 |    |-- day: string (nullable = true)
 |    |-- hour: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- month: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- year: string (nullable = true)

I have tried the explode function; maybe I didn't use it as it should be used, but it didn't work. Can anyone point me toward a solution? Thank you.

Solution

You can do it in the simple steps below.

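  1. Suppose the data.json file contains the following data:

    {"MODEL": "abc", "CODE": "CODE1", "START_Time": {"day": "05", "hour": "08", "minute": "30", "month": "08", "second": "30", "year": "21"}, "WEIGHT": "231", "REGISTED": {"day": "05", "hour": "08", "minute": "30", "month": "08", "second": "30", "year": "21"}, "TOTAL": "1", "SCHEDULED": {"day": "05", "hour": "08", "minute": "30", "month": "08", "second": "30", "year": "21"}, "PACKAGE": "CAR"}

This data follows the schema you shared, except that the SCHEDULED fields are strings here rather than longs.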
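If you want to recreate the input file yourself, the following is a minimal sketch using only the standard library. It writes the single sample record used in this answer as one JSON object per line, which is the layout `spark.read.json` reads by default. The helper name `write_json_lines` is hypothetical and not part of any library.

```python
import json

# Sample record used in this answer; the values are illustrative only.
date_parts = {"day": "05", "hour": "08", "minute": "30",
              "month": "08", "second": "30", "year": "21"}
record = {
    "MODEL": "abc",
    "CODE": "CODE1",
    "START_Time": dict(date_parts),
    "WEIGHT": "231",
    "REGISTED": dict(date_parts),
    "TOTAL": "1",
    "SCHEDULED": dict(date_parts),
    "PACKAGE": "CAR",
}


def write_json_lines(records, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        for rec in records:
            handle.write(json.dumps(rec) + "\n")


write_json_lines([record], "data.json")
```

A pretty-printed file that spreads one object over several lines would need the reader's `multiLine` option instead.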

  2. Read this JSON file in PySpark as below.

    from pyspark.sql.functions import col, concat, lit

    # `spark` is the SparkSession (already defined in the pyspark shell)
    df = spark.read.json('data.json')

  3. Now you can read the nested values with dotted paths and overwrite each struct column with the concatenated string, as below.

    df.withColumn(
        'START_Time',
        concat(col('START_Time.year'), lit('-'), col('START_Time.month'), lit('-'),
               col('START_Time.day'), lit('-'), col('START_Time.hour'), lit('-'),
               col('START_Time.minute'), lit('-'), col('START_Time.second')),
    ).withColumn(
        'REGISTED',
        concat(col('REGISTED.year'), lit('-'), col('REGISTED.month'), lit('-'),
               col('REGISTED.day'), lit('-'), col('REGISTED.hour'), lit('-'),
               col('REGISTED.minute'), lit('-'), col('REGISTED.second')),
    ).withColumn(
        'SCHEDULED',
        concat(col('SCHEDULED.year'), lit('-'), col('SCHEDULED.month'), lit('-'),
               col('SCHEDULED.day'), lit('-'), col('SCHEDULED.hour'), lit('-'),
               col('SCHEDULED.minute'), lit('-'), col('SCHEDULED.second')),
    ).show()

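The output would be

+-----+-----+-------+-----------------+-----------------+-----------------+-----+------+
| CODE|MODEL|PACKAGE|         REGISTED|        SCHEDULED|       START_Time|TOTAL|WEIGHT|
+-----+-----+-------+-----------------+-----------------+-----------------+-----+------+
|CODE1|  abc|    CAR|21-08-05-08-30-30|21-08-05-08-30-30|21-08-05-08-30-30|    1|   231|
+-----+-----+-------+-----------------+-----------------+-----------------+-----+------+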
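The three `withColumn` calls repeat the same pattern, so one possible variation is to generate the expression per struct column. The sketch below builds Spark SQL expression strings and passes them to `DataFrame.selectExpr`. The names `DATE_PARTS`, `joined_date_expr` and `with_joined_dates` are hypothetical helpers, and the sketch assumes every date struct has exactly the six fields shown in the question. The explicit cast is there for long-typed fields such as those of SCHEDULED in the question's schema. Note that a long value like 5 would come out as `5`, not `05`, because JSON numbers carry no leading zeros.

```python
# Field order for the yy-mm-dd-hh-mm-ss layout asked for in the question.
DATE_PARTS = ["year", "month", "day", "hour", "minute", "second"]


def joined_date_expr(struct_name, sep="-"):
    """Build a Spark SQL expression that joins one date struct's fields.

    Hypothetical helper: assumes `sep` contains no single quote.
    """
    parts = [
        f"CAST(`{struct_name}`.`{part}` AS STRING)"  # cast covers long fields
        for part in DATE_PARTS
    ]
    joiner = f", '{sep}', "
    return f"concat({joiner.join(parts)}) AS `{struct_name}`"


def with_joined_dates(df, struct_names):
    """Return df with each named struct column replaced by its joined string.

    All other columns are passed through unchanged, in their original order.
    """
    exprs = [
        joined_date_expr(name) if name in struct_names else f"`{name}`"
        for name in df.columns
    ]
    return df.selectExpr(*exprs)
```

With the `df` read in the earlier step, this could be called as `with_joined_dates(df, ["START_Time", "REGISTED", "SCHEDULED"]).show()`.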

08-11 17:49