I'm trying to convert the nested fields (year, month, day, hour, minute, second) into one field of DATETIME type. When I use the explode function I get an error: `cannot resolve 'explode(START_Time)' due to data type mismatch`.

The data I have:

```
|-- MODEL: string (nullable = true)
|-- START_Time: struct (nullable = true)
|    |-- day: string (nullable = true)
|    |-- hour: string (nullable = true)
|    |-- minute: string (nullable = true)
|    |-- month: string (nullable = true)
|    |-- second: string (nullable = true)
|    |-- year: string (nullable = true)
|-- WEIGHT: string (nullable = true)
|-- REGISTRED: struct (nullable = true)
|    |-- day: string (nullable = true)
|    |-- hour: string (nullable = true)
|    |-- minute: string (nullable = true)
|    |-- month: string (nullable = true)
|    |-- second: string (nullable = true)
|    |-- year: string (nullable = true)
|-- TOTAL: string (nullable = true)
```

The result I'm looking for, with START_Time and REGISTRED as DATE type:

```
+---------+------------------+--------+-----------------+------+
|MODEL    |START_Time        |WEIGHT  |REGISTRED        |TOTAL |
+---------+------------------+--------+-----------------+------+
|.........|yy-mm-dd-hh-mm-ss |WEIGHT  |yy-mm-dd-hh-mm-ss|TOTAL |
+---------+------------------+--------+-----------------+------+
```

I have tried:

```python
df.withColumn('START_Time', concat(col('START_Time.year'), lit('-'), .....)
```

but when there are empty values in the nested fields, the concatenation produces `-----`, and I get:

```
+---------+-----------+--------+-----------+------+
|MODEL    |START_Time |WEIGHT  |REGISTRED  |TOTAL |
+---------+-----------+--------+-----------+------+
|value    |-----      |value   |-----      |value |
+---------+-----------+--------+-----------+------+
```

Solution

After concatenating, you can just cast the entire column to timestamp type; Spark will handle the missing (and invalid) data for you and return null instead:

```python
from pyspark.sql import functions as F

(df
    .withColumn('raw_string_date', F.concat(
        F.col('START_TIME.year'),
        F.lit('-'),
        F.col('START_TIME.month'),
        F.lit('-'),
        F.col('START_TIME.day'),
        F.lit(' '),
        F.col('START_TIME.hour'),
        F.lit(':'),
        F.col('START_TIME.minute'),
        F.lit(':'),
        F.col('START_TIME.second'),
    ))
    .withColumn('date_type', F.col('raw_string_date').cast('timestamp'))
    .show(10, False))

# +------------------------------------+---------------+-------------------+
# |START_TIME                          |raw_string_date|date_type          |
# +------------------------------------+---------------+-------------------+
# |{1, 2, 3, 4, 5, 2021}               |2021-4-1 2:3:5 |2021-04-01 02:03:05|
# |{, , , , , }                        |--  ::         |null               |
# |{null, null, null, null, null, null}|null           |null               |
# +------------------------------------+---------------+-------------------+
```