How to change column types from String to Date in DataFrames?

Question
I have a dataframe with two columns (C, D) defined as string column type, but the data in the columns are actually dates. For example, column C has the date as "01-APR-2015" and column D as "20150401". I want to change these to date column type, but I haven't found a good way of doing that. I looked at Stack Overflow and know I need to convert the string column type to Date column type in Spark SQL's DataFrame. The date format can be "01-APR-2015"; I looked at this post but it didn't have info related to dates.
Answer
Spark >= 2.2
You can use to_date:
import org.apache.spark.sql.functions.{to_date, to_timestamp}
df.select(to_date($"ts", "dd-MMM-yyyy").alias("date"))
or to_timestamp:

df.select(to_timestamp($"ts", "dd-MMM-yyyy").alias("timestamp"))
Both skip the intermediate unix_timestamp call.
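For a quick sanity check of the "dd-MMM-yyyy" pattern outside Spark, the equivalent strptime directives in Python are "%d-%b-%Y" (this is only an illustration of the format semantics, not the Spark API; the Java pattern letters themselves come from SimpleDateFormat):

```python
from datetime import datetime

# Java pattern "dd-MMM-yyyy" (day, abbreviated month name, 4-digit year)
# corresponds to "%d-%b-%Y" in Python; strptime matches month names
# case-insensitively, so "APR" parses fine.
parsed_c = datetime.strptime("01-APR-2015", "%d-%b-%Y")
print(parsed_c.date())  # 2015-04-01

# Column D's "20150401" would use the Java pattern "yyyyMMdd",
# i.e. "%Y%m%d" in Python.
parsed_d = datetime.strptime("20150401", "%Y%m%d")
print(parsed_d.date())  # 2015-04-01
```

Both columns in the question resolve to the same calendar date, which makes it easy to verify the patterns against each other.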
Spark < 2.2
Since Spark 1.5 you can use the unix_timestamp function to parse the string to a long, cast it to timestamp, and truncate with to_date:
import org.apache.spark.sql.functions.{unix_timestamp, to_date}
val df = Seq((1L, "01-APR-2015")).toDF("id", "ts")
df.select(to_date(unix_timestamp(
$"ts", "dd-MMM-yyyy"
).cast("timestamp")).alias("timestamp"))
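The chain above works because unix_timestamp produces seconds since the Unix epoch, the cast reinterprets that long as a timestamp, and to_date truncates the time-of-day part. A plain-Python sketch of the same round trip (an illustration of the semantics, not Spark code):

```python
import calendar
from datetime import datetime, timezone

# Step 1: parse the string (what unix_timestamp does with the pattern),
# yielding seconds since the Unix epoch, UTC.
dt = datetime.strptime("01-APR-2015", "%d-%b-%Y")
epoch_seconds = calendar.timegm(dt.timetuple())
print(epoch_seconds)  # 1427846400

# Step 2: reinterpret the long as a timestamp (cast("timestamp")) and
# truncate to a calendar date (to_date).
restored = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
print(restored.date())  # 2015-04-01
```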
Note:
Depending on the Spark version, this may require some adjustments due to SPARK-11724:
Casting from integer types to timestamp treats the source integer as being in milliseconds. Casting from timestamp to integer types creates the result in seconds.
If you use an unpatched version, the unix_timestamp output requires multiplication by 1000.
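To see why the factor of 1000 matters: unix_timestamp yields seconds, but on an unpatched build the int-to-timestamp cast expects milliseconds, so the unscaled value lands decades off. A quick arithmetic illustration (plain Python, not Spark):

```python
from datetime import datetime, timezone

epoch_seconds = 1427846400  # 2015-04-01 00:00:00 UTC, as unix_timestamp returns it

# Treating the seconds value as if it were milliseconds collapses it to
# ~16.5 days after the epoch -- i.e. January 1970, not April 2015.
wrong = datetime.fromtimestamp(epoch_seconds / 1000, tz=timezone.utc)
print(wrong.year)  # 1970

# Multiplying by 1000 first gives a milliseconds value that the unpatched
# cast interprets correctly.
right = datetime.fromtimestamp(epoch_seconds * 1000 / 1000, tz=timezone.utc)
print(right.date())  # 2015-04-01
```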