Extract the date from a string column containing a timestamp in PySpark

Problem description

I have a dataframe with a date column in the following format:

+----------------------+
|date                  |
+----------------------+
|May 6, 2016 5:59:34 AM|
+----------------------+

I want to extract the date from this in the format YYYY-MM-DD, so for the date above the result should be 2016-05-06.

But when I extract it using the following:

df.withColumn('part_date', from_unixtime(unix_timestamp(df.date, "MMM dd, YYYY hh:mm:ss aa"), "yyyy-MM-dd"))

I get the following date:

2015-12-27

Can anyone please advise on this? I do not want to convert my df to an rdd to use Python's datetime functions; I want to do this within the dataframe itself.

Recommended answer

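There are some errors in your pattern. Most importantly, uppercase YYYY denotes the week-based year in Java-style date patterns, whereas lowercase yyyy is the calendar year. Here's a suggestion: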

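from pyspark.sql.functions import from_unixtime, unix_timestamp

# Lowercase yyyy is the calendar year; d and h match unpadded day and hour.
from_pattern = 'MMM d, yyyy h:mm:ss aa'
to_pattern = 'yyyy-MM-dd'
df.withColumn('part_date', from_unixtime(unix_timestamp(df['date'], from_pattern), to_pattern)).show(truncate=False)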

+----------------------+----------+
|date                  |part_date |
+----------------------+----------+
|May 6, 2016 5:59:34 AM|2016-05-06|
+----------------------+----------+
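
As for why the original pattern returned 2015-12-27: as far as I understand the legacy Java parser, a pattern that gives only a week-based year (YYYY) and no week number resolves to the first day of week 1 of that week-year. The following is a minimal plain-Python sketch of that arithmetic, not Spark's actual code. It assumes US-style week rules (weeks start on Sunday, and week 1 is the week containing January 1), and first_day_of_week_year is a hypothetical helper name used only for illustration.

```python
from datetime import date, timedelta


def first_day_of_week_year(year: int) -> date:
    """Return the start of week 1 of `year` under assumed US-style rules:
    weeks begin on Sunday and week 1 is the week containing January 1."""
    jan1 = date(year, 1, 1)
    # date.weekday() returns Monday=0 ... Sunday=6, so this is the number
    # of days elapsed since the most recent Sunday (0 if jan1 is a Sunday).
    days_since_sunday = (jan1.weekday() + 1) % 7
    return jan1 - timedelta(days=days_since_sunday)


# January 1, 2016 was a Friday, so week 1 of week-year 2016 began
# on the preceding Sunday.
print(first_day_of_week_year(2016))  # 2015-12-27
```

Under those assumptions, the month and day in the input string are effectively overridden by the week-year resolution, which matches the unexpected date in the question.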
