Why do I get null results from the date_format() PySpark function?

Question

Suppose there is a DataFrame with a column of dates stored as strings. To illustrate, we create the following DataFrame as an example:

# Import the SQL types and functions used below
from pyspark.sql import SQLContext
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import date_format
import random
import time

def strTimeProp(start, end, fmt, prop):
    # Parse both endpoints into epoch seconds according to fmt
    stime = time.mktime(time.strptime(start, fmt))
    etime = time.mktime(time.strptime(end, fmt))
    # Pick the point a fraction `prop` of the way from start to end
    ptime = stime + prop * (etime - stime)
    return time.strftime(fmt, time.localtime(ptime))

def randomDate(start, end, prop):
    return strTimeProp(start, end, '%m-%d-%Y', prop)

# Create a test DataFrame:
schema = StructType(
    [
        StructField("dates1", StringType(), True),
        StructField("dates2", StringType(), True)
    ]
)

size = 32
numCol1 = [str(randomDate("1-1-1991", "1-1-1992", random.random())) for number in range(size)]
numCol2 = [str(randomDate("1-1-1991", "1-1-1992", random.random())) for number in range(size)]
# Build the DataFrame (assumes an existing SparkContext `sc`, as in the PySpark shell):
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(list(zip(numCol1, numCol2)),schema=schema)
df.show(5)

The code above generates two columns of random dates stored as strings. Here is an example:

+----------+----------+
|    dates1|    dates2|
+----------+----------+
|12-21-1991|05-30-1991|
|05-28-1991|01-23-1991|
|03-01-1991|08-05-1991|
|07-15-1991|05-13-1991|
|07-21-1991|11-10-1991|
+----------+----------+

What I am trying to do is change the date format with the following code (from the PySpark documentation):

# Changing date formats:
df.select(date_format('dates1', 'MM-dd-yyy').alias('newFormat')).show(5)

But every row of the result is null:

+---------+
|newFormat|
+---------+
|     null|
|     null|
|     null|
|     null|
|     null|
+---------+

I suppose the problem is related to the string data type, but at the same time I don't understand why the code below works and the code above doesn't.

fechas = ['1000-01-01', '1000-01-15']
df = sqlContext.createDataFrame(list(zip(fechas, fechas)), ['dates', 'd'])
df.show()

# Changing date formats:
df.select(date_format('dates', 'MM-dd-yyy').alias('newFormat')).show()

Output:

+----------+----------+
|     dates|         d|
+----------+----------+
|1000-01-01|1000-01-01|
|1000-01-15|1000-01-15|
+----------+----------+

+----------+
| newFormat|
+----------+
|01-01-1000|
|01-15-1000|
+----------+

This last result is what I want.

Recommended answer

It doesn't work because your data is not a valid ISO 8601 representation. date_format expects a date or timestamp, so the string column is cast implicitly, and that cast returns NULL:

sqlContext.sql("SELECT CAST('12-21-1991' AS DATE)").show()
## +----+
## | _c0|
## +----+
## |null|
## +----+
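
As a rough analogy only (plain Python, not Spark), the same behavior can be sketched with the standard library: a strict ISO 8601 parser accepts `1000-01-01` but rejects `12-21-1991`. The helper name `cast_to_date` is hypothetical and exists only for this illustration; it is not part of PySpark.

```python
from datetime import date
from typing import Optional


def cast_to_date(value: str) -> Optional[date]:
    """Hypothetical helper: parse strictly as ISO 8601 (yyyy-MM-dd), else None."""
    try:
        return date.fromisoformat(value)
    except ValueError:
        # Stands in for the NULL produced by the failed cast in the query above
        return None


iso_result = cast_to_date("1000-01-01")      # already in yyyy-MM-dd form
non_iso_result = cast_to_date("12-21-1991")  # MM-dd-yyyy is not ISO 8601
print(iso_result, non_iso_result)
```

This is why the second DataFrame in the question works while the first does not: only its strings are already in the form the cast understands.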

You'll have to parse the data first using a custom format:

from pyspark.sql.functions import date_format, unix_timestamp

output_format = ...  # Some SimpleDateFormat string
df.select(date_format(
    unix_timestamp("dates1", "MM-dd-yyyy").cast("timestamp"),
    output_format
))
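
The two-step idea (parse with an explicit input format, then format the result) can be checked on a single value in plain Python before running it on the cluster. This is a minimal sketch under assumptions: `reformat_date` is a hypothetical helper, and `strptime` directives differ from Spark's pattern letters, so `%m-%d-%Y` here plays the role of `MM-dd-yyyy`.

```python
from datetime import datetime


def reformat_date(value: str, input_format: str, output_format: str) -> str:
    """Hypothetical helper: parse with an explicit input format, then re-format."""
    parsed = datetime.strptime(value, input_format)  # step 1: custom-format parse
    return parsed.strftime(output_format)            # step 2: format the result


# "%m-%d-%Y" is the strptime counterpart of the "MM-dd-yyyy" pattern above
converted = reformat_date("12-21-1991", "%m-%d-%Y", "%Y-%m-%d")
print(converted)
```

A value that does not match the input format raises ValueError here, whereas the Spark expression yields null for that row instead.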

