This article explains how to calculate a duration by subtracting two datetime columns stored as strings; hopefully it is a useful reference for anyone facing the same problem.
Problem Description
I have a Spark DataFrame that consists of a series of dates:
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
import pandas as pd
rdd = sc.parallelize([('X01','2014-02-13T12:36:14.899','2014-02-13T12:31:56.876','sip:4534454450'),
('X02','2014-02-13T12:35:37.405','2014-02-13T12:32:13.321','sip:6413445440'),
('X03','2014-02-13T12:36:03.825','2014-02-13T12:32:15.229','sip:4534437492'),
('XO4','2014-02-13T12:37:05.460','2014-02-13T12:32:36.881','sip:6474454453'),
('XO5','2014-02-13T12:36:52.721','2014-02-13T12:33:30.323','sip:8874458555')])
schema = StructType([StructField('ID', StringType(), True),
    StructField('EndDateTime', StringType(), True),
    StructField('StartDateTime', StringType(), True),
    StructField('ANI', StringType(), True)])  # include ANI so the schema matches the four-field tuples above
df = sqlContext.createDataFrame(rdd, schema)
What I want to do is find Duration by subtracting StartDateTime from EndDateTime. I figured I'd try and do this using a function:
# Function to calculate time delta
def time_delta(y,x):
end = pd.to_datetime(y)
start = pd.to_datetime(x)
delta = (end-start)
return delta
# create new RDD and add new column 'Duration' by applying time_delta function
df2 = df.withColumn('Duration', time_delta(df.EndDateTime, df.StartDateTime))
However this just gives me:
>>> df2.show()
ID EndDateTime StartDateTime ANI Duration
X01 2014-02-13T12:36:... 2014-02-13T12:31:... sip:4534454450 null
X02 2014-02-13T12:35:... 2014-02-13T12:32:... sip:6413445440 null
X03 2014-02-13T12:36:... 2014-02-13T12:32:... sip:4534437492 null
XO4 2014-02-13T12:37:... 2014-02-13T12:32:... sip:6474454453 null
XO5 2014-02-13T12:36:... 2014-02-13T12:33:... sip:8874458555 null
I'm not sure if my approach is correct or not. If not, I'd gladly accept another suggested way to achieve this.
Solution
As of Spark 1.5 you can use unix_timestamp:
from pyspark.sql import functions as F
timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
timeDiff = (F.unix_timestamp('EndDateTime', format=timeFmt)
- F.unix_timestamp('StartDateTime', format=timeFmt))
df = df.withColumn("Duration", timeDiff)
Note the Java-style (SimpleDateFormat) time format: the literal T in the timestamps has to be quoted in the pattern.
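Two practical notes that are not part of the original answer: unix_timestamp resolves to whole seconds, so the fractional .SSS part of the strings is dropped from Duration, and a pattern that does not match the data silently produces null rather than raising an error. A minimal, hypothetical sanity check for unparseable rows:
# Hypothetical check: rows whose EndDateTime cannot be parsed with timeFmt come back as null
unparsed = df.filter(F.unix_timestamp('EndDateTime', format=timeFmt).isNull())
unparsed.count()  # expected to be 0 for the sample data above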
>>> df.show()
+---+--------------------+--------------------+--------+
| ID| EndDateTime| StartDateTime|Duration|
+---+--------------------+--------------------+--------+
|X01|2014-02-13T12:36:...|2014-02-13T12:31:...| 258|
|X02|2014-02-13T12:35:...|2014-02-13T12:32:...| 204|
|X03|2014-02-13T12:36:...|2014-02-13T12:32:...| 228|
|XO4|2014-02-13T12:37:...|2014-02-13T12:32:...| 269|
|XO5|2014-02-13T12:36:...|2014-02-13T12:33:...| 202|
+---+--------------------+--------------------+--------+
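For completeness, the asker's pandas-based approach can also work if the function is registered as a UDF, so that it runs on the row values rather than on Column objects. This is not part of the original answer, only a minimal sketch under that assumption (pandas must be available on the executors):
from pyspark.sql import functions as F
from pyspark.sql.types import LongType
import pandas as pd
# Hypothetical UDF: parse the two strings per row and return the difference in whole seconds
def time_delta_seconds(end, start):
    return int((pd.to_datetime(end) - pd.to_datetime(start)).total_seconds())
time_delta_udf = F.udf(time_delta_seconds, LongType())
df2 = df.withColumn('Duration', time_delta_udf(df.EndDateTime, df.StartDateTime))
The built-in unix_timestamp expression above avoids the per-row Python call and is generally the faster choice.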
That concludes this article on calculating a duration by subtracting two string-formatted datetime columns. We hope the answer above is helpful.