This article covers how to generate hourly timestamps between two dates in PySpark; the question and its solution follow below.
Problem Description
Consider this example dataframe:
import datetime as dt

# spark is an existing SparkSession
data = [(dt.datetime(2000, 1, 1, 15, 20, 37), dt.datetime(2000, 1, 1, 19, 12, 22))]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()
+-------------------+-------------------+
| minDate| maxDate|
+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|
+-------------------+-------------------+
I would like to explode those two dates into an hourly time series like
+-------------------+-------------------+
| minDate| maxDate|
+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 16:00:00|
|2000-01-01 16:01:00|2000-01-01 17:00:00|
|2000-01-01 17:01:00|2000-01-01 18:00:00|
|2000-01-01 18:01:00|2000-01-01 19:00:00|
|2000-01-01 19:01:00|2000-01-01 19:12:22|
+-------------------+-------------------+
Do you have any suggestions on how to achieve that without using UDFs?
Thanks.
Recommended Answer
This is how I finally solved it.
Input data:
import datetime as dt
from pyspark.sql import functions as fn

data = [
    (dt.datetime(2000, 1, 1, 15, 20, 37), dt.datetime(2000, 1, 1, 19, 12, 22)),
    (dt.datetime(2001, 1, 1, 15, 20, 37), dt.datetime(2001, 1, 1, 18, 12, 22)),
]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()
Result:
+-------------------+-------------------+
| minDate| maxDate|
+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|
|2001-01-01 15:20:37|2001-01-01 18:12:22|
+-------------------+-------------------+
Transformed data:
# Compute hours between min and max date
df = df.withColumn(
    'hour_diff',
    fn.ceil((fn.col('maxDate').cast('long') - fn.col('minDate').cast('long')) / 3600)
)
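# Example: for the first row, maxDate - minDate is 13,905 seconds,
# so hour_diff = ceil(13905 / 3600) = 4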
# Duplicate each row hour_diff + 1 times: repeat(',', n) builds a string of
# n commas, which split() turns into an array of n + 1 empty strings that
# posexplode() then expands into rows indexed by idx
df = df.withColumn("repeat", fn.expr("split(repeat(',', hour_diff), ',')"))\
    .select("*", fn.posexplode("repeat").alias("idx", "val"))\
    .drop("repeat", "val")\
    .withColumn('hour_add', (fn.col('minDate').cast('long') + fn.col('idx') * 3600).cast('timestamp'))
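# With hour_diff = 4, repeat(',', 4) yields ',,,,' and split() produces five
# empty strings, so posexplode() emits idx values 0 through 4: one row for
# the partial first hour plus one row per remaining hour bucket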
# Create the new start and end date according to the boundaries
df = (df
    .withColumn(
        'start_dt',
        fn.when(
            fn.col('idx') > 0,
            # Rows after the first start exactly on the hour
            (fn.floor(fn.col('hour_add').cast('long') / 3600) * 3600).cast('timestamp')
        ).otherwise(fn.col('minDate'))
    ).withColumn(
        'end_dt',
        fn.when(
            fn.col('idx') != fn.col('hour_diff'),
            # Intermediate rows end one minute before the next hour (XX:59:00);
            # the last row ends at maxDate
            (fn.ceil(fn.col('hour_add').cast('long') / 3600) * 3600 - 60).cast('timestamp')
        ).otherwise(fn.col('maxDate'))
    ).drop('hour_diff', 'idx', 'hour_add'))
df.show()
which results in
+-------------------+-------------------+-------------------+-------------------+
| minDate| maxDate| start_dt| end_dt|
+-------------------+-------------------+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 15:20:37|2000-01-01 15:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 16:00:00|2000-01-01 16:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 17:00:00|2000-01-01 17:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 18:00:00|2000-01-01 18:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 19:00:00|2000-01-01 19:12:22|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 15:20:37|2001-01-01 15:59:00|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 16:00:00|2001-01-01 16:59:00|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 17:00:00|2001-01-01 17:59:00|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 18:00:00|2001-01-01 18:12:22|
+-------------------+-------------------+-------------------+-------------------+
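As a side note, on Spark 2.4 and later the built-in sequence function offers a shorter, UDF-free route to the hourly grid. A minimal sketch under that assumption (it only produces raw timestamps stepped one hour from minDate, not snapped to hour boundaries, so the start/end trimming above would still be needed):

from pyspark.sql import functions as fn

# Requires Spark >= 2.4: sequence() accepts timestamp bounds and an interval
# step, yielding minDate, minDate + 1h, ... up to maxDate
hourly = df.select(
    "minDate",
    "maxDate",
    fn.explode(fn.expr("sequence(minDate, maxDate, interval 1 hour)")).alias("ts")
)
hourly.show(truncate=False)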
That concludes this article on generating hourly timestamps between two dates in PySpark; hopefully the answer above proves helpful.