Problem Description
I have a column in a DataFrame that contains timestamps in the format (yyyy-MM-dd HH:mm:ss). I need to round each timestamp to the nearest 30 seconds.
old column            desired column
2016-02-09 19:31:02   2016-02-09 19:31:00
2016-02-09 19:31:35   2016-02-09 19:31:30
2016-02-09 19:31:52   2016-02-09 19:32:00
2016-02-09 19:31:28   2016-02-09 19:31:30
Is it possible to do that in PySpark?
Recommended Answer
If you're using Spark version 1.5+, you can use pyspark.sql.functions.second() to get the seconds from your timestamp column.
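The snippets below assume a DataFrame named df with a column old_timestamp. As a minimal sketch, sample data matching the question could be built like this (the SparkSession setup and createDataFrame call are my own addition, not part of the original answer):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data reconstructed from the question's "old column" values;
# the column name "old_timestamp" matches the snippets below.
df = spark.createDataFrame(
    [("2016-02-09 19:31:02",),
     ("2016-02-09 19:31:35",),
     ("2016-02-09 19:31:52",),
     ("2016-02-09 19:31:28",)],
    ["old_timestamp"]
)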
import pyspark.sql.functions as f
df.withColumn("second", f.second("old_timestamp")).show()
#+-------------------+------+
#| old_timestamp|second|
#+-------------------+------+
#|2016-02-09 19:31:02| 2|
#|2016-02-09 19:31:35| 35|
#|2016-02-09 19:31:52| 52|
#|2016-02-09 19:31:28| 28|
#+-------------------+------+
Once you have the seconds part, you can take this number, divide it by 30, round it, and multiply by 30 to get the "new" second.
df.withColumn("second", f.second("old_timestamp"))\
.withColumn("new_second", f.round(f.col("second")/30)*30)\
.show()
#+-------------------+------+----------+
#| old_timestamp|second|new_second|
#+-------------------+------+----------+
#|2016-02-09 19:31:02| 2| 0.0|
#|2016-02-09 19:31:35| 35| 30.0|
#|2016-02-09 19:31:52| 52| 60.0|
#|2016-02-09 19:31:28| 28| 30.0|
#+-------------------+------+----------+
From the "new" second, we can compute an offset in seconds which, when added to the original timestamp, produces the desired "rounded" timestamp.
df.withColumn("second", f.second("old_timestamp"))\
.withColumn("new_second", f.round(f.col("second")/30)*30)\
.withColumn("add_seconds", f.col("new_second") - f.col("second"))\
.show()
#+-------------------+------+----------+-----------+
#| old_timestamp|second|new_second|add_seconds|
#+-------------------+------+----------+-----------+
#|2016-02-09 19:31:02| 2| 0.0| -2.0|
#|2016-02-09 19:31:35| 35| 30.0| -5.0|
#|2016-02-09 19:31:52| 52| 60.0| 8.0|
#|2016-02-09 19:31:28| 28| 30.0| 2.0|
#+-------------------+------+----------+-----------+
As we can see, a negative number in this column means the original time has to be rounded down, while a positive number will increase the time.
In order to add this offset to the original timestamp, first convert the timestamp to a unix timestamp using pyspark.sql.functions.unix_timestamp(). After the addition, convert the result back to a timestamp using pyspark.sql.functions.from_unixtime().
Putting this all together (condensing the intermediate steps):
df.withColumn(
"add_seconds",
(f.round(f.second("old_timestamp")/30)*30) - f.second("old_timestamp")
)\
.withColumn(
"new_timestamp",
f.from_unixtime(f.unix_timestamp("old_timestamp") + f.col("add_seconds"))
)\
.drop("add_seconds")\
.show()
#+-------------------+-------------------+
#| old_timestamp| new_timestamp|
#+-------------------+-------------------+
#|2016-02-09 19:31:02|2016-02-09 19:31:00|
#|2016-02-09 19:31:35|2016-02-09 19:31:30|
#|2016-02-09 19:31:52|2016-02-09 19:32:00|
#|2016-02-09 19:31:28|2016-02-09 19:31:30|
#+-------------------+-------------------+
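As a side note, the same rounding can be expressed more compactly by rounding the unix timestamp itself to a multiple of 30 seconds. This is a sketch of an equivalent variant, not the original answer's code (the cast to long is my own addition to keep from_unixtime happy with the double produced by round):
import pyspark.sql.functions as f

df.withColumn(
    "new_timestamp",
    f.from_unixtime(
        # round the epoch seconds to the nearest multiple of 30
        (f.round(f.unix_timestamp("old_timestamp") / 30) * 30).cast("long")
    )
).show()
# produces the same new_timestamp values as the output shown above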