How to convert a string to a timestamp with nanoseconds in PySpark
Question
I am working with data whose timestamps contain nanoseconds, and I am trying to convert the strings to a timestamp format.
The "Time" column looks like this:
+---------------+
| Time |
+---------------+
|091940731349000|
|092955002327000|
|092955004088000|
+---------------+
I would like to convert it to:
+------------------+
| Timestamp |
+------------------+
|09:19:40.731349000|
|09:29:55.002327000|
|09:29:55.004088000|
+------------------+
From what I have found online, I don't need a udf for this, and there should be a native function I can use.
I have tried cast and to_timestamp, but got null values:
df_new = df.withColumn('Timestamp', df.Time.cast("timestamp"))
df_new.select('Timestamp').show()
+---------+
|Timestamp|
+---------+
| null|
| null|
+---------+
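Recommended answer
There are two problems with your code:
- The input is not a valid timestamp representation.
- Spark doesn't provide a type that can represent a time without a date component.
The closest you can get to the required output is to convert the input to the JDBC-compliant java.sql.Timestamp format. Note that Spark timestamps have microsecond precision, so the last three digits are dropped by the cast: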
from pyspark.sql.functions import col, regexp_replace

# Sample data: HHMMSS followed by nine fractional-second digits
df = spark.createDataFrame(
    ["091940731349000", "092955002327000", "092955004088000"],
    "string"
).toDF("time")

# Rebuild the string as "1970-01-01 HH:MM:SS.nnnnnnnnn", then cast it
df.select(regexp_replace(
    col("time"),
    "^(\\d{2})(\\d{2})(\\d{2})(\\d{9}).*",
    "1970-01-01 $1:$2:$3.$4"
).cast("timestamp").alias("time")).show(truncate=False)
# +--------------------------+
# |time |
# +--------------------------+
# |1970-01-01 09:19:40.731349|
# |1970-01-01 09:29:55.002327|
# |1970-01-01 09:29:55.004088|
# +--------------------------+
If you want just a string, skip the cast and limit the output to:
df.select(regexp_replace(
    col("time"),
    "^(\\d{2})(\\d{2})(\\d{2})(\\d{9}).*",
    "$1:$2:$3.$4"
).alias("time")).show(truncate=False)
# +------------------+
# |time |
# +------------------+
# |09:19:40.731349000|
# |09:29:55.002327000|
# |09:29:55.004088000|
# +------------------+
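The regular expression used above can be checked outside Spark before running it on a cluster. The following is a minimal plain-Python sketch, not part of the original answer: the helper name `format_time` and its `with_date` flag are hypothetical, and Python's `re` module writes group references as `\g<n>` where Spark's `regexp_replace` uses `$n`.

```python
import re

# Same pattern as in the Spark examples: HH, MM, SS, then nine fractional digits.
PATTERN = re.compile(r"^(\d{2})(\d{2})(\d{2})(\d{9}).*")


def format_time(raw: str, with_date: bool = False) -> str:
    """Reformat 'HHMMSSnnnnnnnnn' as 'HH:MM:SS.nnnnnnnnn' (hypothetical helper)."""
    # Python's re uses \g<n> for group references where Spark uses $n.
    template = r"\g<1>:\g<2>:\g<3>.\g<4>"
    if with_date:
        # Mirrors the "1970-01-01 ..." replacement used before the cast.
        template = "1970-01-01 " + template
    # Input that does not match the pattern is returned unchanged.
    return PATTERN.sub(template, raw)


samples = ["091940731349000", "092955002327000", "092955004088000"]
formatted = [format_time(s) for s in samples]
print(formatted)
```

This only exercises the string rewriting step; the cast to a timestamp and its microsecond truncation happen in Spark and are not reproduced here.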