Problem description
I have two dataframes in Scala:
df1 =
ID start_date_time
1 2016-10-12 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
and
df2 =
PK start_date
1 2016-10-12
2 2016-10-14
I need to add a new column to df1 that will have value 0 if the following condition fails and 1 otherwise: there is a row in df2 whose PK equals the ID and whose start_date falls on the same calendar day as start_date_time.
The result should look like this:
df1 =
ID start_date_time check
1 2016-10-12-11-55-23 1
2 2016-10-12-12-25-00 0
3 2016-10-12-16-20-00 0
How can I do it?
I assume that the logic should be something like this:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.udf

val define = udf { (id: String, dateString: String) =>
  val parser    = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  val formatter = new SimpleDateFormat("yyyy-MM-dd")
  val date = formatter.format(parser.parse(dateString))  // "2016-10-12 11:55:23" -> "2016-10-12"
  // look up the matching row in df2 (note: a DataFrame cannot actually be referenced inside a UDF)
  val checks = df2.filter(df2("PK") === id).filter(df2("start_date") === date)
  if (checks.collect().length > 0) "1" else "0"
}

val df1_checked = df1.withColumn("check", define(df1("ID"), df1("start_date_time")))
However, I have doubts about how to compare the dates, because df1 and df2 have dates in different formats. How can this best be implemented?
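For the date-format concern specifically, a minimal sketch (assuming Spark 2.x and the column names above): to_date from org.apache.spark.sql.functions truncates a timestamp string to its calendar date, so both columns end up as the same DateType.

import org.apache.spark.sql.functions.to_date

val d1 = to_date(df1("start_date_time"))  // "2016-10-12 11:55:23" -> 2016-10-12
val d2 = to_date(df2("start_date"))       // "2016-10-12"          -> 2016-10-12
// d1 and d2 are DateType column expressions that can be matched in a join condition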
Recommended answer
You can use Spark's datetime functions to create date columns on both df1 and df2, and then do a left join of df1 with df2. Here you create an extra constant column check on df2 to indicate whether there is a match in the result:
import org.apache.spark.sql.functions.{lit, to_date}
import spark.implicits._   // for the $"..." column syntax, assuming a SparkSession named spark (already in scope in spark-shell)

val df1_date = df1.withColumn("date", to_date(df1("start_date_time")))

val df2_date = df2.withColumn("date", to_date(df2("start_date"))).
  withColumn("check", lit(1)).
  select($"PK".as("ID"), $"date", $"check")

df1_date.join(df2_date, Seq("ID", "date"), "left").drop($"date").na.fill(0).show
+---+--------------------+-----+
| ID| start_date_time|check|
+---+--------------------+-----+
| 1|2016-10-12 11:55:...| 1|
| 2|2016-10-12 12:25:...| 0|
| 3|2016-10-12 16:20:...| 0|
+---+--------------------+-----+
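The join-based approach works where the UDF sketch in the question cannot: a UDF runs row by row on the executors and cannot query another DataFrame such as df2, while a left join lets Spark plan the lookup itself. If you prefer to make the null handling explicit instead of relying on na.fill, an equivalent variant (a sketch reusing the df1_date and df2_date defined above) is:

import org.apache.spark.sql.functions.{coalesce, lit}

val result = df1_date
  .join(df2_date, Seq("ID", "date"), "left")
  .withColumn("check", coalesce($"check", lit(0)))  // rows with no match get 0 instead of null
  .drop("date")

result.show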