Question
How can I loop through a Spark data frame? I have a data frame that consists of:
time, id, direction
10, 4, True //here 4 enters --> (4,)
20, 5, True //here 5 enters --> (4,5)
34, 5, False //here 5 leaves --> (4,)
67, 6, True //here 6 enters --> (4,6)
78, 6, False //here 6 leaves --> (4,)
99, 4, False //here 4 leaves --> ()
It is sorted by time, and now I would like to step through it and accumulate the valid ids. An id enters on direction == True and exits on direction == False,
so the resulting RDD should look like this:
time, valid_ids
(10, (4,))
(20, (4,5))
(34, (4,))
(67, (4,6))
(78, (4,))
(99, ())
I know that this will not parallelize, but the df is not that big. So how could this be done in Spark/Scala?
Recommended answer
If the data is small ("but the df is not that big"), I'd just collect it and process it using Scala collections. If the types are as shown below:
df.printSchema
root
|-- time: integer (nullable = false)
|-- id: integer (nullable = false)
|-- direction: boolean (nullable = false)
you can collect:
val data = df.as[(Int, Int, Boolean)].collect.toSeq
and then apply scanLeft:
val result = data.scanLeft((-1, Set[Int]())) {
  case ((_, acc), (time, id, true))  => (time, acc + id)  // id enters: add to the set
  case ((_, acc), (time, id, false)) => (time, acc - id)  // id exits: remove from the set
}.tail  // drop the (-1, Set()) seed
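To see what the scanLeft step produces, here is a minimal, Spark-free sketch over the sample rows from the question: the `data` Seq below hard-codes what `collect` would return, so it runs with plain Scala collections only.

```scala
// Sample rows as collected from the DataFrame: (time, id, direction)
val data = Seq(
  (10, 4, true),
  (20, 5, true),
  (34, 5, false),
  (67, 6, true),
  (78, 6, false),
  (99, 4, false)
)

// scanLeft threads the accumulator (the set of currently valid ids)
// through the rows: add the id on entry, remove it on exit.
// The seed (-1, Set()) is a dummy start state, dropped by .tail.
val result = data.scanLeft((-1, Set[Int]())) {
  case ((_, acc), (time, id, true))  => (time, acc + id)
  case ((_, acc), (time, id, false)) => (time, acc - id)
}.tail

result.foreach(println)
```

Each output tuple pairs the row's time with the set of ids valid at that moment, matching the expected (time, valid_ids) sequence in the question.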