Question
Getting started with Spark, I would like to know how to flatMap or explode a DataFrame.
It was created using df.groupBy("columName").count and has the following structure if I collect it:
[[Key1, count], [Key2, count2]]
But I would prefer something like:
Map(bar -> 1, foo -> 1, awesome -> 1)
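For reference, here is a minimal sketch that reproduces this setup; the sample words and the local master setting are made-up illustrations, not part of the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
import spark.implicits._

// made-up input data standing in for the real source DataFrame
val df = Seq("bar", "foo", "awesome").toDF("columName")
val counted = df.groupBy("columName").count()
counted.collect() // Array([bar,1], [foo,1], [awesome,1]) -- Rows, not a Map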
What is the right tool to achieve something like this? flatMap, explode, or something else?
Context: I want to use spark-jobserver. It only seems to provide meaningful results (e.g. a working JSON serialization) if I supply the data in the latter form.
Answer
I'm assuming you're calling collect or collectAsList on the DataFrame? That would return an Array[Row] / List[Row].
If so, the easiest way to transform these into maps is to use the underlying RDD, map its records into key-value tuples, and use collectAsMap:
val counted = df.groupBy("columName").count()
// obviously, replace "keyColumn" and "valueColumn" with your actual column names
// (after groupBy("columName").count() they will be "columName" and "count")
val result = counted.rdd
  .map(r => (r.getAs[String]("keyColumn"), r.getAs[Long]("valueColumn")))
  .collectAsMap()
result has type Map[String, Long], as expected.
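If you are on Spark 2.x, a possible alternative (a sketch, not part of the original answer) is to skip the RDD and go through the Dataset API instead; this assumes counted has exactly two columns, the String key followed by the Long count:

// assumes `spark` is the active SparkSession and `counted` is the grouped DataFrame
import spark.implicits._

// tuple encoders resolve columns by position, so this reads each row
// as (key: String, count: Long) and collects it into a local Map
val result = counted.as[(String, Long)].collect().toMap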