Question
Getting started with Spark, I would like to know how to flatMap or explode a DataFrame.
It was created using df.groupBy("columName").count and has the following structure if I collect it:
[[Key1, count], [Key2, count2]]
But I would prefer something like:
Map(bar -> 1, foo -> 1, awesome -> 1)
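For reference, here is a minimal sketch that reproduces this setup; the sample words and the local master setting are made-up illustrations, not part of the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
import spark.implicits._

// made-up input data standing in for the real source DataFrame
val df = Seq("bar", "foo", "awesome").toDF("columName")
val counted = df.groupBy("columName").count()
counted.collect() // Array([bar,1], [foo,1], [awesome,1]) -- Rows, not a Map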
What is the right tool to achieve something like this? flatMap, explode, or something else?
Context: I want to use spark-jobserver. It only seems to provide meaningful results (e.g. a working JSON serialization) if I supply the data in the latter form.
Answer
I'm assuming you're calling collect or collectAsList on the DataFrame? That would return an Array[Row] / List[Row].
If so, the easiest way to transform these into maps is to use the underlying RDD, map its records into key-value tuples, and use collectAsMap:
val counted = df.groupBy("columName").count()
// obviously, replace "keyColumn" and "valueColumn" with your actual column names
// (after groupBy("columName").count() they will be "columName" and "count")
val result = counted.rdd
  .map(r => (r.getAs[String]("keyColumn"), r.getAs[Long]("valueColumn")))
  .collectAsMap()
result has type Map[String, Long], as expected.
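If you are on Spark 2.x, a possible alternative (a sketch, not part of the original answer) is to skip the RDD and go through the Dataset API instead; this assumes counted has exactly two columns, the String key followed by the Long count:

// assumes `spark` is the active SparkSession and `counted` is the grouped DataFrame
import spark.implicits._

// tuple encoders resolve columns by position, so this reads each row
// as (key: String, count: Long) and collects it into a local Map
val result = counted.as[(String, Long)].collect().toMap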