Flattening a nested array in a Spark DataFrame
Problem description
I'm reading in some JSON of the form:
{"a": [{"b": {"c": 1, "d": 2}}]}
That is, the array items are unnecessarily nested. Because this nesting happens inside an array, the answers given in "How to flatten a struct in a Spark dataframe?" don't apply directly.
This is how the dataframe looks when parsed:
root
|-- a: array
| |-- element: struct
| | |-- b: struct
| | | |-- c: integer
| | | |-- d: integer
I'm looking to transform the dataframe into this:
root
|-- a: array
| |-- element: struct
| | |-- b_c: integer
| | |-- b_d: integer
How do I go about aliasing the columns inside the array to effectively unnest it?
Recommended answer
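You can use the transform higher-order function, which applies a lambda to each array element and here rebuilds each element as a flat struct: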
df2 = df.selectExpr("transform(a, x -> struct(x.b.c as b_c, x.b.d as b_d)) as a")
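If the inner struct has more than a couple of fields, writing the expression by hand gets tedious. Here is a minimal sketch of building the same SQL expression from a list of field names. The helper name flatten_array_expr and its parameters are hypothetical, made up for illustration; it only produces the string, and it assumes a Spark version where the SQL transform function is available (2.4 or later).

```python
def flatten_array_expr(array_col, struct_field, leaf_fields):
    """Build a Spark SQL expression that flattens one level of struct
    nesting inside an array column.

    Hypothetical helper (not part of Spark): it only assembles the
    expression string that would be passed to df.selectExpr.
    """
    # Each leaf becomes e.g. "x.b.c as b_c".
    fields = ", ".join(
        f"x.{struct_field}.{leaf} as {struct_field}_{leaf}"
        for leaf in leaf_fields
    )
    # Rebuild every array element as a flat struct, keeping the column name.
    return f"transform({array_col}, x -> struct({fields})) as {array_col}"


expr = flatten_array_expr("a", "b", ["c", "d"])
print(expr)

# With an existing DataFrame `df` (assumed, not created here):
# df2 = df.selectExpr(expr)
```

For the schema in the question this yields the same expression as the one-liner above, so df.selectExpr(expr) should give the b_c / b_d layout. Note that selectExpr with a single expression keeps only that column; any other columns you want to retain would have to be listed alongside it.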