Flattening a nested array in a Spark DataFrame

Question

I'm reading in some JSON of the form:

{"a": [{"b": {"c": 1, "d": 2}}]}

That is, the array items are unnecessarily nested. Because this happens inside an array, the answers given in "How to flatten a struct in a Spark dataframe?" don't apply directly.

This is how the DataFrame looks when parsed:

root
|-- a: array
|    |-- element: struct
|    |    |-- b: struct
|    |    |    |-- c: integer
|    |    |    |-- d: integer

I'm looking to transform the DataFrame into this:

root
|-- a: array
|    |-- element: struct
|    |    |-- b_c: integer
|    |    |-- b_d: integer

How do I go about aliasing the columns inside the array to effectively unnest it?

Recommended answer

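You can use transform, a Spark SQL higher-order function (available since Spark 2.4) that applies a lambda to every element of an array and rebuilds each element as a flat struct: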

df2 = df.selectExpr("transform(a, x -> struct(x.b.c as b_c, x.b.d as b_d)) as a")
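
If the nested struct has more than a couple of fields, writing the expression by hand gets tedious. The following is a minimal sketch of a helper that builds the same transform expression as a string. The function name flatten_array_expr and its parameters are hypothetical, made up for illustration, and are not part of any Spark API. It also assumes plain field names that need no backtick quoting.

```python
def flatten_array_expr(array_col, struct_col, fields, var="x"):
    """Build a Spark SQL `transform` expression that lifts the fields of a
    nested struct up one level inside every array element.

    Hypothetical helper: the name and parameters are illustrative only.
    Assumes simple field names (no dots, spaces, or reserved words).
    """
    # One "x.b.c as b_c" projection per nested field.
    projections = ", ".join(
        f"{var}.{struct_col}.{f} as {struct_col}_{f}" for f in fields
    )
    # transform(...) applies the lambda to each element of the array column.
    return f"transform({array_col}, {var} -> struct({projections})) as {array_col}"


expr = flatten_array_expr("a", "b", ["c", "d"])
print(expr)
```

For the schema in the question this produces the same expression string as the answer above, so it can be passed straight to selectExpr: df2 = df.selectExpr(flatten_array_expr("a", "b", ["c", "d"])).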

09-02 23:21