This article describes how to extract sub-elements from a DataFrame column in Spark 2; the question and the recommended answer follow for reference.

Problem description

Given a DataFrame like this:

df_products =

+----------+--------------------+
|product_PK|            products|
+----------+--------------------+
|       111|[[222,66],[333,55...|
|       222|[[333,24],[444,77...|
...
+----------+--------------------+

How can I transform it into the following DataFrame:

df_products =

+----------+--------------------+------+
|product_PK|      rec_product_PK|  rank|
+----------+--------------------+------+
|       111|                 222|    66|
|       111|                 333|    55|
|       222|                 333|    24|
|       222|                 444|    77|
...
+----------+--------------------+------+

Recommended answer

You basically have two steps here: first, explode the arrays (using the explode function) to get one row for each value in the array, and then flatten each element's fields into separate columns.

You did not include the schema, so the internal structure of each element in the array is not clear; however, I would assume it is something like a struct with two fields.
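
For reference, here is a minimal sketch of how such an input could be built for testing. It assumes a SparkSession named spark and that each array element is a two-field struct; the values mirror the sample table above, but the schema itself is an assumption:

import spark.implicits._

// Hypothetical input mirroring the sample data; each products entry is an
// array of (recommended product, rank) pairs stored as a two-field struct
val df = Seq(
  (111, Seq((222, 66), (333, 55))),
  (222, Seq((333, 24), (444, 77)))
).toDF("product_PK", "products")

df.printSchema()
// root
//  |-- product_PK: integer (nullable = false)
//  |-- products: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- _1: integer (nullable = false)
//  |    |    |-- _2: integer (nullable = false)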

This means you would do something like this:

import org.apache.spark.sql.functions.explode

// One row per element of the products array
val df1 = df.withColumn("array_elem", explode(df("products")))
// Flatten the struct fields of each array element into top-level columns
val df2 = df1.select("product_PK", "array_elem.*")

Now all you have to do is rename the columns to the names you need (rec_product_PK and rank).
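
As a minimal sketch of that renaming step: assuming the struct fields surface as _1 and _2 (an assumption based on the tuple-style sample above; check df2.printSchema() for the actual names), you could do:

// Rename the flattened struct fields to the desired column names
// ("_1" and "_2" are assumed field names; adjust them to your actual schema)
val result = df2
  .withColumnRenamed("_1", "rec_product_PK")
  .withColumnRenamed("_2", "rank")

result.show()  // yields product_PK, rec_product_PK and rank, as in the target table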

