Renaming nested elements in a Scala Spark DataFrame
Question
I have a Spark Scala dataframe with a nested structure:
|-- _History: struct (nullable = true)
| |-- Article: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Id: string (nullable = true)
| | | |-- Timestamp: long (nullable = true)
| |-- Channel: struct (nullable = true)
| | |-- <font><font>Cultura pop</font></font>: array (nullable = true)
| | | |-- element: long (containsNull = true)
| | |-- <font><font>Deportes</font></font>: array (nullable = true)
| | | |-- element: long (containsNull = true)
I'm trying to rename the nested elements (e.g. <font><font>Deportes</font></font> to Deportes). Is there a way to do this using a UDF or something similar?
I've tried the following, which doesn't work:
var filterDF2 = filterDF
.withColumnRenamed("_History.Channel.<font><font>Deportes</font></font>", "_History.Channel.Deportes")
Recommended answer
The simplest approach is to cast the column to a type described by a properly named schema string (or an equivalent StructField definition):
val schema = """struct<
  Article: array<struct<Id:string,Timestamp:bigint>>,
  Channel: struct<Cultura: array<bigint>, Deportes: array<bigint>>>"""
df.withColumn("_History", $"_History".cast(schema))
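The "equivalent StructField definition" mentioned above could look roughly like the following. This is a minimal sketch meant for spark-shell, not a definitive implementation: the helper historyType, the local SparkSession settings, and the single sample row are assumptions made only so the example runs on its own. It relies on struct-to-struct casts matching fields by position, so only the names change.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Hypothetical helper: the _History struct type, parameterised by the two Channel field names.
def historyType(cultura: String, deportes: String): StructType = StructType(Seq(
  StructField("Article", ArrayType(StructType(Seq(
    StructField("Id", StringType),
    StructField("Timestamp", LongType))))),
  StructField("Channel", StructType(Seq(
    StructField(cultura, ArrayType(LongType)),
    StructField(deportes, ArrayType(LongType)))))))

// Assumption: a local session, used only to make the sketch self-contained.
val spark = SparkSession.builder().master("local[1]").appName("rename-nested").getOrCreate()

// One made-up row whose nested field names mimic the schema in the question.
val badType = historyType("<font><font>Cultura pop</font></font>", "<font><font>Deportes</font></font>")
val rows = java.util.Arrays.asList(Row(Row(Seq(Row("a1", 1L)), Row(Seq(1L, 2L), Seq(3L)))))
val df = spark.createDataFrame(rows, StructType(Seq(StructField("_History", badType))))

// Cast to the same shape with clean names; the data itself is left untouched.
val renamed = df.withColumn("_History", col("_History").cast(historyType("Cultura", "Deportes")))
renamed.printSchema()
```

The target type has to keep the same number of fields, in the same order and with compatible types, as the original struct. Otherwise the cast is rejected during analysis.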
You could also model this with case classes and a UDF:
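import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}

// The case class field names become the new nested column names.
case class ChannelRecord(Cultura: Option[Seq[Long]], Deportes: Option[Seq[Long]])

// Fields are read by position, so the original names do not matter.
val rename = udf((row: Row) =>
  ChannelRecord(Option(row.getSeq[Long](0)), Option(row.getSeq[Long](1))))

df.withColumn("_History",
  struct($"_History.Article", rename($"_History.Channel").alias("Channel")))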
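If many nested names need cleaning, the target type for the cast can be derived from the existing schema instead of being written by hand. The sketch below is one possible way to do that, not part of the original answer: cleanName and cleanType are hypothetical helpers, and the regular expression assumes the unwanted parts of a name always look like HTML tags.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: strips anything that looks like an HTML tag from a field name.
def cleanName(name: String): String = name.replaceAll("<[^>]+>", "").trim

// Hypothetical helper: rebuilds a data type with cleaned field names,
// recursing through structs, arrays, and maps and leaving other types as they are.
def cleanType(dt: DataType): DataType = dt match {
  case s: StructType =>
    StructType(s.fields.map(f =>
      f.copy(name = cleanName(f.name), dataType = cleanType(f.dataType))))
  case a: ArrayType => a.copy(elementType = cleanType(a.elementType))
  case m: MapType =>
    m.copy(keyType = cleanType(m.keyType), valueType = cleanType(m.valueType))
  case other => other
}
```

With these in scope, something like df.withColumn("_History", $"_History".cast(cleanType(df.schema("_History").dataType))) should apply the cleaned names in one step. Note that a name such as <font><font>Cultura pop</font></font> becomes "Cultura pop" with the space kept, and that two fields which clean to the same name would end up as ambiguous duplicates.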