Note: the solution below is written in Scala. According to the original answer, step 2 also works in Python once the val keyword is removed.
Problem description
I have a pyspark dataframe which contains a column "student" as follows:
"student" : {
"name" : "kaleem",
"rollno" : "12"
}
The schema of this column in the dataframe is:
structType(List(
name: String,
rollno: String))
I need to change this column to:
"student" : {
"student_details" : {
"name" : "kaleem",
"rollno" : "12"
}
}
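As a rough illustration of the reshaping being asked for, here is a minimal sketch on a single JSON record in plain Python (no Spark involved). The variable names `record` and `reshaped` are made up for this example; the only point is that the existing value is wrapped under a new `student_details` key while its inner fields stay untouched.

```python
import json

# One record in the current shape
record = json.loads('{"student": {"name": "kaleem", "rollno": "12"}}')

# Wrap the existing value under the new key; the inner fields are not modified
reshaped = {"student": {"student_details": record["student"]}}

print(json.dumps(reshaped))
```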
The schema of the column in the dataframe must then be:
structType(List(
student_details:
structType(List(
name: String,
rollno: String))
))
How can I do this in Spark?
Solution
Use the named_struct SQL function to achieve this:
1. Read the JSON into a dataframe column
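// Needed for toDS(); assumes a SparkSession named spark (as in spark-shell)
import spark.implicits._

val data =
  """
    | {
    |   "student": {
    |     "name": "kaleem",
    |     "rollno": "12"
    |   }
    | }
  """.stripMargin
// Read the JSON string as a one-row dataframe with a struct column "student"
val df = spark.read.json(Seq(data).toDS())
df.show(false)
println(df.schema("student"))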
Output:
+------------+
|student |
+------------+
|[kaleem, 12]|
+------------+
StructField(student,StructType(StructField(name,StringType,true), StructField(rollno,StringType,true)),true)
2. Change the schema using named_struct
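import org.apache.spark.sql.functions.expr

// Wrap the existing struct in a new struct whose only field is student_details
val processedDf = df.withColumn("student",
  expr("named_struct('student_details', student)")
)
processedDf.show(false)
println(processedDf.schema("student"))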
Output:
+--------------+
|student |
+--------------+
|[[kaleem, 12]]|
+--------------+
StructField(student,StructType(StructField(student_details,StructType(StructField(name,StringType,true), StructField(rollno,StringType,true)),true)),false)