Problem Description
I have the following DataFrame in Spark:
val test = sqlContext.read.json(path = "/path/to/jsonfiles/*")
test.printSchema
root
|-- properties: struct (nullable = true)
| |-- prop_1: string (nullable = true)
| |-- prop_2: string (nullable = true)
| |-- prop_3: boolean (nullable = true)
| |-- prop_4: long (nullable = true)
...
What I would like to do is flatten this DataFrame so that prop_1 ... prop_n exist at the top level, i.e.
test.printSchema
root
|-- prop_1: string (nullable = true)
|-- prop_2: string (nullable = true)
|-- prop_3: boolean (nullable = true)
|-- prop_4: long (nullable = true)
...
There are several solutions to similar problems. The best I can find is posed here. However, the solution only works if properties is of type Array. In my case, properties is of type StructType.
Another approach would be:
test.registerTempTable("test")
val test2 = sqlContext.sql("""SELECT properties.prop_1, ... FROM test""")
But in this case I have to explicitly specify each column, and that is inelegant.
What is the best way to solve this problem?
Recommended Answer
If you're not looking for a recursive solution, then in Spark 1.6+ the dot syntax with a star should work just fine:
val df = sqlContext.read.json(sc.parallelize(Seq(
"""{"properties": {
"prop1": "foo", "prop2": "bar", "prop3": true, "prop4": 1}}"""
)))
df.select($"properties.*").printSchema
// root
// |-- prop1: string (nullable = true)
// |-- prop2: string (nullable = true)
// |-- prop3: boolean (nullable = true)
// |-- prop4: long (nullable = true)
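The same dot syntax also works for individual nested fields, which is handy when you only need a subset. A small usage sketch against the df above:

df.select($"properties.prop1", $"properties.prop3").printSchema
// root
// |-- prop1: string (nullable = true)
// |-- prop3: boolean (nullable = true)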
Unfortunately this doesn't work in 1.5 and earlier.
In a case like this you can simply extract the required information directly from the schema. You'll find one example in Dropping a nested column from Spark DataFrame, which should be easy to adjust to fit this scenario, and another one (recursive schema flattening in Python) in Pyspark: Map a SchemaRDD into a SchemaRDD.
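As a rough sketch of that schema-driven approach (the helper name flattenSchema and the underscore-joined aliases are assumptions of mine, not taken from the linked answers), you can walk the StructType recursively and build a flat list of dotted column references to pass to select:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

// Recursively collect one Column per leaf field, addressing nested
// fields with dot notation and aliasing them with underscore-joined
// names to avoid collisions at the top level.
def flattenSchema(schema: StructType, prefix: String = ""): Array[Column] =
  schema.fields.flatMap {
    case StructField(name, inner: StructType, _, _) =>
      flattenSchema(inner, s"$prefix$name.")
    case StructField(name, _, _, _) =>
      Array(col(prefix + name).alias((prefix + name).replace(".", "_")))
  }

df.select(flattenSchema(df.schema): _*).printSchema
// root
// |-- properties_prop1: string (nullable = true)
// ...

Since this only reads the schema and emits ordinary column expressions rather than relying on star expansion, it should also work on 1.5.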