Problem description
I have JSON data like the following:
{"Id":11,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000},{"package":"com.browser7","activetime":1205000}]}
{"Id":12,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000}]}
......
This JSON records the activation time of apps. The goal is to compute the total activation time of each app.
I use Spark SQL (from Scala) to parse the JSON:
val sqlContext = sc.sqlContext
val behavior = sqlContext.read.json("behavior-json.log")
behavior.cache()
behavior.createOrReplaceTempView("behavior")
val appActiveTime = sqlContext.sql("SELECT data FROM behavior") // SQL query
appActiveTime.show(100100) // print the DataFrame
appActiveTime.rdd.foreach(println) // print the RDD
But the printed DataFrame looks like this:
+-------------------------------------------------------------------------+
|                                                                     data|
+-------------------------------------------------------------------------+
|                         [[60000, com.browser1], [12870000, com.browser]]|
|                           [[60000, com.browser1], [120000, com.browser]]|
|                           [[60000, com.browser1], [120000, com.browser]]|
|                          [[60000, com.browser1], [1207000, com.browser]]|
|                                                  [[120000, com.browser]]|
|                         [[60000, com.browser1], [1204000, com.browser5]]|
|                         [[60000, com.browser1], [12075000, com.browser]]|
|                           [[60000, com.browser1], [120000, com.browser]]|
|                          [[60000, com.browser1], [1204000, com.browser]]|
|                           [[60000, com.browser1], [120000, com.browser]]|
|                          [[60000, com.browser1], [1201000, com.browser]]|
|                                                [[1200400, com.browser5]]|
|                          [[60000, com.browser1], [1200400, com.browser]]|
|[[60000, com.browser1], [1205000, com.browser6], [1205000, com.browser7]]|
+-------------------------------------------------------------------------+
The RDD looks like this:
[WrappedArray([60000, com.browser1], [60000, com.browser1])]
[WrappedArray([120000, com.browser])]
[WrappedArray([60000, com.browser1], [1204000, com.browser5])]
[WrappedArray([12075000, com.browser], [12075000, com.browser])]
I want to turn the data into:
com.browser1 60000
com.browser1 60000
com.browser 12075000
com.browser 12075000
...
I want to turn each array element in every row of the RDD into its own row. Any other structure that is easy to analyze would also be fine.
I am still learning Spark and Scala, and I have tried for a long time without success, so I would appreciate any guidance.
Recommended answer
From your given JSON data, you can view the schema of your DataFrame with printSchema and use it:
appActiveTime.printSchema()
root
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- activetime: long (nullable = true)
 |    |    |-- package: string (nullable = true)
Since data is an array, you need to explode it and then select the struct fields, as below:
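import org.apache.spark.sql.functions._
// The $"..." column syntax relies on the session's implicits (already imported in spark-shell).
appActiveTime.withColumn("data", explode($"data")) // one row per array element
  .select("data.*")                                // flatten the struct into columns
  .show(false)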
Output:
+----------+------------+
|activetime|     package|
+----------+------------+
|     60000|com.browser1|
|   1205000|com.browser6|
|   1205000|com.browser7|
|     60000|com.browser1|
|   1205000|com.browser6|
+----------+------------+
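Since the stated goal is the total activation time per app, the exploded rows can then be grouped by package and summed. The following is only a minimal sketch, not part of the original answer: it assumes Spark 2.2 or later (for reading JSON from a Dataset[String]), creates its own local SparkSession, and uses the two sample records from the question instead of behavior-json.log. The object name TotalActiveTime, the alias entry, and the column name total_activetime are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, sum}

// Hypothetical object name, used only for this sketch.
object TotalActiveTime {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .appName("total-active-time")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The two sample records from the question, one JSON object per line.
    val lines = Seq(
      """{"Id":11,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000},{"package":"com.browser7","activetime":1205000}]}""",
      """{"Id":12,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000}]}"""
    ).toDS()

    val behavior = spark.read.json(lines)

    // One row per array element, with the struct fields as columns.
    val flattened = behavior
      .select(explode(col("data")).as("entry"))
      .select(col("entry.package"), col("entry.activetime"))

    // Sum the activation time for each package.
    val totals = flattened
      .groupBy("package")
      .agg(sum("activetime").as("total_activetime"))
      .orderBy("package")

    totals.show(false)

    // The same aggregation written in Spark SQL with LATERAL VIEW explode.
    behavior.createOrReplaceTempView("behavior")
    spark.sql(
      """SELECT entry.package, SUM(entry.activetime) AS total_activetime
        |FROM behavior
        |LATERAL VIEW explode(data) t AS entry
        |GROUP BY entry.package""".stripMargin
    ).show(false)

    spark.stop()
  }
}
```

For the two sample records this should give 120000 for com.browser1, 2410000 for com.browser6, and 1205000 for com.browser7. The DataFrame version and the SQL version are interchangeable; which one to use is mostly a matter of taste.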
Hope this helps!