Problem Description
I have the following JSON data:
{"Id":11,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000},{"package":"com.browser7","activetime":1205000}]}
{"Id":12,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000}]}
......
This JSON records the active time of apps; the goal is to compute the total active time of each app.
I use Spark SQL to parse the JSON:
val sqlContext = spark.sqlContext   // spark is the SparkSession provided by spark-shell
val behavior = sqlContext.read.json("behavior-json.log")
behavior.cache()
behavior.createOrReplaceTempView("behavior")
val appActiveTime = sqlContext.sql("SELECT data FROM behavior") // SQL query
appActiveTime.show(100, 100)        // print the DataFrame
appActiveTime.rdd.foreach(println)  // print the RDD
But the printed DataFrame looks like this:
+----------------------------------------------------------------------+
| data|
+----------------------------------------------------------------------+
| [[60000, com.browser1], [12870000, com.browser]]|
| [[60000, com.browser1], [120000, com.browser]]|
| [[60000, com.browser1], [120000, com.browser]]|
| [[60000, com.browser1], [1207000, com.browser]]|
| [[120000, com.browser]]|
| [[60000, com.browser1], [1204000, com.browser5]]|
| [[60000, com.browser1], [12075000, com.browser]]|
| [[60000, com.browser1], [120000, com.browser]]|
| [[60000, com.browser1], [1204000, com.browser]]|
| [[60000, com.browser1], [120000, com.browser]]|
| [[60000, com.browser1], [1201000, com.browser]]|
| [[1200400, com.browser5]]|
| [[60000, com.browser1], [1200400, com.browser]]|
|[[60000, com.browser1], [1205000, com.browser6], [1205000, com.browser7]]|
And the RDD looks like this:
[WrappedArray([60000, com.browser1], [60000, com.browser1])]
[WrappedArray([120000, com.browser])]
[WrappedArray([60000, com.browser1], [1204000, com.browser5])]
[WrappedArray([12075000, com.browser], [12075000, com.browser])]
I want to convert the data into:
com.browser1 60000
com.browser1 60000
com.browser 12075000
com.browser 12075000
...
I want to turn each array element of every row in the RDD into its own row; of course, any other structure that is easy to analyze would also be fine. I have only just started learning Spark and Scala, so I have tried for a long time without success, and I hope you can guide me.
Recommended Answer
From your given JSON data you can view the schema of your dataframe with printSchema and use it:
appActiveTime.printSchema()
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- activetime: long (nullable = true)
| | |-- package: string (nullable = true)
Since you have an array, you need to explode the data and then select the struct's fields as below:
import org.apache.spark.sql.functions._
import sqlContext.implicits._   // needed for the $"data" column syntax

appActiveTime.withColumn("data", explode($"data"))
             .select("data.*")
             .show(false)
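If you prefer to stay in SQL, the same flattening can also be written directly against the behavior temp view using LATERAL VIEW explode. The query below is a sketch of that alternative; the aliases exploded and d, and the val name flattened, are introduced here purely for illustration.

// Equivalent flattening in Spark SQL: LATERAL VIEW explode turns each
// element of the data array into its own row, exposed as the struct alias d.
val flattened = sqlContext.sql(
  """
    |SELECT d.activetime, d.package
    |FROM behavior
    |LATERAL VIEW explode(data) exploded AS d
  """.stripMargin)

flattened.show(false)

Both versions produce the same flattened rows.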
Output:
+----------+------------+
|activetime| package|
+----------+------------+
| 60000|com.browser1|
| 1205000|com.browser6|
| 1205000|com.browser7|
| 60000|com.browser1|
| 1205000|com.browser6|
+----------+------------+
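With the array flattened, the original goal (the total active time of each app) is just a group-by and sum on top of it. The snippet below is a minimal sketch that reuses the imports above; the name perApp and the column alias total_activetime are illustrative, not part of the original answer.

// Total active time per package, summed across all records.
val perApp = appActiveTime
  .withColumn("data", explode($"data"))
  .select("data.*")
  .groupBy("package")
  .agg(sum("activetime").as("total_activetime"))

perApp.show(false)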
Hope this helps!