Problem Description
I have the following code as a Spark driver. When I execute my program it works properly, saving the required data as a Parquet file:
String indexFile = "index.txt";
JavaRDD<String> indexData = sc.textFile(indexFile).cache();
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
    @Override
    public String call(String patientId) throws Exception {
        return "json array as string";
    }
});

// 1. Read JSON string array into a DataFrame (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);

// 2. Save DataFrame as Parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");
But I observed that my mapper function on the RDD indexData is getting executed twice: first, when I read jsonStringRDD as a DataFrame using SQLContext, and second, when I write dataSchemaDF to the Parquet file.
Can you guide me on how to avoid this repeated execution? Is there a better way of converting a JSON string into a DataFrame?
Recommended Answer
I believe that the reason is the lack of a schema for the JSON reader. When you execute:
sqlContext.read().json(jsonStringRDD);
Spark has to infer the schema for the newly created DataFrame. To do that, it has to scan the input RDD, and this step is performed eagerly.
If you want to avoid it, you have to create a StructType which describes the shape of the JSON documents:
StructType schema;
...
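For illustration, here is a minimal sketch of how such a schema could be built with the Java DataTypes factory. The field names and types below (patientId, name, age) are hypothetical; replace them with the actual shape of your JSON documents.

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical schema: adjust field names and types to match the real JSON documents.
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("patientId", DataTypes.StringType, true),
    DataTypes.createStructField("name", DataTypes.StringType, true),
    DataTypes.createStructField("age", DataTypes.IntegerType, true)
});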
and use it when you create the DataFrame:
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);