Spark Java map function is getting executed twice



I have the code below as my Spark driver. When I execute my program, it works properly and saves the required data as a Parquet file.

      String indexFile = "index.txt";
      JavaRDD<String> indexData = sc.textFile(indexFile).cache();
      JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
        @Override
        public String call(String patientId) throws Exception {
          return "json array as string";
        }
      });

      // 1. Read the JSON string array into a DataFrame (execution 1)
      DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);
      // 2. Save the DataFrame as a Parquet file (execution 2)

But I observed that my mapper function on the RDD indexData is getting executed twice: first, when I read jsonStringRDD as a DataFrame using SQLContext; second, when I write dataSchemaDF to the Parquet file.


Can you guide me on how to avoid this repeated execution? Is there a better way of converting a JSON string into a DataFrame?



I believe the reason is the lack of a schema for the JSON reader. When you execute:

      sqlContext.read().json(jsonStringRDD);


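Spark has to infer the schema for the newly created DataFrame. To do that, it has to scan the input RDD, and this step is performed eagerly. That scan is the first execution of your map function; the second happens when the DataFrame is written out.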

If you want to avoid it, you have to create a StructType that describes the shape of the JSON documents:

StructType schema;

and use it when you create the DataFrame:

DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
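
To make the mechanism concrete, here is a minimal plain-Java sketch that imitates the behaviour described above without using Spark at all. Everything in it is an assumption made for illustration: the class name LazyMapDemo, the methods inferSchema and write, and the "record-" mapper are hypothetical stand-ins, and a Supplier of a Stream only roughly models a lazily evaluated RDD. It counts mapper calls when an eager inference pass precedes the write, and when the schema is already known so that only the write runs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyMapDemo {
    // Counts how many times the mapper function is invoked.
    static final AtomicInteger calls = new AtomicInteger();

    // Hypothetical stand-in for a lazy RDD: every get() re-runs the mapper
    // over the whole source, because nothing is materialized.
    static Supplier<Stream<String>> mapped(List<String> ids) {
        return () -> ids.stream().map(id -> {
            calls.incrementAndGet();
            return "record-" + id;
        });
    }

    // Stand-in for schema inference: an eager full pass over the data.
    static int inferSchema(Supplier<Stream<String>> data) {
        return data.get().mapToInt(String::length).max().orElse(0);
    }

    // Stand-in for the write action: another full pass over the data.
    static List<String> write(Supplier<Stream<String>> data) {
        return data.get().collect(Collectors.toList());
    }

    // No schema supplied: inference pass plus write pass.
    static int runWithoutSchema(List<String> ids) {
        calls.set(0);
        Supplier<Stream<String>> data = mapped(ids);
        inferSchema(data);
        write(data);
        return calls.get();
    }

    // Schema known up front: only the write pass touches the data.
    static int runWithSchema(List<String> ids) {
        calls.set(0);
        Supplier<Stream<String>> data = mapped(ids);
        write(data);
        return calls.get();
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("p1", "p2", "p3");
        System.out.println("mapper calls without schema: " + runWithoutSchema(ids));
        System.out.println("mapper calls with schema:    " + runWithSchema(ids));
    }
}
```

With three input ids, the run without a schema invokes the mapper six times and the run with a schema invokes it three times, which mirrors the two executions versus one execution discussed in the answer.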
