I am loading MongoDB data into a Hive table, and I am trying to work around the unsupported NullType error that saveAsTable raises.
Sample data schema:

root
 |-- level1: struct (nullable = true)
 |    |-- level2: struct (nullable = true)
 |    |    |-- level3_1: null (nullable = true)
 |    |    |-- level3_2: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- level4: null (nullable = true)


I tried functions.lit, like this:

df = df.withColumn("level1.level2.level3_1", functions.lit("null").cast("string"))
       .withColumn("level1.level2.level3_2.level4", functions.lit("null").cast("string"));


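But the result looks like this (the dotted names became new top-level columns instead of replacing the nested fields):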

root
 |-- level1: struct (nullable = true)
 |    |-- level2: struct (nullable = true)
 |    |    |-- level3_1: null (nullable = true)
 |    |    |-- level3_2: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- level4: null (nullable = true)
 |-- level1.level2.level3_1: string (nullable = false)
 |-- level1.level2.level3_2.level4: string (nullable = false)


I also looked at df.na().fill(), but it does not seem to change the schema.

The desired result is:

root
 |-- level1: struct (nullable = true)
 |    |-- level2: struct (nullable = true)
 |    |    |-- level3_1: string (nullable = true)
 |    |    |-- level3_2: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- level4: string (nullable = true)


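so that I can write the loaded MongoDB data to Hive with saveAsTable.

Has anyone looked into this and can give me some advice on how to cast a nested NullType in Java, or how to handle NullType in general? I am looking for a systematic, generic solution that scales to more complex data.
Many thanks.

Best answer

One idea is to define the schema yourself, using StringType where NullType was inferred, and read the data with that schema.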

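import static org.apache.spark.sql.types.DataTypes.*;

import java.util.ArrayList;
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

// Same shape as the source data, with StringType where NullType was inferred.
StructType schema = createStructType(Arrays.asList(
    createStructField("level1", createStructType(Arrays.asList(
        createStructField("level2", createStructType(Arrays.asList(
            createStructField("level3_1", StringType, true),
            createStructField("level3_2", createArrayType(createStructType(Arrays.asList(
                createStructField("level4", StringType, true)))), true)
            )), true))), true)));

// ss is your SparkSession. Replace new ArrayList<>() with your own rows.
Dataset<Row> df = ss.createDataFrame(new ArrayList<>(), schema);
df.printSchema();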


root
 |-- level1: struct (nullable = true)
 |    |-- level2: struct (nullable = true)
 |    |    |-- level3_1: string (nullable = true)
 |    |    |-- level3_2: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- level4: string (nullable = true)
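
Writing the schema by hand gets tedious for deeper documents. As a rough sketch of how the same idea could be generalized, the target schema can be derived from the inferred one by walking it recursively and swapping every NullType for StringType. The class NullTypeReplacer and its method replaceNullType are hypothetical names introduced here for illustration, not part of Spark; only the DataTypes factory methods and the type accessors are Spark APIs.

```java
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.MapType;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper (not part of Spark): returns a copy of a type in which
// every NullType, at any nesting depth, is replaced by StringType.
public final class NullTypeReplacer {
    private NullTypeReplacer() {}

    public static DataType replaceNullType(DataType type) {
        if (DataTypes.NullType.equals(type)) {
            return DataTypes.StringType;
        }
        if (type instanceof StructType) {
            StructField[] fields = ((StructType) type).fields();
            StructField[] replaced = new StructField[fields.length];
            for (int i = 0; i < fields.length; i++) {
                StructField f = fields[i];
                // Keep name, nullability and metadata; only the data type changes.
                replaced[i] = DataTypes.createStructField(
                    f.name(), replaceNullType(f.dataType()), f.nullable(), f.metadata());
            }
            return DataTypes.createStructType(replaced);
        }
        if (type instanceof ArrayType) {
            ArrayType array = (ArrayType) type;
            return DataTypes.createArrayType(
                replaceNullType(array.elementType()), array.containsNull());
        }
        if (type instanceof MapType) {
            MapType map = (MapType) type;
            return DataTypes.createMapType(
                replaceNullType(map.keyType()),
                replaceNullType(map.valueType()),
                map.valueContainsNull());
        }
        // Primitive and other types are returned unchanged.
        return type;
    }
}
```

Calling (StructType) NullTypeReplacer.replaceNullType(df.schema()) on the inferred schema should then give a schema like the one printed above without spelling out each level. StringType is an arbitrary choice of replacement; any type Hive accepts could be substituted.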




Edit:

I have added a more concrete example here to show what I mean. I hope it helps.

// Assumes static imports of org.apache.spark.sql.functions.lit
// and org.apache.spark.sql.types.DataTypes.*
@Test
public void test() {
    SparkSession ss = SparkSession.builder().master("local").appName("test").getOrCreate();

    // Step 1) Read your MongoDB data. (A NullType field 'level' is added manually here for explanation.)
    // https://docs.mongodb.com/spark-connector/master/python/read-from-mongodb/
    Dataset<Row> data = ss.read().json("test.json").withColumn("level", lit(null));
    data.printSchema();

    StructType schema = createStructType(Arrays.asList(
        createStructField("_id", LongType, true),
        createStructField("level", StringType, true)));

    // Step 2) Create newData using the schema you defined.
    // Note: collectAsList() brings every row to the driver, so this suits small data only.
    Dataset<Row> newData = ss.createDataFrame(data.collectAsList(), schema);
    newData.printSchema();

    // Step 3) Load newData into Hive.
}
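
Step 2 goes through collectAsList(), which may not be practical for a large collection. One possible alternative, sketched below, is to keep the work inside Spark by casting each top-level column to its type in the target schema. This relies on the assumption that Spark's cast accepts NullType as a source for any target type and casts structs and arrays element by element when the field count matches; that is my understanding of the cast rules, so check it against your Spark version. SchemaCaster, castColumns and applySchema are hypothetical names, not Spark APIs.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper (not part of Spark): applies a target schema by casting
// columns, without collecting any rows to the driver.
public final class SchemaCaster {
    private SchemaCaster() {}

    // Builds one cast expression per top-level field of the target schema.
    public static Column[] castColumns(StructType target) {
        StructField[] fields = target.fields();
        Column[] columns = new Column[fields.length];
        for (int i = 0; i < fields.length; i++) {
            String name = fields[i].name();
            // Backticks keep a name that contains dots from being read as a nested path.
            columns[i] = col("`" + name + "`").cast(fields[i].dataType()).as(name);
        }
        return columns;
    }

    // Assumes target has the same field names and nesting as df, differing only
    // in the types that replace NullType.
    public static Dataset<Row> applySchema(Dataset<Row> df, StructType target) {
        return df.select(castColumns(target));
    }
}
```

With the variables from the test above, SchemaCaster.applySchema(data, schema) would stand in for the createDataFrame call in Step 2. The target schema should keep the nullability of the source fields, since a cast is not expected to turn a nullable field into a non-nullable one.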

On the topic "java - How to cast a nested struct in Java Spark (unsupported NullType)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61049642/

10-09 08:12