I am loading MongoDB data into a Hive table, and I am trying to work around the unsupported NullType error that occurs on saveAsTable.
Sample data schema:
root
|-- level1: struct (nullable = true)
| |-- level2: struct (nullable = true)
| | |-- level3_1: null (nullable = true)
| | |-- level3_2: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- level4: null (nullable = true)
I tried functions.lit, like this:
df = df.withColumn("level1.level2.level3_1", functions.lit("null").cast("string"))
       .withColumn("level1.level2.level3_2.level4", functions.lit("null").cast("string"));
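But the result looks like this (the dotted names were added as new top-level columns, and the nested fields are unchanged):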
root
|-- level1: struct (nullable = true)
| |-- level2: struct (nullable = true)
| | |-- level3_1: null (nullable = true)
| | |-- level3_2: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- level4: null (nullable = true)
|-- level1.level2.level3_1: string (nullable = false)
|-- level1.level2.level3_2.level4: string (nullable = false)
I also looked at df.na().fill(), but it does not seem to change the schema.
The desired result is:
root
|-- level1: struct (nullable = true)
| |-- level2: struct (nullable = true)
| | |-- level3_1: string (nullable = true)
| | |-- level3_2: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- level4: string (nullable = true)
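so that I can write the loaded MongoDB data to Hive with saveAsTable.
Has anyone worked on this and can give me some advice on how to cast a nested NullType in Java, or on how to handle NullType in general? I am looking for a systematic, generic solution that can scale to more complex data.
Thanks a lot.
Best answer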
One idea is to create a schema that uses StringType in place of NullType and read the data with that schema.
import static org.apache.spark.sql.types.DataTypes.*;
import java.util.ArrayList;
import java.util.Arrays;

// Same shape as the sample schema, with StringType where the source had NullType.
StructType schema = createStructType(Arrays.asList(
    createStructField("level1", createStructType(Arrays.asList(
        createStructField("level2", createStructType(Arrays.asList(
            createStructField("level3_1", StringType, true),
            createStructField("level3_2", createArrayType(createStructType(Arrays.asList(
                createStructField("level4", StringType, true)))), true)
        )), true))), true)));
// Replace new ArrayList<>() with your own rows.
Dataset<Row> df = ss.createDataFrame(new ArrayList<>(), schema);
df.printSchema();
root
|-- level1: struct (nullable = true)
| |-- level2: struct (nullable = true)
| | |-- level3_1: string (nullable = true)
| | |-- level3_2: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- level4: string (nullable = true)
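For the generic case the question asks about, the target schema does not have to be written by hand. The sketch below derives it from the existing schema by walking it recursively and swapping every NullType for StringType, then casts each top-level column to the derived type. NullTypeReplacer, replaceNullType and castNullTypes are hypothetical names introduced here for illustration, not part of Spark, and the approach assumes that Spark's cast accepts a struct-to-struct or array-to-array cast when only NullType fields change, which I believe holds but is worth verifying on your Spark version.

```java
import java.util.Arrays;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.MapType;
import org.apache.spark.sql.types.NullType;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper class, not part of Spark.
public final class NullTypeReplacer {
    private NullTypeReplacer() {}

    /** Returns a copy of the given type in which every NullType is replaced by StringType. */
    public static DataType replaceNullType(DataType type) {
        if (type instanceof NullType) {
            return DataTypes.StringType;
        }
        if (type instanceof StructType) {
            StructField[] fields = ((StructType) type).fields();
            StructField[] replaced = new StructField[fields.length];
            for (int i = 0; i < fields.length; i++) {
                StructField f = fields[i];
                // Keep name, nullability and metadata; only the data type is rewritten.
                replaced[i] = DataTypes.createStructField(
                        f.name(), replaceNullType(f.dataType()), f.nullable(), f.metadata());
            }
            return DataTypes.createStructType(replaced);
        }
        if (type instanceof ArrayType) {
            ArrayType a = (ArrayType) type;
            return DataTypes.createArrayType(replaceNullType(a.elementType()), a.containsNull());
        }
        if (type instanceof MapType) {
            MapType m = (MapType) type;
            return DataTypes.createMapType(
                    replaceNullType(m.keyType()), replaceNullType(m.valueType()), m.valueContainsNull());
        }
        // Leaf types other than NullType are returned unchanged.
        return type;
    }

    /** Casts every top-level column to its NullType-free counterpart. */
    public static Dataset<Row> castNullTypes(Dataset<Row> df) {
        Column[] columns = Arrays.stream(df.schema().fields())
                // Backticks keep a column name that contains dots from being read as a nested path.
                .map(f -> functions.col("`" + f.name() + "`")
                        .cast(replaceNullType(f.dataType()))
                        .alias(f.name()))
                .toArray(Column[]::new);
        return df.select(columns);
    }
}
```

With this in place, something like NullTypeReplacer.castNullTypes(df).write().saveAsTable("my_db.my_table") would be the intended use, where my_db.my_table is a placeholder table name. StringType is an arbitrary choice of replacement; any type Hive accepts would do, since the affected fields hold only nulls.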
Edit:
I have added a more concrete example here to show what I mean. I hope it helps.
// Assumes static imports of org.apache.spark.sql.functions.lit and org.apache.spark.sql.types.DataTypes.*
@Test
public void test() {
    SparkSession ss = SparkSession.builder().master("local").appName("test").getOrCreate();
    // Step 1) Read your MongoDB data. (The NullType field 'level' is added manually here for explanation.)
    // https://docs.mongodb.com/spark-connector/master/python/read-from-mongodb/
    Dataset<Row> data = ss.read().json("test.json").withColumn("level", lit(null));
    data.printSchema();
    StructType schema = createStructType(Arrays.asList(
        createStructField("_id", LongType, true),
        createStructField("level", StringType, true)));
    // Step 2) Create newData using the schema you defined.
    // Note: collectAsList() brings every row to the driver.
    Dataset<Row> newData = ss.createDataFrame(data.collectAsList(), schema);
    newData.printSchema();
    // Step 3) Load newData into Hive.
}
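Step 2 in the example relies on collectAsList(), which pulls the whole dataset onto the driver and may not fit in memory for a large collection. A minimal sketch of a variant that keeps the rows distributed is shown below, using the createDataFrame overload that takes a JavaRDD<Row>. SchemaReapplier and withSchema are hypothetical names used only for illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper class, not part of Spark.
public final class SchemaReapplier {
    private SchemaReapplier() {}

    /** Rebuilds the dataset with the given schema without collecting rows to the driver. */
    public static Dataset<Row> withSchema(SparkSession ss, Dataset<Row> data, StructType schema) {
        // javaRDD() exposes the rows as a distributed JavaRDD<Row>;
        // the new schema must match the rows field by field.
        return ss.createDataFrame(data.javaRDD(), schema);
    }
}
```

In the test above, this would replace the Step 2 line with Dataset<Row> newData = SchemaReapplier.withSchema(ss, data, schema);.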
Regarding java - how to cast a nested struct (unsupported NullType) in Java Spark, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61049642/