Querying nested arrays in Parquet records

Question

I am trying different ways to query a record within an array of records and display the complete row as output.

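I don't know which nested object contains the string "pg", but I want to check whether any object has "pg". If "pg" exists, I want to display that complete row. How can I write a Spark SQL query on nested objects without specifying the object index? In other words, I don't want to use an index into children.name.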

My Avro record:

{
"name": "Parent",
"type":"record",
"fields":[
    {"name": "firstname", "type": "string"},

    {
        "name":"children",
        "type":{
            "type": "array",
            "items":{
                        "name":"child",
                        "type":"record",
                        "fields":[
                            {"name":"name", "type":"string"}
                        ]
                    }
            }
    }
]
}

I am using a Spark SQL context to query the DataFrame that was read. So if the input is

Row no   Firstname Children.name
    1    John       Max
                    Pg
    2    Bru        huna
                    aman

The output should return row 1, since it is the row where one object of children.name is pg.

val results = sqlc.sql("SELECT firstname, children.name FROM nestedread where children.name = 'pg'")
results.foreach(x=> println(x(0), x(1).toString))

The above query doesn't work, but it works when I query children[1].name.

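I would also like to know whether I can filter an array of records first and then explode it, instead of exploding first, creating a large number of rows, and then filtering.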

Recommended answer

It seems you can use

org.apache.spark.sql.functions.explode(e: Column): Column
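
Applied to the schema in the question, explode can be combined with a filter on the exploded element. The following is a minimal sketch in Scala, assuming Spark 2.x or later and a local session. The case classes, the object name ExplodeThenFilter and the helper matchingChildren are hypothetical names made up for illustration, and the sample data uses a lowercase "pg" because the comparison is case-sensitive.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical case classes mirroring the Avro schema in the question.
case class Child(name: String)
case class Parent(firstname: String, children: Seq[Child])

object ExplodeThenFilter {
  // Explode the children array into one row per child, then keep only the
  // children whose name equals childName (case-sensitive comparison).
  def matchingChildren(parents: DataFrame, childName: String): DataFrame =
    parents
      .select(col("firstname"), explode(col("children")).as("child"))
      .where(col("child.name") === childName)
      .select(col("firstname"), col("child.name").as("child_name"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-then-filter")
      .getOrCreate()
    import spark.implicits._

    // Sample data modelled on the table in the question.
    val parents = Seq(
      Parent("John", Seq(Child("Max"), Child("pg"))),
      Parent("Bru", Seq(Child("huna"), Child("aman")))
    ).toDF()

    matchingChildren(parents, "pg").show(truncate = false)
    spark.stop()
  }
}
```

Note that this explodes first and filters afterwards, so it produces one row per matching child rather than the complete original row.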

For example, in my project (in Java), I have nested JSON like this:

{
    "error": [],
    "trajet": [
        {
            "something": "value"
        }
    ],
    "infos": [
        {
            "something": "value"
        }
    ],
    "timeseries": [
        {
            "something_0": "value_0",
            "something_1": "value_1",
            ...
            "something_n": "value_n"
        }
    ]
}

I wanted to analyse the data in "timeseries", so I did:

DataFrame ts = jsonDF.select(org.apache.spark.sql.functions.explode(jsonDF.col("timeseries")).as("t"))
                     .select("t.something_0",
                             "t.something_1",
                             ...
                             "t.something_n");

I'm new to Spark too. Hope this gives you a hint.
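
For the two points raised in the question (returning the complete row without indexing into children, and filtering before exploding), one possible approach is array_contains, which lives in org.apache.spark.sql.functions and is also usable inside Spark SQL. The sketch below is an illustration under assumptions, not the answerer's method: it assumes Spark 2.x or later, a local session, and the view name nestedread from the question. The case classes, the object name FilterThenExplode and the helper parentsWithChild are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_contains, col, explode}

// Hypothetical case classes mirroring the Avro schema in the question.
case class Child(name: String)
case class Parent(firstname: String, children: Seq[Child])

object FilterThenExplode {
  // children.name on an array of structs yields an array of strings, so
  // array_contains can test every element without an explicit index.
  def parentsWithChild(parents: DataFrame, childName: String): DataFrame =
    parents.where(array_contains(col("children.name"), childName))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-then-explode")
      .getOrCreate()
    import spark.implicits._

    val parents = Seq(
      Parent("John", Seq(Child("Max"), Child("pg"))),
      Parent("Bru", Seq(Child("huna"), Child("aman")))
    ).toDF()
    parents.createOrReplaceTempView("nestedread")

    // SQL form: the whole row is kept when any child is named 'pg'.
    spark.sql(
      "SELECT firstname, children.name FROM nestedread " +
        "WHERE array_contains(children.name, 'pg')"
    ).show(truncate = false)

    // Filter first, then explode only the rows that survived the filter.
    parentsWithChild(parents, "pg")
      .select(col("firstname"), explode(col("children.name")).as("child_name"))
      .show(truncate = false)

    spark.stop()
  }
}
```

With the sample data above, only the John row should pass the filter, and the second query then explodes just that row's children. With an older SQLContext such as the sqlc in the question, the same SQL string could presumably be passed to sqlc.sql once the table is registered.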
