This article presents an example of reading and writing Parquet files with ParquetIO in Apache Beam.

Question

Has anybody tried reading or writing Parquet files with Apache Beam? Support was added only recently, in version 2.5.0, so there is not much documentation.

I am trying to read a JSON input file and would like to write it out in Parquet format.

Thanks in advance.

Recommended answer

ParquetIO lives in a separate module, so add the following dependency.

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-parquet</artifactId>
    <version>2.6.0</version>
</dependency>

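// Here is the code to write and read.

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.AvroCoder; // package name as of Beam 2.6.0
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// `schema` is the Avro Schema of your records; JsonObject comes from your JSON library.
PCollection<JsonObject> input = ...; // your data
PCollection<GenericRecord> pgr = input.apply("parse json", ParDo.of(new DoFn<JsonObject, GenericRecord>() {
        @ProcessElement
        public void processElement(ProcessContext context) {
            JsonObject json = context.element();
            GenericRecord record = ...; // convert json to a GenericRecord that follows schema
            context.output(record);
        }
    }))
    .setCoder(AvroCoder.of(schema)); // Beam cannot infer a coder for GenericRecord

// Write: ParquetIO.sink(schema) is used as the sink of a FileIO write.
pgr.apply(FileIO.<GenericRecord>write().via(ParquetIO.sink(schema)).to("path/to/save"));

// Read the Parquet files back as GenericRecords.
PCollection<GenericRecord> data = pipeline.apply(
            ParquetIO.read(schema).from("path/to/read"));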
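
The answer leaves the JSON-to-GenericRecord conversion as a placeholder. One possible way to fill it in, sketched below, is to let Avro's own JSON decoder do the work instead of copying fields one by one. Everything specific here is an assumption made for illustration: the class name JsonToAvroSketch, the User schema with its name and age fields, and the idea that your JsonObject can be turned into its JSON text (for example through toString()). The JSON must match the schema for the decoder to accept it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;

class JsonToAvroSketch {

    // Hypothetical schema for illustration; replace it with the schema of your data.
    static Schema userSchema() {
        return SchemaBuilder.record("User").namespace("example.avro")
            .fields()
            .requiredString("name")
            .requiredInt("age")
            .endRecord();
    }

    // Decodes one JSON document into a GenericRecord that follows the given schema.
    static GenericRecord toRecord(Schema schema, String json) {
        try {
            Decoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
            return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        } catch (IOException e) {
            // Rethrow unchecked so callers need no throws clause.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        GenericRecord record = toRecord(userSchema(), "{\"name\": \"Alice\", \"age\": 30}");
        System.out.println(record.get("name") + " is " + record.get("age"));
    }
}
```

Inside the DoFn, the placeholder line could then call something like toRecord(schema, json.toString()). Note that string fields come back as Avro Utf8 values, so call toString() on them before comparing with a Java String.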

