How to view Apache Parquet files in Windows?

Question

I couldn't find any plain English explanations regarding Apache Parquet files. Such as:


  1. What are they?
  2. Do I need Hadoop or HDFS to view/create/store them?
  3. How can I create parquet files?
  4. How can I view parquet files?

Any help regarding these questions is appreciated.

Recommended answer

What is Apache Parquet?

Apache Parquet is a binary file format that stores data in a columnar fashion. Data inside a Parquet file is similar to an RDBMS style table where you have columns and rows.
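To make "columnar" a bit more concrete, here is a small illustration in plain Python. It is not how Parquet encodes data on disk; it only shows the same table held row by row and then column by column. The sample rows and the `to_columns` helper are made up for this sketch.

```python
# Toy illustration of row-oriented vs column-oriented layout.
# This is NOT Parquet's on-disk encoding; it only shows the idea.

rows = [
    {"id": 1, "city": "Oslo", "temp_c": 3.5},
    {"id": 2, "city": "Lima", "temp_c": 19.0},
    {"id": 3, "city": "Oslo", "temp_c": 4.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Reading one column touches only that column's values, which is why
# columnar layouts suit queries that need only a few columns.
print(columns["city"])
```

Values of the same type also end up next to each other in each column list, which is the property that tends to help compression.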

Apache Parquet is one of the modern big data storage formats. It has several advantages, some of which are:


  • Columnar storage: efficient data retrieval, efficient compression, etc...
  • Metadata is at the end of the file: allows Parquet files to be generated from a stream of data. (common in big data scenarios)
  • Supported by all Apache big data products
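The point about metadata at the end of the file can be sketched with a toy format. This is an assumption-laden illustration, not Parquet itself: it uses JSON for both data and footer, whereas a real Parquet file stores Thrift-encoded metadata followed by a 4-byte footer length and the magic bytes `PAR1`. The function names `write_stream` and `read_batch` are invented for the example.

```python
import io
import json
import struct


def write_stream(out, batches):
    """Write batches as they arrive, then a footer that indexes them."""
    index = []
    for batch in batches:
        payload = json.dumps(batch).encode("utf-8")
        index.append({"offset": out.tell(), "length": len(payload)})
        out.write(payload)
    footer = json.dumps(index).encode("utf-8")
    out.write(footer)
    # The last 4 bytes record the footer length (little-endian).
    out.write(struct.pack("<I", len(footer)))


def read_batch(data, n):
    """Read batch n by consulting the footer first."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    index = json.loads(data[-4 - footer_len:-4])
    entry = index[n]
    start = entry["offset"]
    return json.loads(data[start:start + entry["length"]])


buffer = io.BytesIO()
# A generator stands in for a data stream whose size is unknown up front.
write_stream(buffer, ([i, i * i] for i in range(3)))
data = buffer.getvalue()
print(read_batch(data, 2))
```

Because the index is written last, the writer never has to go back and patch the start of the file, so it can consume a stream in a single pass. A reader starts from the end, finds the footer, and then jumps straight to the batch it wants.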

Do I need Hadoop or HDFS?

No. Parquet files can be stored in any file system, not just HDFS. As mentioned above, Parquet is a file format, so a Parquet file is like any other file: it has a name and a .parquet extension. In big data environments, though, one dataset is usually split (or partitioned) into multiple Parquet files for even more efficiency.
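Since a Parquet file is an ordinary file, it can be inspected with ordinary file I/O. Parquet files start and end with the 4-byte marker `PAR1`, so a quick sniff test needs nothing beyond the Python standard library. This is only a rough check, not a validation of the contents, and the helper name `looks_like_parquet` and the demo files are made up for the sketch.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Demo with two throwaway files: one that merely carries the markers
# (it is not a real Parquet file) and one plain text file.
with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "fake.parquet")
    with open(fake, "wb") as f:
        f.write(PARQUET_MAGIC + b"\x00" * 16 + PARQUET_MAGIC)

    text = os.path.join(tmp, "notes.txt")
    with open(text, "wb") as f:
        f.write(b"just some plain text")

    fake_ok = looks_like_parquet(fake)
    text_ok = looks_like_parquet(text)

print(fake_ok, text_ok)
```

The same function works on a local Windows drive, a network share, or any other file system that Python can open.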

All Apache big data products support Parquet files by default, which is why it might seem like the format can only exist in the Apache ecosystem.

How can I create and view Parquet files?

As mentioned, all current Apache big data products such as Hadoop, Hive, Spark, etc. support Parquet files by default.

So it's possible to leverage these systems to generate or read Parquet data. But this is far from practical. Imagine that in order to read or create a CSV file you had to install Hadoop/HDFS + Hive and configure them. Luckily there are other solutions.

To create your own Parquet files:


  • In Java please see my following post: Generate Parquet File using Java
  • In .NET please see the following library: parquet-dotnet

To view Parquet file contents:


  • Please try the following Windows utility: https://github.com/mukunku/ParquetViewer

Are there other methods?

Possibly, but not many exist, and they mostly aren't well documented. This is because Parquet is a very complicated file format (I could not even find a formal definition). The ones I've listed are the only ones I'm aware of as I'm writing this response.

10-29 08:15