How to view Apache Parquet file in Windows?

Question

I couldn't find any plain English explanations regarding Apache Parquet files. Such as:

What are they?
Do I need Hadoop or HDFS to view/create/store them?
How can I create parquet files?
How can I view parquet files?

Any help regarding these questions is appreciated.

Answer

What is Apache Parquet?

Apache Parquet is a binary file format that stores data in a columnar fashion. Data inside a Parquet file is similar to an RDBMS style table where you have columns and rows.

Apache Parquet is one of the modern big data storage formats. It has several advantages, some of which are:

Columnar storage: efficient data retrieval, efficient compression, etc.
Metadata is at the end of the file: allows Parquet files to be generated from a stream of data (common in big data scenarios).
Supported by all Apache big data products.
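To make the "binary format with metadata at the end" point concrete, here is a minimal sketch using only the Python standard library. It relies on the Parquet file layout for unencrypted files: the 4-byte marker PAR1 at the start and the end, with a 4-byte little-endian footer length stored just before the trailing marker. The function name and the synthetic byte string are made up for illustration; this only checks the envelope and does not decode the metadata itself.

```python
import struct

MAGIC = b"PAR1"  # 4-byte marker at both the start and the end of a Parquet file


def footer_length(data: bytes) -> int:
    """Return the size in bytes of the footer metadata in a Parquet file image."""
    # Smallest possible envelope: leading magic + footer length + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic bytes)")
    # The 4 bytes before the trailing magic hold the footer length, little-endian.
    (length,) = struct.unpack("<I", data[-8:-4])
    if length > len(data) - 12:
        raise ValueError("footer length is larger than the file")
    return length


# Synthetic stand-in for a real file: magic, "column data", a 5-byte "footer",
# the footer length, and the trailing magic. Not a valid Parquet file otherwise.
fake_file = MAGIC + b"column-data" + b"\x00" * 5 + struct.pack("<I", 5) + MAGIC
print(footer_length(fake_file))
```

With a real file you could pass the bytes read from disk instead of the synthetic value. Because the footer is written last, a writer can stream row data out first and only record the metadata once it knows what it wrote.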

Do I need Hadoop or HDFS?

No. Parquet files can be stored in any file system, not just HDFS. As mentioned above, it is a file format, so it's just like any other file: it has a name and a .parquet extension. What usually happens in big data environments, though, is that one dataset is split (or partitioned) into multiple Parquet files for even more efficiency.
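As an illustration of a dataset split across several files, the sketch below assumes the common "key=value" directory convention (often called Hive-style partitioning) used by tools such as Spark and Hive. The paths and the helper name are hypothetical; the answer itself does not prescribe any particular layout.

```python
from pathlib import PurePosixPath


def partition_values(relative_path: str) -> dict:
    """Extract key=value partition directories from a path inside a dataset."""
    directories = PurePosixPath(relative_path).parts[:-1]  # drop the file name
    return dict(d.split("=", 1) for d in directories if "=" in d)


# Hypothetical layout: one logical dataset stored as several Parquet files.
files = [
    "sales/year=2023/month=01/part-0000.parquet",
    "sales/year=2023/month=02/part-0000.parquet",
]
for f in files:
    print(f, partition_values(f))
```

Each file is still an ordinary .parquet file on an ordinary file system; the directory names only tell a reader which files it can skip for a given query.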

All Apache big data products support Parquet files by default. That is why it might seem like Parquet can only exist in the Apache ecosystem.

How can I create and view Parquet files?

As mentioned, all current Apache big data products such as Hadoop, Hive, Spark, etc. support Parquet files by default.

So it's possible to leverage these systems to generate or read Parquet data. But this is far from practical. Imagine that in order to read or create a CSV file you had to install Hadoop/HDFS + Hive and configure them. Luckily there are other solutions.

To create your own Parquet files:

In Java please see my following post: Generate Parquet File using Java
In .NET please see the following library: parquet-dotnet

To view Parquet file contents:

Please try the following Windows utility: https://github.com/mukunku/ParquetViewer

Are there other methods?

Possibly, but not many exist and they mostly aren't well documented. This is due to Parquet being a very complicated file format (I could not even find a formal definition). The ones I've listed are the only ones I'm aware of as I'm writing this response.
