Problem description
I couldn't find any plain English explanations regarding Apache Parquet files. Such as:
- What are they?
- Do I need Hadoop or HDFS to view/create/store them?
- How can I create parquet files?
- How can I view parquet files?
Any help regarding these questions is appreciated.
Recommended answer
What is Apache Parquet?
Apache Parquet is a binary file format that stores data in a columnar fashion. Data inside a Parquet file is similar to an RDBMS style table where you have columns and rows.
Apache Parquet is one of the modern big data storage formats. It has several advantages, some of which are:
- Columnar storage: efficient data retrieval, efficient compression, etc...
- Metadata is at the end of the file: allows Parquet files to be generated from a stream of data. (common in big data scenarios)
- Supported by all Apache big data products
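As a rough illustration of why columnar storage makes retrieval efficient, here is a toy sketch in plain Python. It is not how Parquet is implemented (real files add row groups, pages, encodings, and compression), and the sample data and the `to_columns` helper are made up for this example.

```python
# Toy illustration only: the data and helper below are hypothetical,
# and real Parquet files are far more elaborate than a dict of lists.
rows = [
    {"id": 1, "city": "Oslo", "temp": 3.5},
    {"id": 2, "city": "Lima", "temp": 19.0},
    {"id": 3, "city": "Oslo", "temp": 4.1},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# A query that needs one column only touches that column's values;
# in a row layout every full record would have to be read.
temps = columns["temp"]
average_temp = sum(temps) / len(temps)
```

Values of the same column also sit next to each other (for example the repeated "Oslo"), which is one reason columnar data tends to compress well.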
Do I need Hadoop or HDFS?
No. Parquet files can be stored in any file system, not just HDFS. As mentioned above, it is a file format, so it is just like any other file: it has a name and a .parquet extension. In big data environments, though, one dataset is usually split (or partitioned) into multiple Parquet files for even more efficiency.
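Since a Parquet file is an ordinary file, it can be inspected with plain file I/O and no Hadoop at all. The sketch below assumes the layout of an unencrypted Parquet file as I understand it: the file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes just before the trailing magic hold the length of the footer metadata as a little-endian integer. The function name `read_footer_length` is made up for this example, and it does not decode the metadata itself.

```python
import struct

MAGIC = b"PAR1"  # magic bytes at the start and end of an unencrypted Parquet file


def read_footer_length(data: bytes) -> int:
    """Return the footer metadata length stored near the end of the file.

    Raises ValueError if the bytes do not look like a Parquet file.
    """
    # Smallest plausible file: leading magic + footer length + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic bytes)")
    # The 4 bytes before the trailing magic are a little-endian unsigned int.
    (footer_length,) = struct.unpack("<I", data[-8:-4])
    return footer_length
```

To try it on a real file, read the bytes with `open(path, "rb").read()` and pass them in; for large files it would be better to seek to the last 8 bytes instead of reading everything.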
All Apache big data products support Parquet files by default. That is why it might seem like it can only exist in the Apache ecosystem.
How can I create/read Parquet files?
As mentioned, all current Apache big data products such as Hadoop, Hive, Spark, etc. support Parquet files by default.
So it's possible to leverage these systems to generate or read Parquet data. But this is far from practical. Imagine that in order to read or create a CSV file you had to install Hadoop/HDFS + Hive and configure them. Luckily there are other solutions.
To create your own parquet files:
- In Java please see my following post: Generate Parquet File using Java
- In .NET please see the following library: parquet-dotnet
To view parquet file contents:
- Please try the following Windows utility: https://github.com/mukunku/ParquetViewer
Are there other methods?
Possibly. But not many exist, and they mostly aren't well documented. This is due to Parquet being a very complicated file format (I could not even find a formal definition). The ones I've listed are the only ones I'm aware of as I'm writing this response.