Problem Description
I have a file stored in HDFS as part-m-00000.gz.parquet
I've tried to run hdfs dfs -text dir/part-m-00000.gz.parquet, but it's compressed, so I ran gunzip part-m-00000.gz.parquet, but it doesn't uncompress the file since it doesn't recognise the .parquet extension.
How do I get the schema / column names for this file?
Recommended Answer
You won't be able to "open" the file using hdfs dfs -text because it's not a text file; Parquet files are written to disk very differently compared to text files.
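Note that the .gz in the name refers to the gzip codec applied to the column chunks inside the Parquet file, not to the file as a whole, which is why gunzip refuses it. As a quick sanity check, a minimal sketch (using the path from the question): a Parquet file begins and ends with the 4-byte magic number PAR1.

# Parquet files start (and end) with the magic bytes "PAR1"
hdfs dfs -cat dir/part-m-00000.gz.parquet | head -c 4
# prints: PAR1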
For this very purpose, the Parquet project provides parquet-tools to do tasks like the one you are trying to do: open a file and inspect its schema, data, metadata, etc.
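For instance, a minimal sketch, assuming parquet-tools is installed locally and the file has first been copied out of HDFS:

# copy the file out of HDFS so the local tool can read it
hdfs dfs -get dir/part-m-00000.gz.parquet .
# print the schema (column names and types)
parquet-tools schema part-m-00000.gz.parquet
# print the footer metadata, including row counts and the compression codec
parquet-tools meta part-m-00000.gz.parquet
# show the first few records
parquet-tools head part-m-00000.gz.parquet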
Check out the parquet-tools project (which is, put simply, a jar file): parquet-tools
Cloudera, which supports and contributes heavily to Parquet, also has a nice page with examples on the usage of parquet-tools. An example from that page for your use case is
parquet-tools schema part-m-00000.parquet
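If you would rather not copy the file out of HDFS first, the same jar can also be run through Hadoop so that it resolves HDFS paths directly. A sketch, where the jar file name and version are placeholders for whatever build you have:

# run the tool via hadoop so the HDFS path is readable
hadoop jar parquet-tools-1.9.0.jar schema hdfs:///dir/part-m-00000.gz.parquet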
Check out the Cloudera page: Using the Parquet File Format with Impala, Hive, Pig, HBase, and MapReduce.