Problem description
I'm having trouble finding a library that allows Parquet files to be written using Python. Bonus points if I can use Snappy or a similar compression mechanism in conjunction with it.
Thus far the only method I have found is using Spark with the pyspark.sql.DataFrame Parquet support.
I have some scripts that need to write Parquet files that are not Spark jobs. Is there any approach to writing Parquet files in Python that doesn't involve pyspark.sql?
Recommended answer
Update (March 2017): There are currently 2 libraries capable of writing Parquet files:
- fastparquet
- pyarrow
Both of them still seem to be under heavy development, and they come with a number of disclaimers (e.g. no support for nested data), so you will have to check whether they support everything you need.
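To illustrate the pyarrow route, here is a minimal sketch (not part of the original answer; it assumes pyarrow and pandas are installed, and the file name example.parquet is a placeholder) that writes a Parquet file with the Snappy compression the question asks about:

```python
# A minimal sketch using pyarrow (pip install pyarrow pandas).
# "example.parquet" is an arbitrary placeholder file name.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Convert the pandas DataFrame to an Arrow table and write it out
# with Snappy compression.
table = pa.Table.from_pandas(df)
pq.write_table(table, "example.parquet", compression="snappy")

# Read the file back to verify the round trip.
print(pq.read_table("example.parquet").to_pandas())
```

fastparquet offers a similar single-call interface via fastparquet.write, which also accepts a Snappy compression option when the python-snappy package is installed.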
Old answer:
As of February 2016, there seems to be no Python-only library capable of writing Parquet files.
If you only need to read Parquet files, there is python-parquet.
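A reading sketch, under the assumption that python-parquet exposes the DictReader-style iterator its documentation describes (the file name and column names below are placeholders, not files from the original question):

```python
# A hedged sketch assuming python-parquet's DictReader interface
# (pip install parquet); "test.parquet" and the column names are
# illustrative placeholders.
import json
import parquet

with open("test.parquet", "rb") as fo:
    # Iterate over the rows as dictionaries, restricted to two columns.
    for row in parquet.DictReader(fo, columns=["id", "value"]):
        print(json.dumps(row))
```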
As a workaround you will have to rely on some other process, e.g. pyspark.sql (which uses Py4J and runs on the JVM, and can thus not be used directly from your average CPython program).
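For reference, such a pyspark.sql workaround might look like the following sketch (it assumes a local Spark installation and uses the SparkSession entry point, which postdates this answer; the app name and output path are placeholders):

```python
# A hedged sketch of the pyspark.sql workaround (pip install pyspark).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-write").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Spark writes a directory of part files rather than a single file;
# the compression option requests Snappy explicitly.
df.write.mode("overwrite").option("compression", "snappy").parquet("example_out.parquet")
spark.stop()
```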