How do I read multiple partitioned .gzip files into a Spark DataFrame?

Problem description

I have the following folder of partitioned data:

my_folder
 |--part-0000.gzip
 |--part-0001.gzip
 |--part-0002.gzip
 |--part-0003.gzip

I tried to read this data into a dataframe using:

>>> my_df = spark.read.csv("/path/to/my_folder/*")
>>> my_df.show(5)
+--------------------+
|                 _c0|
+--------------------+
|��[I���...|
|��RUu�[*Ք��g��T...|
|�t���  �qd��8~��...|
|�(���b4�:������I�...|
|���!y�)�PC��ќ\�...|
+--------------------+
only showing top 5 rows

I also tried inspecting the raw data through an RDD:

>>> rdd = sc.textFile("/path/to/my_folder/*")
>>> rdd.take(4)
['\x1f�\x08\x00\x00\x00\x00\x00\x00\x00�͎\\ǖ�7�~�\x04�\x16��\'��"b�\x04�AR_<G��"u��\x06��L�*�7�J�N�\'�qa��\x07\x1ey��\x0b\\�\x13\x0f\x0c\x03\x1e�Q��ڏ�\x15Y_Yde��Y$��Q�JY;s�\x1d����[��\x15k}[B\x01��ˀ�PT��\x12\x07-�\x17\x12�\x0c#\t���T۱\x01yf��\x14�S\x0bc)��\x1ex���axAO˓_\'��`+HM҈�\x12�\x17�@']

NOTE: When I run zcat part-0000.gzip | head -1 to read the file content, it shows that the data is tab-separated and in plain, readable English.

How do I read these files properly into a dataframe?
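
The leading '\x1f' byte in the rdd.take output above is consistent with the gzip magic number (bytes 0x1f 0x8b), which suggests Spark is handing back the compressed bytes without decompressing them. As a minimal sketch of the zcat check in plain Python, assuming the files sit on the local filesystem and are UTF-8 encoded (the helper names is_gzip and first_line are illustrative, not part of any library):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def first_line(path, encoding="utf-8"):
    """Decompress just enough of the file to return its first line."""
    with gzip.open(path, "rt", encoding=encoding) as f:
        return f.readline().rstrip("\n")
```

Calling is_gzip("/path/to/my_folder/part-0000.gzip") and then first_line(...) on the same path would confirm that the content is valid gzip regardless of the file extension, which narrows the problem down to how Spark selects a codec.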

Solution

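Spark does not recognize the .gzip file extension. It chooses the decompression codec from the file name suffix, and the gzip codec is registered under .gz, so files ending in .gzip are read as raw, uncompressed bytes. I therefore had to change the file extensions before reading the partitioned data: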

import os

# go to my_folder
os.chdir("/path/to/my_folder")

# rename every `.gzip` extension to `.gz` within my_folder
# (requires the Perl-based `rename` utility; the pattern is anchored so that
# only the trailing extension is rewritten)
cmd = r'rename "s/\.gzip$/.gz/" *.gzip'
result_code = os.system(cmd)

if result_code == 0:
    print("Successfully renamed the file extensions!")

    # finally read the tab-separated data into a dataframe
    my_df = spark.read.csv("/path/to/my_folder/*", sep="\t")
else:
    print("Could not rename the file extensions!")
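
The rename step above shells out to an external rename command, which may be missing or behave differently on some systems. A possible alternative is to do the renaming in Python itself. The following is only a sketch, assuming the files live on a local filesystem (not HDFS or object storage); the function name rename_gzip_to_gz is hypothetical:

```python
from pathlib import Path


def rename_gzip_to_gz(folder):
    """Rename every `*.gzip` file directly inside `folder` to `*.gz`.

    Returns the new paths, sorted by name.
    """
    renamed = []
    for src in sorted(Path(folder).glob("*.gzip")):
        dst = src.with_suffix(".gz")  # replaces only the final extension
        if dst.exists():
            raise FileExistsError(f"{dst} already exists")
        src.rename(dst)
        renamed.append(dst)
    return renamed
```

After calling rename_gzip_to_gz("/path/to/my_folder"), the same spark.read.csv("/path/to/my_folder/*", sep="\t") call as above should pick up the .gz files. Refusing to overwrite an existing target is a deliberate choice here, so a half-renamed folder is not silently clobbered.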
