How to read a compressed (gz) CSV file into a dask dataframe?

Question

Is there a way to read a .csv file that is compressed via gz into a dask dataframe?

I tried this directly:

import dask.dataframe as dd
df = dd.read_csv("Data.gz")

but I get a Unicode error (probably because it is interpreting the compressed bytes). There is a "compression" parameter, but compression = "gz" doesn't work, and I can't find any documentation so far.

With pandas I can read the file directly without a problem other than the result blowing up my memory ;-) but if I restrict the number of lines it works fine.

import pandas as pd
df = pd.read_csv("Data.gz", nrows=100)  # nrows limits how many rows are read

Recommended answer

It's actually a long-standing limitation of dask. Load the files with dask.delayed instead:

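import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

filenames = ...  # the paths of the .gz files to load
# Build one lazy pd.read_csv call per file; nothing is read yet
dfs = [delayed(pd.read_csv)(fn) for fn in filenames]

df = dd.from_delayed(dfs)  # df is a dask dataframe with one partition per file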
