Problem description
Is there a way to read a gzip-compressed .csv file into a dask dataframe?
I've tried it directly with
import dask.dataframe as dd
df = dd.read_csv("Data.gz")
but I get a Unicode error (probably because it is interpreting the compressed bytes). There is a "compression" parameter, but compression = "gz" doesn't work, and I can't find any documentation so far.
With pandas I can read the file directly without a problem, other than the result blowing up my memory ;-), but if I restrict the number of rows it works fine.
import pandas as pd
df = pd.read_csv("Data.gz", nrows=100)
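For reference, here is a minimal sketch of how the row limit and a chunked read might look in pandas. The sample file, its columns `a` and `b`, and the chunk size are assumptions made up for illustration. `nrows` and `chunksize` are standard `pandas.read_csv` parameters, and the gzip codec is named `"gzip"`, not `"gz"`.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a small gzip-compressed CSV so the sketch is self-contained
# (hypothetical sample data, not the file from the question).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "Data.gz")
with gzip.open(path, "wt") as fh:
    fh.write("a,b\n")
    for i in range(1000):
        fh.write(f"{i},{i * 2}\n")

# Read only the first 100 rows; the codec is stated explicitly here.
head = pd.read_csv(path, compression="gzip", nrows=100)

# Stream the whole file in chunks of 250 rows and aggregate as we go,
# so the full table never has to sit in memory at once.
total_rows = 0
sum_b = 0
for chunk in pd.read_csv(path, compression="gzip", chunksize=250):
    total_rows += len(chunk)
    sum_b += int(chunk["b"].sum())
```

Chunked reading keeps memory bounded, but it stays single-threaded, which is why a dask dataframe is still attractive for larger files.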
Recommended answer
It's actually a long-standing limitation of dask. Load the files with dask.delayed instead:
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
filenames = ...
dfs = [delayed(pd.read_csv)(fn) for fn in filenames]
df = dd.from_delayed(dfs) # df is a dask dataframe