Problem Description
I have searched almost the entire internet, but somehow none of the approaches seem to work in my case.
I have two large csv files (each with a million+ rows, about 300-400MB in size). They load fine into data frames with the read_csv function, without needing the chunksize parameter. I have even performed some minor operations on this data, like generating new columns, filtering, etc.
However, when I try to merge these two frames, I get a MemoryError. I have even tried to use SQLite to accomplish the merge, but in vain; the operation takes forever.
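For context, a disk-backed SQLite join of the kind described above can be sketched with the standard library's sqlite3 module and pandas.DataFrame.to_sql. The file names, table names, and columns below are made up for illustration; the idea is that the join happens on disk and the result is streamed back in chunks rather than materialized in RAM:

```python
import sqlite3
import pandas as pd

# Tiny stand-ins for the two large data frames (hypothetical columns).
df1 = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
df2 = pd.DataFrame({"key": [2, 3, 4], "b": [20, 30, 40]})

con = sqlite3.connect("merge.db")  # an on-disk database, not RAM
df1.to_sql("t1", con, index=False, if_exists="replace")
df2.to_sql("t2", con, index=False, if_exists="replace")

# Write the header row once, then let SQLite do the join on disk and
# stream the result back in chunks, appending each piece to a csv file.
pd.DataFrame(columns=["key", "a", "b"]).to_csv("merged.csv", index=False)
query = ("SELECT t1.key, t1.a, t2.b FROM t1 "
         "JOIN t2 ON t1.key = t2.key ORDER BY t1.key")
for chunk in pd.read_sql_query(query, con, chunksize=1000):
    chunk.to_csv("merged.csv", mode="a", header=False, index=False)
con.close()
```

Whether this is faster than an in-memory merge depends heavily on indexes and disk speed, which may explain why the attempt "takes forever" without an index on the join columns.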
Mine is a Windows 7 PC with 8GB of RAM, and the Python version is 2.7.
Thanks.
I tried chunking methods too. When I do, I don't get a MemoryError, but RAM usage explodes and my system crashes.
Recommended Answer
When you merge data with pandas.merge, it needs memory for df1, df2, and the merged result all at the same time; I believe that is why you get a MemoryError. Instead, keep df1 in memory, re-read the second file in chunks with the chunksize option, and append each merged chunk to a csv file.
There might be a better way, but you can try this. For large data sets, use the chunksize option in pandas.read_csv:
import pandas as pd

df1 = pd.read_csv("yourdata.csv")
df2 = pd.read_csv("yourdata2.csv")

# Create an empty bucket with the union of both column sets and write it
# out first, so df3.csv starts with the correct header row.
df_result = pd.DataFrame(columns=(df1.columns.append(df2.columns)).unique())
df_result.to_csv("df3.csv", index=False)

# (Only needed for a left join: these two lines would also save the rows
# that appear in df1 but have no match in df2.)
# df_result = df1[~df1.Colname1.isin(df2.Colname2)]
# df_result.to_csv("df3.csv", index=False, mode="a")

# Delete df2 to free memory; the file is re-read in chunks below.
del df2

def preprocess(chunk):
    # Merge one chunk of the second file against the full df1
    # and append the result to df3.csv.
    merged = pd.merge(df1, chunk, left_on="Colname1", right_on="Colname2")
    merged.to_csv("df3.csv", mode="a", header=False, index=False)

# Pick a chunksize that fits comfortably in your RAM.
reader = pd.read_csv("yourdata2.csv", chunksize=1000)
for chunk in reader:
    preprocess(chunk)
This will save the merged data as df3.csv.
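The chunked-merge recipe above can be sanity-checked end to end on tiny synthetic data. The file names, the Colname1/Colname2 keys, and the chunksize are placeholders taken from the answer, not real data:

```python
import pandas as pd

# Tiny stand-ins for the two large csv files.
pd.DataFrame({"Colname1": [1, 2, 3], "a": ["x", "y", "z"]}).to_csv(
    "yourdata.csv", index=False)
pd.DataFrame({"Colname2": [2, 3, 4], "b": [20, 30, 40]}).to_csv(
    "yourdata2.csv", index=False)

df1 = pd.read_csv("yourdata.csv")
df2 = pd.read_csv("yourdata2.csv")

# Write only the header row of the combined result, then drop df2.
header = pd.DataFrame(columns=(df1.columns.append(df2.columns)).unique())
header.to_csv("df3.csv", index=False)
del df2

# Re-read the second file in small chunks and append each merged piece.
for chunk in pd.read_csv("yourdata2.csv", chunksize=2):
    merged = pd.merge(df1, chunk, left_on="Colname1", right_on="Colname2")
    merged.to_csv("df3.csv", mode="a", header=False, index=False)

result = pd.read_csv("df3.csv")
print(result)
```

Only df1 and one chunk of the second file are ever in memory at once, which is what keeps the peak RAM usage bounded.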