Problem description
I have 2 files with 38374732 lines in each and a size of 3.3 GB each. I am trying to join them on the first column. To do so, I decided to use pandas with the following code pulled from Stack Overflow:
import pandas as pd
import sys
a = pd.read_csv(sys.argv[1],sep='\t',encoding="utf-8-sig")
b = pd.read_csv(sys.argv[2],sep='\t',encoding="utf-8-sig")
chunksize = 10 ** 6
for chunk in a(chunksize=chunksize):
    merged = chunk.merge(b, on='Bin_ID')
    merged.to_csv("output.csv", index=False,sep='\t')
However I am getting a memory error (not surprising). I looked at the code with chunks for pandas (something like this: How to read a 6 GB csv file with pandas), but how do I implement it for two files in a loop? I don't think I can chunk the second file, as I need to look up the column in the whole second file. Is there a way out of this?
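For reference, the chunked approach described above would look roughly like the sketch below. This is only a sketch under the question's assumptions (tab-separated files that share a Bin_ID column): it passes chunksize to pd.read_csv instead of calling the DataFrame, keeps the second file fully in memory as the lookup table, and appends each merged chunk to the output instead of overwriting it.
import pandas as pd
import sys
# Load the lookup file once; it still has to fit in memory
b = pd.read_csv(sys.argv[2], sep='\t', encoding='utf-8-sig')
chunksize = 10 ** 6
first = True
# read_csv with chunksize returns an iterator of DataFrame chunks
for chunk in pd.read_csv(sys.argv[1], sep='\t', encoding='utf-8-sig', chunksize=chunksize):
    merged = chunk.merge(b, on='Bin_ID')
    # Write the header only once, then append the following chunks
    merged.to_csv('output.csv', index=False, sep='\t', mode='w' if first else 'a', header=first)
    first = False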
Recommended answer
This is already discussed in other posts like the one you mentioned (this, or this, or this).
As explained there, I would try using a dask dataframe to load the data and execute the merge, but depending on your PC you may still not be able to do it.
Minimal working example:
import dask.dataframe as dd
# Read the CSVs
df1 = dd.read_csv('data1.csv')
df2 = dd.read_csv('data2.csv')
# Merge them
df = dd.merge(df1, df2, on='Bin_ID').compute()
# Save the merged dataframe
df.to_csv('merged.csv', index=False)
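Note that calling .compute() pulls the whole merged result into a single in-memory pandas DataFrame, which can itself run out of memory on files of this size. A hedged variant (assuming the files are tab-separated with a shared Bin_ID column, as in the question; the file names and blocksize are illustrative placeholders) keeps the pipeline lazy and lets dask write one CSV per partition:
import dask.dataframe as dd
# Read the tab-separated files lazily; blocksize controls partition size
df1 = dd.read_csv('data1.csv', sep='\t', blocksize='64MB')
df2 = dd.read_csv('data2.csv', sep='\t', blocksize='64MB')
# Merge lazily and write one output file per partition, so the full
# result is never held in memory at once
merged = dd.merge(df1, df2, on='Bin_ID')
merged.to_csv('merged-*.csv', index=False, sep='\t')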
This concludes this post on joining two large files by column in Python; hopefully the recommended answer above is helpful.