Problem description
I have 2 files with 38374732 lines in each and a size of 3.3 GB each. I am trying to join them on the first column. To do so, I decided to use pandas with the following code pulled from Stack Overflow:
import pandas as pd
import sys
a = pd.read_csv(sys.argv[1],sep='\t',encoding="utf-8-sig")
b = pd.read_csv(sys.argv[2],sep='\t',encoding="utf-8-sig")
chunksize = 10 ** 6
for chunk in a(chunksize=chunksize):
    merged = chunk.merge(b, on='Bin_ID')
    merged.to_csv("output.csv", index=False,sep='\t')
However I am getting a memory error (not surprising). I looked at the code with chunks for pandas (something like this: How to read a 6 GB csv file with pandas), but how do I implement it for two files in a loop? I don't think I can chunk the second file, as I need to look up the column in the whole second file. Is there a way out of this?
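For reference, the chunked approach described above would look roughly like the sketch below. This is only a sketch under the question's assumptions (tab-separated files that share a Bin_ID column): it passes chunksize to pd.read_csv instead of calling the DataFrame, keeps the second file fully in memory as the lookup table, and appends each merged chunk to the output instead of overwriting it.
import pandas as pd
import sys
# Load the lookup file once; it still has to fit in memory
b = pd.read_csv(sys.argv[2], sep='\t', encoding='utf-8-sig')
chunksize = 10 ** 6
first = True
# read_csv with chunksize returns an iterator of DataFrame chunks
for chunk in pd.read_csv(sys.argv[1], sep='\t', encoding='utf-8-sig', chunksize=chunksize):
    merged = chunk.merge(b, on='Bin_ID')
    # Write the header only once, then append the following chunks
    merged.to_csv('output.csv', index=False, sep='\t', mode='w' if first else 'a', header=first)
    first = False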
Recommended answer
This is already discussed in other posts like the one you mentioned (this, or this, or this).
As explained there, I would try using a dask dataframe to load the data and execute the merge, but depending on your PC you may still not be able to do it.
Minimal working example:
import dask.dataframe as dd
# Read the CSVs
df1 = dd.read_csv('data1.csv')
df2 = dd.read_csv('data2.csv')
# Merge them
df = dd.merge(df1, df2, on='Bin_ID').compute()
# Save the merged dataframe
df.to_csv('merged.csv', index=False)
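Note that calling .compute() pulls the whole merged result into a single in-memory pandas DataFrame, which can itself run out of memory on files of this size. A hedged variant (assuming the files are tab-separated with a shared Bin_ID column, as in the question; the file names and blocksize are illustrative placeholders) keeps the pipeline lazy and lets dask write one CSV per partition:
import dask.dataframe as dd
# Read the tab-separated files lazily; blocksize controls partition size
df1 = dd.read_csv('data1.csv', sep='\t', blocksize='64MB')
df2 = dd.read_csv('data2.csv', sep='\t', blocksize='64MB')
# Merge lazily and write one output file per partition, so the full
# result is never held in memory at once
merged = dd.merge(df1, df2, on='Bin_ID')
merged.to_csv('merged-*.csv', index=False, sep='\t')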
This concludes this post on joining two large files by column in Python; hopefully the recommended answer above is helpful.