Problem description
I am a long-time SAS user exploring a switch to Python and pandas.
However, when running some tests today, I was surprised that Python ran out of memory when trying to pandas.read_csv() a 128 MB CSV file. It had about 200,000 rows and 200 columns of mostly numeric data.
With SAS, I can import a CSV file into a SAS dataset, and it can be as large as my hard drive.
Is there something analogous in pandas?
I regularly work with large files and do not have access to a distributed computing network.
Recommended answer
In principle it shouldn't run out of memory, but read_csv currently has memory problems on large files, caused by some complex Python internal issues (this is vague, but it has been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution, but it's one I'll be working on in the near future. A tedious workaround is to transcribe the file row by row into a pre-allocated NumPy array or a memory-mapped file (np.memmap). Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000) and then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
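Both workarounds can be sketched roughly as follows. This is a minimal illustration, not the answerer's own code. The helper names read_csv_in_chunks and csv_to_preallocated_array, the in-memory sample data, and the assumption that every column is numeric and the row count is known in advance are all hypothetical.

```python
import csv
import io

import numpy as np
import pandas as pd


def read_csv_in_chunks(source, chunksize=1000):
    """Parse a CSV `chunksize` rows at a time, then join the pieces."""
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of parsing the whole file in one go.
    chunks = [chunk for chunk in pd.read_csv(source, chunksize=chunksize)]
    return pd.concat(chunks, ignore_index=True)


def csv_to_preallocated_array(lines, n_rows, n_cols):
    """Copy an all-numeric CSV row by row into a pre-allocated array.

    Assumes a single header line and that n_rows and n_cols are known.
    For data larger than RAM, the np.empty call could be swapped for an
    np.memmap backed by a file on disk.
    """
    out = np.empty((n_rows, n_cols), dtype=np.float64)
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    for i, row in enumerate(reader):
        out[i, :] = [float(value) for value in row]
    return out


# Small in-memory stand-in for a large file (hypothetical data).
sample_text = "a,b\n" + "\n".join(f"{i},{i * 0.5}" for i in range(2500))

df = read_csv_in_chunks(io.StringIO(sample_text), chunksize=1000)
arr = csv_to_preallocated_array(io.StringIO(sample_text), n_rows=2500, n_cols=2)
```

Note that the concatenated DataFrame still has to fit in memory. Chunking only avoids the peak caused by parsing the whole file at once, so each chunk can also be filtered or aggregated before concatenation if the full table is too large.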