I have the following code to analyze a huge dataframe file (22 GB, more than 2 million rows and 3K columns). I tested the code on a smaller dataframe (head -1000 hugefile.txt) and it ran fine. However, when I run it on the huge file, it gives me a "Segmentation fault" core dump and writes out a core.<number> binary file.
I did some web searching, which suggested low_memory=False, and I also tried reading the dataframe with chunksize=1000, iterator=True and then pandas.concat on the chunks, but that still gives me memory problems (core dump). It does not even finish reading the file before dumping core, and all my test did was read the file and print some text. Please help me and let me know if there is a way I can analyze this huge file.
Versions
Python version: 3.6.2
NumPy version: 1.13.1
Pandas version: 0.20.3
OS: Linux/Unix
Script

#!/usr/bin/python
import pandas as pd
import numpy as np

path = "/path/hugefile.txt"
data1 = pd.read_csv(path, sep='\t', low_memory=False,chunksize=1000, iterator=True)
data = pd.concat(data1, ignore_index=True)

#######

i=0
marker_keep = 0
marker_remove = 0
while(i<(data.shape[0])):
    j=5 #starts at 6
    missing = 0
    NoNmiss = 0
    while (j < (data.shape[1]-2)):
        # each sample spans 3 columns: genotype, score, coverage
        if pd.isnull(data.iloc[i,j]) == True:
            missing = missing +1
            j= j+3
        # count a sample as non-missing if score >= 10 and score/coverage > 0.5
        elif ((data.iloc[i,j+1] >=10) & (((data.iloc[i,j+1])/(data.iloc[i,j+2])) > 0.5)):
            NoNmiss = NoNmiss +1
            j=j+3
        else:
            missing = missing +1
            j= j+3
    # keep the marker if at least half of its samples are non-missing
    if (NoNmiss/(missing+NoNmiss)) >= 0.5:
        marker_keep = marker_keep + 1
    else:
        marker_remove = marker_remove +1
    i=i+1


a = str(marker_keep)
b= str(marker_remove)
c = "marker keep: " + a + "; marker remove: " +b
result = open('PyCount_marker_result.txt', 'w')
result.write(c)
result.close()

Sample dataset:
Index   Group   Number1 Number2 DummyCol    sample1.NA  sample1.NA.score    sample1.NA.coverage sample2.NA  sample2.NA.score    sample2.NA.coverage sample3.NA  sample3.NA.score    sample3.NA.coverage
1   group1  13247   13249   Marker  CC  3   1   NA  0   0   NA  0   0
2   group1  13272   13274   Marker  GG  7   6   GG  3   1   GG  3   1
4   group1  13301   13303   Marker  CC  11  12  CC  5   4   CC  5   3
5   group1  13379   13381   Marker  CC  6   5   CC  5   4   CC  5   3
7   group1  13417   13419   Marker  GG  7   6   GG  4   2   GG  5   4
8   group1  13457   13459   Marker  CC  13  15  CC  9   9   CC  11  13
9   group1  13493   13495   Marker  AA  17  21  AA  11  12  AA  11  13
10  group1  13503   13505   Marker  GG  14  17  GG  9   10  GG  13  15
11  group1  13549   13551   Marker  GG  6   5   GG  4   2   GG  6   5
12  group1  13648   13650   Marker  NA  0   0   NA  0   0   NA  0   0
13  group1  13759   13761   Marker  NA  0   0   NA  0   0   NA  0   0
14  group1  13867   13869   Marker  NA  0   0   NA  0   0   NA  0   0
15  group1  13895   13897   Marker  CC  3   1   NA  0   0   NA  0   0
20  group1  14430   14432   Marker  GG  15  18  NA  0   0   GG  5   3
21  group1  14435   14437   Marker  GG  16  20  GG  3   1   GG  4   2
22  group1  14463   14465   Marker  AT  0   24  AA  3   1   TT  4   6
23  group1  14468   14470   Marker  CC  18  23  CC  3   1   CC  6   5
25  group1  14652   14654   Marker  CC  3   8   NA  0   0   CC  3   1
26  group1  14670   14672   Marker  GG  10  11  NA  0   0   NA  0   0

Error message:
Traceback (most recent call last):
  File "test_script.py", line 8, in <module>
    data = pd.concat(data1, ignore_index=True)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 206, in concat
    copy=copy)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 236, in __init__
    objs = list(objs)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 978, in __next__
    return self.get_chunk()
  File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1042, in get_chunk
    return self.read(nrows=size)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1005, in read
    ret = self._engine.read(nrows)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1748, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10885)
  File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)
  File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
  File "pandas/_libs/parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
/opt/gridengine/default/Federation/spool/execd/kcompute030/job_scripts/5883517: line 10: 29990 Segmentation fault      (core dumped) python3.6 test_script.py

Best answer

You are not actually processing your data in batches at all.
With data1 = pd.read_csv('...', chunksize=10000, iterator=True),
data1 becomes a pandas.io.parsers.TextFileReader, an iterator that yields the csv in 10000-row chunks, each one a DataFrame.
But pd.concat then consumes the entire iterator, and therefore tries to load the whole csv into memory, which completely defeats the purpose of using chunksize and iterator.
Correct usage of chunksize and iterator
To process the data in chunks, you have to loop over the actual DataFrame chunks yielded by the iterator that read_csv returns.

data1 = pd.read_csv(path, sep='\t',chunksize=1000, iterator=True)

for chunk in data1:
    # do my processing of DataFrame chunk of 1000 rows here

Minimal example
Suppose we have a csv bigdata.txt:
A1, A2
B1, B2
C1, C2
D1, D2
E1, E2

We want to process it one row at a time (for whatever reason).
Incorrect usage of chunksize and iterator
df_iter = pd.read_csv('bigdata.txt', chunksize=1, iterator=True, header=None)

df = pd.concat(df_iter)
df
##     0    1
## 0  A1   A2
## 1  B1   B2
## 2  C1   C2
## 3  D1   D2
## 4  E1   E2

We can see that the entire csv has been loaded into memory, even though chunksize is 1.
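
One quick way to see this concretely is to ask pandas how much memory the fully materialized DataFrame occupies (a small check using the standard memory_usage API):

# deep=True also counts the Python string objects stored in the object columns
print(df.memory_usage(deep=True).sum(), 'bytes held in memory')

For this toy csv the number is tiny, but pd.concat on the real file would have to fit all 22 GB (plus per-object overhead) in RAM at once, which is what kills the process.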
Correct usage
df_iter = pd.read_csv('bigdata.txt', chunksize=1, iterator=True, header=None)

for iter_num, chunk in enumerate(df_iter, 1):
    print('Processing iteration {0}'.format(iter_num))
    print(chunk)

##  Processing iteration 1
##      0    1
##  0  A1   A2
##  Processing iteration 2
##      0    1
##  1  B1   B2
##  Processing iteration 3
##      0    1
##  2  C1   C2
##  Processing iteration 4
##      0    1
##  3  D1   D2
##  Processing iteration 5
##      0    1
##  4  E1   E2
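
Applying the same pattern to your script, a minimal sketch could look like the following. It assumes the column layout from your sample data (three columns per sample: genotype, score, coverage, starting at column 6) and keeps the thresholds from your original loop:

#!/usr/bin/python
import pandas as pd

path = "/path/hugefile.txt"
marker_keep = 0
marker_remove = 0

# Read the file 1000 rows at a time; each chunk is an ordinary DataFrame,
# so peak memory is bounded by the chunk size rather than the full 22 GB file.
for chunk in pd.read_csv(path, sep='\t', chunksize=1000, iterator=True):
    for _, row in chunk.iterrows():
        missing = 0
        NoNmiss = 0
        j = 5  # first genotype column (column 6)
        while j < (chunk.shape[1] - 2):
            if pd.isnull(row.iloc[j]):
                missing += 1
            elif row.iloc[j+1] >= 10 and (row.iloc[j+1] / row.iloc[j+2]) > 0.5:
                NoNmiss += 1
            else:
                missing += 1
            j += 3
        if NoNmiss / (missing + NoNmiss) >= 0.5:
            marker_keep += 1
        else:
            marker_remove += 1

with open('PyCount_marker_result.txt', 'w') as result:
    result.write("marker keep: " + str(marker_keep) + "; marker remove: " + str(marker_remove))

The running totals marker_keep and marker_remove carry over from chunk to chunk, so the final counts come out the same as if the whole file had been loaded at once, but only one 1000-row chunk is ever held in memory.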
