I have the following code to analyze a huge data file (22 GB, over 2 million rows and 3K columns). I tested the code on a smaller file (head -1000 hugefile.txt) and it runs fine. However, when I run it on the huge file, it dies with a "Segmentation fault (core dumped)" and writes a core.<number> binary file.
Some web searching suggested setting low_memory=False, and I also tried reading the file with chunksize=1000, iterator=True and then combining the chunks with pandas.concat, but that still runs into memory problems (core dump). It does not even finish reading the file before dumping core, because the version I tested only reads the file and prints some text. Please let me know if there is a solution that would let me analyze this huge file.
Versions
Python version: 3.6.2
NumPy version: 1.13.1
pandas version: 0.20.3
OS: Linux/Unix
Script
#!/usr/bin/python
import pandas as pd
import numpy as np
path = "/path/hugefile.txt"
data1 = pd.read_csv(path, sep='\t', low_memory=False, chunksize=1000, iterator=True)
data = pd.concat(data1, ignore_index=True)
#######
i = 0
marker_keep = 0
marker_remove = 0
while i < data.shape[0]:
    j = 5  # genotype columns start at column 6; score and coverage follow each genotype
    missing = 0
    NoNmiss = 0
    while j < (data.shape[1] - 2):
        if pd.isnull(data.iloc[i, j]):
            missing = missing + 1
            j = j + 3
        elif (data.iloc[i, j + 1] >= 10) & ((data.iloc[i, j + 1] / data.iloc[i, j + 2]) > 0.5):
            NoNmiss = NoNmiss + 1
            j = j + 3
        else:
            missing = missing + 1
            j = j + 3
    if (NoNmiss / (missing + NoNmiss)) >= 0.5:
        marker_keep = marker_keep + 1
    else:
        marker_remove = marker_remove + 1
    i = i + 1
a = str(marker_keep)
b = str(marker_remove)
c = "marker keep: " + a + "; marker remove: " + b
result = open('PyCount_marker_result.txt', 'w')
result.write(c)
result.close()
Sample dataset:
Index Group Number1 Number2 DummyCol sample1.NA sample1.NA.score sample1.NA.coverage sample2.NA sample2.NA.score sample2.NA.coverage sample3.NA sample3.NA.score sample3.NA.coverage
1 group1 13247 13249 Marker CC 3 1 NA 0 0 NA 0 0
2 group1 13272 13274 Marker GG 7 6 GG 3 1 GG 3 1
4 group1 13301 13303 Marker CC 11 12 CC 5 4 CC 5 3
5 group1 13379 13381 Marker CC 6 5 CC 5 4 CC 5 3
7 group1 13417 13419 Marker GG 7 6 GG 4 2 GG 5 4
8 group1 13457 13459 Marker CC 13 15 CC 9 9 CC 11 13
9 group1 13493 13495 Marker AA 17 21 AA 11 12 AA 11 13
10 group1 13503 13505 Marker GG 14 17 GG 9 10 GG 13 15
11 group1 13549 13551 Marker GG 6 5 GG 4 2 GG 6 5
12 group1 13648 13650 Marker NA 0 0 NA 0 0 NA 0 0
13 group1 13759 13761 Marker NA 0 0 NA 0 0 NA 0 0
14 group1 13867 13869 Marker NA 0 0 NA 0 0 NA 0 0
15 group1 13895 13897 Marker CC 3 1 NA 0 0 NA 0 0
20 group1 14430 14432 Marker GG 15 18 NA 0 0 GG 5 3
21 group1 14435 14437 Marker GG 16 20 GG 3 1 GG 4 2
22 group1 14463 14465 Marker AT 0 24 AA 3 1 TT 4 6
23 group1 14468 14470 Marker CC 18 23 CC 3 1 CC 6 5
25 group1 14652 14654 Marker CC 3 8 NA 0 0 CC 3 1
26 group1 14670 14672 Marker GG 10 11 NA 0 0 NA 0 0
Error message:
Traceback (most recent call last):
File "test_script.py", line 8, in <module>
data = pd.concat(data1, ignore_index=True)
File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 206, in concat
copy=copy)
File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 236, in __init__
objs = list(objs)
File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 978, in __next__
return self.get_chunk()
File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1042, in get_chunk
return self.read(nrows=size)
File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1005, in read
ret = self._engine.read(nrows)
File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1748, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10885)
File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)
File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
File "pandas/_libs/parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
/opt/gridengine/default/Federation/spool/execd/kcompute030/job_scripts/5883517: line 10: 29990 Segmentation fault (core dumped) python3.6 test_script.py
Best answer
You are not actually processing your data in batches at all.
With data1 = pd.read_csv('...', chunksize=10000, iterator=True), data1 becomes a pandas.io.parsers.TextFileReader, an iterator that yields the csv in 10000-row chunks, each one a DataFrame.
But pd.concat then consumes the entire iterator, so it tries to load the whole csv into memory, which completely defeats the purpose of using chunksize and iterator.
Correct use of chunksize and iterator
To process the data in chunks, you have to iterate over the actual DataFrame chunks that read_csv provides.
data1 = pd.read_csv(path, sep='\t',chunksize=1000, iterator=True)
for chunk in data1:
# do my processing of DataFrame chunk of 1000 rows here
Minimal example
Suppose we have a csv bigdata.txt:
A1, A2
B1, B2
C1, C2
D1, D2
E1, E2
We want to process it one row at a time (for whatever reason).
Incorrect use of chunksize and iterator
df_iter = pd.read_csv('bigdata.txt', chunksize=1, iterator=True, header=None)
df = pd.concat(df_iter)
df
## 0 1
## 0 A1 A2
## 1 B1 B2
## 2 C1 C2
## 3 D1 D2
## 4 E1 E2
We can see that the entire csv has been loaded into memory, even though chunksize is 1.
Correct use
df_iter = pd.read_csv('bigdata.txt', chunksize=1, iterator=True, header=None)
for iter_num, chunk in enumerate(df_iter, 1):
print('Processing iteration {0}'.format(iter_num))
print(chunk)
## Processing iteration 1
## 0 1
## 0 A1 A2
## Processing iteration 2
## 0 1
## 1 B1 B2
## Processing iteration 3
## 0 1
## 2 C1 C2
## Processing iteration 4
## 0 1
## 3 D1 D2
## Processing iteration 5
## 0 1
## 4 E1 E2
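Applying the same pattern to your script, here is a minimal, untested sketch that assumes the column layout from your sample data (a genotype column at position 5, followed by its score and coverage). The counters live outside the chunk loop and accumulate chunk by chunk, so no full DataFrame is ever built. A coverage > 0 guard is added as an assumption to avoid division by zero; drop it if a zero coverage with a high score should count as non-missing.

#!/usr/bin/python
import pandas as pd

path = "/path/hugefile.txt"

marker_keep = 0
marker_remove = 0

# read_csv with chunksize yields DataFrames of at most 1000 rows each
for chunk in pd.read_csv(path, sep='\t', chunksize=1000, iterator=True):
    for i in range(chunk.shape[0]):
        j = 5  # genotype columns start at column 6; score and coverage follow
        missing = 0
        NoNmiss = 0
        while j < (chunk.shape[1] - 2):
            score = chunk.iloc[i, j + 1]
            coverage = chunk.iloc[i, j + 2]
            if pd.isnull(chunk.iloc[i, j]):
                missing += 1
            elif score >= 10 and coverage > 0 and (score / coverage) > 0.5:
                # coverage > 0 is an added safeguard against division by zero
                NoNmiss += 1
            else:
                missing += 1
            j += 3
        if NoNmiss / (missing + NoNmiss) >= 0.5:
            marker_keep += 1
        else:
            marker_remove += 1

with open('PyCount_marker_result.txt', 'w') as result:
    result.write("marker keep: {0}; marker remove: {1}".format(marker_keep, marker_remove))

Because each chunk is an ordinary DataFrame of at most 1000 rows, peak memory stays roughly at the size of one chunk instead of the whole 22 GB file.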