Problem description
I have a CPython issue that I cannot understand. It boils down to this: the same code that reads a small text file without trouble cannot read even a single line from a 20GB text file.
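Some useful information:
- The smaller file (~1MB) is a subset of the big 20GB file (its first 1MB)
- Both files are text files with lines about 2000 characters wide, separated by CR (\r)
The obvious solution: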
f = open(r'filename', 'r')
for line in f:
    print(line)
f.close()
works, but only for the short file. For the big one it hangs forever (or at least for longer than it should take to print the first line).
So I wanted to at least try to read one line like this:
f = open(r'filename', 'r')
print(f.readline())
f.close()
Similar situation here: it works instantly for the small file, but for the big one, after a substantial amount of time, it spits out this message:
Traceback (most recent call last):
  File "***", line 16, in <module>
    print(f.readline())
SystemError: ..\Objects\stringobject.c:3902: bad argument to internal function
How the heck should I read a big text file?
Update:
Turns out a human being thinks more clearly after enough sleep ;-). The problem is solved: I had overlooked one sentence in the documentation:
I had just assumed universal newlines were 'turned on' by default.
My statement above that
print(f.readline())
was reading just one line was partly wrong (my bad). Remember I said my small file was created by taking a chunk of the big one? During that operation the line endings changed from CR to CRLF, so with the small file what I saw really was the first line. All of that made me think the problem was not in the line endings.
Thank you all for your time and help.
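Since the confusion above came from the line endings silently changing between the two files, it can help to check them directly before reading any lines. The following is a minimal sketch, not part of the original post: the helper names `count_line_endings` and `sniff_line_endings` and the 64 KiB sample size are assumptions chosen for illustration. It reads only the start of the file in binary mode, so it should return quickly even for a very large file.

```python
def count_line_endings(data):
    """Count CRLF, lone CR, and lone LF sequences in a bytes sample."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,  # CRs that are not part of a CRLF pair
        "LF": data.count(b"\n") - crlf,  # LFs that are not part of a CRLF pair
    }


def sniff_line_endings(path, sample_size=64 * 1024):
    """Inspect only the first sample_size bytes of the file at path."""
    # Binary mode performs no newline translation, and the read is bounded,
    # so the file size does not matter here.
    with open(path, "rb") as f:
        return count_line_endings(f.read(sample_size))


if __name__ == "__main__":
    print(count_line_endings(b"abc\rdef\rghi\r"))      # CR-delimited sample
    print(count_line_endings(b"abc\r\ndef\r\nghi\r\n"))  # CRLF-delimited sample
```

If the sample happens to end between the `\r` and the `\n` of a CRLF pair, the counts can be off by one, which does not matter for telling the two conventions apart.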
Recommended answer
Although your "test" only prints one line, that does not mean it is reading only one line from the file. For me, with a \r-delimited test file, I also get only one line of output. However, if I read each line in using a for loop, it still prints only one line. And if I call readline() a second time on a multi-line file, it doesn't give any more lines.
Try opening the same file with the 'rU' parameter:
f = open('filename', 'rU')
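The 'rU' mode belongs to Python 2, which is what this question is about. For readers on Python 3, the situation is different: text mode uses universal newlines by default, and the 'U' flag was deprecated and then removed in Python 3.11. The sketch below is an illustration under those assumptions, using a throwaway temporary file as hypothetical test data. Passing `newline="\n"` makes only LF terminate a line, which roughly mimics what plain 'r' did here, while the default `newline=None` also treats CR and CRLF as line endings.

```python
import os
import tempfile

# Build a small CR-delimited sample file (hypothetical test data).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"abcdef\r" * 5)
    path = tmp.name

# newline="\n": only LF ends a line, so the whole file is returned as one "line".
with open(path, "r", newline="\n") as f:
    lf_only = list(f)

# Default (newline=None): CR, LF and CRLF all end a line and are translated to "\n".
with open(path, "r") as f:
    universal = list(f)

os.remove(path)

print(len(lf_only), len(universal))  # prints: 1 5
```

With a 20GB CR-delimited file, the first variant is the one that would try to build a single enormous line.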
My tests of a file with several lines of \r-delimited text give:
f = open('test.txt','r') # Opening the "wrong" way
for line in f:
    print line
Output:
abcdef
Then using rU:
f = open('test.txt','rU')
for line in f:
    print line
Output:
abcdef
abcdef
abcdef
abcdef
abcdef
In support of Joran's explanation, this test pretty much shows that the entire file is being loaded as a single line, and that the carriage return characters cause over-printing, which is why you see only one line of output:
f = open('test.txt','r') # Opening the "wrong" way again
for line in f:
    print "XXX{}YYY".format(line)
The output gets overwritten:
YYYdefdef
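Over-printing like this is easy to misread, because the terminal acts on each \r by moving the cursor back to the start of the line. One way to see what was actually read is to print the `repr()` of the string, which escapes control characters instead of sending them to the terminal. This is a small illustrative sketch; the `show_raw` helper and the sample string are hypothetical, not taken from the original test.

```python
def show_raw(text):
    """Return text with control characters such as \\r shown as escapes."""
    return repr(text)


# What a CR-delimited file looks like when it is read as one "line".
sample = "abcdef\rabcdef\r"
formatted = "XXX{}YYY".format(sample)

# Printing the repr keeps the carriage returns visible as \r.
print(show_raw(formatted))
print(sample.count("\r"), "carriage returns in the line")
```

Seeing several `\r` escapes inside what was supposed to be one line is a strong hint that the file is CR-delimited and is being read without universal newlines.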