Problem description
In Python 3.6, it takes longer to read a file if it contains line breaks. If I have two files, one with line breaks and one without line breaks (but otherwise containing the same text), then the file with line breaks will take around 100-200% of the time to read. I have provided a specific example.
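Step 1: Create a file with one line and a file with many lines
sizeMB = 128
sizeKB = 1024 * sizeMB
# File with a single line: phrases are separated by tabs
with open(r'C:\temp\bigfile_one_line.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!\t'*73) # There are roughly 73 phrases in one KB
# File with many lines: phrases are separated by newlines
with open(r'C:\temp\bigfile_newlines.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!\n'*73)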
Step 2: Read the file with one line and time the performance
IPython
%%timeit
with open(r'C:\temp\bigfile_one_line.txt', 'r') as f:
    text = f.read()
Output
1 loop, best of 3: 368 ms per loop
Step 3: Read the file with many lines and time the performance
IPython
%%timeit
with open(r'C:\temp\bigfile_newlines.txt', 'r') as f:
    text = f.read()
Output
1 loop, best of 3: 589 ms per loop
This is just one example. I have tested this for many different situations, and they do the same thing:
- Different file sizes, from 1 MB to 2 GB
- Using file.readlines() instead of file.read()
- Using a space instead of a tab ('\t') in the single-line file (i.e. "Hello World! ")
My conclusion is that files with newline characters ('\n') take longer to read than files without them. However, I would expect all characters to be treated the same. This can have important consequences for performance when reading a lot of files. Does anyone know why this happens?
I am using Python 3.6.1, Anaconda 4.3.24, and Windows 10.
Recommended answer
When you open a file in Python in text mode (the default), it uses what it calls "universal newlines" (introduced with PEP 278, but somewhat changed later with the release of Python 3). Universal newlines means that regardless of which newline characters are used in the file, you'll see only \n in Python. So a file containing foo\nbar would appear the same as a file containing foo\r\nbar or foo\rbar (since \n, \r\n and \r are all line ending conventions that have been used on some operating system at some time).
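The behaviour described above can be checked with a small sketch. It writes the same two words with each of the three line-ending conventions in binary mode (so the bytes on disk are exactly as given) and reads them back in default text mode. The helper name read_text_variants and the temporary file names are made up for this illustration.

```python
import os
import tempfile


def read_text_variants():
    """Write 'foo' and 'bar' with three line endings, then read each back in text mode."""
    variants = [("lf", b"foo\nbar"), ("crlf", b"foo\r\nbar"), ("cr", b"foo\rbar")]
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, raw in variants:
            path = os.path.join(tmp, name + ".txt")
            # Binary mode: the bytes are written without any newline handling.
            with open(path, "wb") as f:
                f.write(raw)
            # Default text mode (newline=None): universal newlines are active on read.
            with open(path, "r", encoding="ascii") as f:
                results[name] = f.read()
    return results


results = read_text_variants()
for name, text in results.items():
    print(name, repr(text))
```

All three files come back as the same string, 'foo\nbar', which is the translation the answer refers to.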
The logic that provides that support is probably what causes your performance differences. Even if the \n characters in the file are not being transformed, the code needs to examine them more carefully than it does non-newline characters.
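To make that extra work a little more concrete, the standard library exposes io.IncrementalNewlineDecoder, which the io documentation describes as a helper codec for universal newlines mode. The sketch below feeds it an already-decoded string (passing None as the underlying decoder) and shows that it both translates line endings and keeps track of which kinds it has seen. This only illustrates the kind of bookkeeping involved; it is not a measurement, and it does not prove that this step accounts for the timings in the question.

```python
import io

# decoder=None means the input is already str; translate=True turns
# \r\n and \r into \n, as default text mode does on read.
decoder = io.IncrementalNewlineDecoder(None, True)

translated = decoder.decode("foo\r\nbar\rbaz\n", final=True)
print(repr(translated))

# The decoder also records every line-ending style it encountered.
print(decoder.newlines)
```

Text without any \r or \n passes through unchanged, whereas text containing newlines has to be inspected to decide whether anything needs translating.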
I suspect the performance difference you see will disappear if you open your files in binary mode, where no such newline support is provided. You can also pass a newline parameter to open in Python 3, which can have various meanings depending on exactly what value you give. I have no idea what impact any specific value would have on performance, but it might be worth testing if the performance difference you're seeing actually matters to your program. I'd try passing newline="" and newline="\n" (or whatever your platform's conventional line ending is).
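A rough way to run the suggested comparison outside IPython is sketched below. It recreates smaller versions of the two files from the question in a temporary directory and times f.read() for default text mode, newline="", newline="\n" and binary mode. The function names, the default size of 1 MB and the variant labels are choices made for this sketch, and the numbers will vary by machine and platform. One documented detail worth keeping in mind: when writing in default text mode, Python translates '\n' to os.linesep, so on Windows the newline file holds \r\n on disk and is somewhat larger than the tab-separated one.

```python
import os
import tempfile
import timeit

PHRASES_PER_KB = 73  # same rough figure as in the question

# Ways of opening a file for reading that the answer suggests comparing.
OPEN_VARIANTS = {
    "text, default": {"mode": "r"},
    'text, newline=""': {"mode": "r", "newline": ""},
    'text, newline="\\n"': {"mode": "r", "newline": "\n"},
    "binary": {"mode": "rb"},
}


def make_file(path, separator, size_kb):
    """Write roughly size_kb KB of 'Hello World!' phrases joined by separator."""
    with open(path, "w") as f:  # default text mode, as in the question
        for _ in range(size_kb):
            f.write(("Hello World!" + separator) * PHRASES_PER_KB)


def best_read_time(path, repeat=3, **open_kwargs):
    """Return the best wall-clock time (seconds) of `repeat` full reads."""
    def read_all():
        with open(path, **open_kwargs) as f:
            f.read()
    return min(timeit.repeat(read_all, number=1, repeat=repeat))


def compare(size_kb=1024):
    """Time every open() variant against both files; keys are (file, variant)."""
    timings = {}
    with tempfile.TemporaryDirectory() as tmp:
        for label, separator in [("one line", "\t"), ("newlines", "\n")]:
            path = os.path.join(tmp, label.replace(" ", "_") + ".txt")
            make_file(path, separator, size_kb)
            for variant, kwargs in OPEN_VARIANTS.items():
                timings[(label, variant)] = best_read_time(path, **kwargs)
    return timings


timings = compare()
for (label, variant), seconds in timings.items():
    print(f"{label:9} | {variant:20} | {seconds * 1000:8.2f} ms")
```

Increasing size_kb toward the 128 MB used in the question should make the differences between the variants easier to see above timing noise.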