python - Why does readlines() read much more than sizehint?

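Background

I'm parsing very large text files (30GB+) in Python 2.7.6. To speed things up, I split the files into chunks and farm them out to subprocesses using the multiprocessing library. To do this, I iterate over the file in the main process, record the byte positions at which I want to split the input file, and pass those byte positions to the subprocesses, which then open the file and read in their chunks using file.readlines(chunk_size). However, I'm finding that the chunk read in seems to be much larger (4x) than the sizehint argument.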

Question

Why isn't sizehint being heeded?

Example code

The following code demonstrates my problem:

import sys

# set test chunk size to 2KB
chunk_size = 1024 * 2

count = 0
chunk_start = 0
chunk_list = []

fi = open('test.txt', 'r')
while True:
    # increment chunk counter
    count += 1

    # calculate new chunk end, advance file pointer
    chunk_end = chunk_start + chunk_size
    fi.seek(chunk_end)

    # advance file pointer to end of current line so chunks don't have broken
    # lines
    fi.readline()
    chunk_end = fi.tell()

    # record chunk start and stop positions, chunk number
    chunk_list.append((chunk_start, chunk_end, count))

    # advance start to current end
    chunk_start = chunk_end

    # read a line to confirm we're not past the end of the file
    line = fi.readline()
    if not line:
        break

    # reset file pointer from last line read
    fi.seek(chunk_end, 0)

fi.close()

# This code represents the action taken by subprocesses, but each subprocess
# receives one chunk instead of iterating the list of chunks itself.
with open('test.txt', 'r', 0) as fi:
    # iterate over chunks
    for chunk in chunk_list:
        chunk_start, chunk_end, chunk_num = chunk

        # advance file pointer to chunk start
        fi.seek(chunk_start, 0)

        # print some notes and read in the chunk
        sys.stdout.write("Chunk #{0}: Size: {1} Start {2} Real Start: {3} Stop {4} "
              .format(chunk_num, chunk_end-chunk_start, chunk_start, fi.tell(), chunk_end))
        chunk = fi.readlines(chunk_end - chunk_start)
        print("Real Stop: {0}".format(fi.tell()))

        # write the chunk out to a file for examination
        with open('test_chunk{0}'.format(chunk_num), 'w') as fo:
            fo.writelines(chunk)

Results

I ran this code with an input file (test.txt) of about 23.3KB, and it produced the following output:

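  Chunk #1: Size: 2052 Start 0 Real Start: 0 Stop 2052 Real Stop: 8193
  Chunk #2: Size: 2051 Start 2052 Real Start: 2052 Stop 4103 Real Stop: 10248
  Chunk #3: Size: 2050 Start 4103 Real Start: 4103 Stop 6153 Real Stop: 12298
  Chunk #4: Size: 2050 Start 6153 Real Start: 6153 Stop 8203 Real Stop: 14348
  Chunk #5: Size: 2050 Start 8203 Real Start: 8203 Stop 10253 Real Stop: 16398
  Chunk #6: Size: 2050 Start 10253 Real Start: 10253 Stop 12303 Real Stop: 18448
  Chunk #7: Size: 2050 Start 12303 Real Start: 12303 Stop 14353 Real Stop: 20498
  Chunk #8: Size: 2050 Start 14353 Real Start: 14353 Stop 16403 Real Stop: 22548
  Chunk #9: Size: 2050 Start 16403 Real Start: 16403 Stop 18453 Real Stop: 23893
  Chunk #10: Size: 2050 Start 18453 Real Start: 18453 Stop 20503 Real Stop: 23893
  Chunk #11: Size: 2050 Start 20503 Real Start: 20503 Stop 22553 Real Stop: 23893
  Chunk #12: Size: 2048 Start 22553 Real Start: 22553 Stop 24601 Real Stop: 23893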

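Each reported chunk size is about 2KB, all the start/stop positions line up as they should, and the real file position reported by fi.tell() looks correct, so I'm fairly sure my chunking algorithm is fine. However, the real stop positions show that readlines() is reading much more than the size hint. In addition, output files #1-#8 are 8.0KB each, far larger than the size hint.

Even if my attempt to break chunks only at line endings were wrong, readlines() still shouldn't need to read more than 2KB + one line. Files #9-#12 get progressively smaller, which makes sense because the chunk start points are getting closer to the end of the file, and readlines() won't read past the end of the file.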

Notes

My test input file just has the line number followed by "\n" printed on each line, 1-5000.
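I tried again with different chunk sizes and input file sizes, with similar results.
The readlines documentation says the amount read may be rounded up to an internal buffer size, so I tried opening the file without buffering (as shown), and it made no difference.
I'm using this algorithm to split the file because I need to support *.bz2 and *.gz compressed files, and *.gz files give me no way to determine the uncompressed file size without decompressing the file. *.bz2 files don't either, but I can seek 0 bytes from the end of those files and use fi.tell() to get the file size. See my related question.
Before the requirement to support compressed files was added, a previous version of this script used os.path.getsize() as the stopping condition for the chunking loop, and readlines seemed to work fine with that approach.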

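Best answer

The buffer the readlines documentation mentions is unrelated to the buffering controlled by the third argument of the open call. The buffer in question is this buffer in file_readlines: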

static PyObject *
file_readlines(PyFileObject *f, PyObject *args)
{
    long sizehint = 0;
    PyObject *list = NULL;
    PyObject *line;
    char small_buffer[SMALLCHUNK];

where SMALLCHUNK is defined earlier:

#if BUFSIZ < 8192
#define SMALLCHUNK 8192
#else
#define SMALLCHUNK BUFSIZ
#endif
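
To see how such a buffer could produce the numbers in the question, here is a simplified Python model of the behaviour, not the actual CPython implementation. It assumes SMALLCHUNK is 8192 and that readlines keeps reading whole buffers until the total read reaches sizehint, then finishes the line it stopped in. The function name model_readlines and the in-memory sample are made up for illustration.

```python
import io


def model_readlines(f, sizehint, smallchunk=8192):
    """Simplified model (an assumption, not CPython's code) of a
    readlines() that reads in whole `smallchunk`-byte buffers."""
    data = b""
    while True:
        block = f.read(smallchunk)
        if not block:
            break  # end of file
        data += block
        # The hint is only checked after a full buffer has been read.
        if sizehint > 0 and len(data) >= sizehint:
            break
    # If the read stopped in the middle of a line, complete that line.
    if sizehint > 0 and data and not data.endswith(b"\n"):
        data += f.readline()
    return io.BytesIO(data).readlines()


# Rebuild the question's test input in memory: the numbers 1-5000,
# one per line.
sample = "".join("%d\n" % i for i in range(1, 5001)).encode("ascii")
f = io.BytesIO(sample)
lines = model_readlines(f, 2052)
print(f.tell())  # prints 8193, the "Real Stop" reported for chunk #1
```

Under this model a 2KB hint always costs at least one 8192-byte buffer plus the remainder of a line, which is consistent with the stop positions reported in the question.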

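I don't know where BUFSIZ comes from, but it looks like you're getting the #define SMALLCHUNK 8192 case. In any case, readlines will never use a buffer smaller than 8 KiB, so you should probably make your chunks bigger than that.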
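
If small chunks are still needed, one possible alternative (a sketch that is not part of the original answer) is to avoid sizehint altogether. Because the chunk boundaries in the question already fall on line endings, reading exactly chunk_end - chunk_start bytes with read() and splitting afterwards should return only that chunk's lines. The helper name read_chunk_lines is hypothetical, and the file is opened in binary mode on the assumption that the recorded positions are byte offsets.

```python
import io


def read_chunk_lines(path, chunk_start, chunk_end):
    """Return the lines stored between two byte offsets of `path`.

    Assumes both offsets sit on line boundaries, as produced by the
    chunking loop in the question.
    """
    with open(path, "rb") as f:
        f.seek(chunk_start)
        # read() returns at most the requested number of bytes, and
        # fewer only when the end of the file is reached first.
        data = f.read(chunk_end - chunk_start)
    # Split on "\n" while keeping the line endings, like readlines().
    return io.BytesIO(data).readlines()
```

Each subprocess would call this with its own (chunk_start, chunk_end) pair. A chunk whose recorded end lies past the end of the file, such as chunk #12 above, simply yields the remaining lines.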