python - Python: processing multiple files is very slow

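I have to read two different types of files at the same time in order to synchronize their data. The files are generated in parallel at different frequencies.
File 1, which will be very large (>10 GB), has the following structure: DATA is a field containing 100 characters, and the number that follows it is a synchronization signal common to both files (i.e., it changes at the same time in both files).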

DATA 1
DATA 1
... another 4000 lines
DATA 1
DATA 0
... another 4000 lines and so on

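File 2 is small (at most 10 MB, but more numerous) and has the same structure. The difference is the number of lines between changes of the sync signal: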

DATA 1
... another 300-400 lines
DATA 1
DATA 0
... and so on

Here is the code I use to read the files:

def getSynchedChunk(fileHandler, lastSynch, end_of_file):
    # end_of_file is not used in this function; it is kept for existing callers
    line_vector = []                          # initialize output list
    for line in fileHandler:                  # iterate over the file
        synch = int(line.split(';')[9])       # get synch signal (last field)
        line_vector.append(line)
        if synch != lastSynch:                # if a transition is detected
            lastSynch = synch                 # update lastSynch for later use
            return (lastSynch, line_vector, True)   # and exit - True = synch changed

    return (lastSynch, line_vector, False)    # exit if end of file is reached

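I have to synchronize the data blocks (lines that share the same sync signal value) and write new lines to another file.
I am using Spyder.
For testing I use smaller files: 350 MB for file 1 and 35 MB for file 2.
I also used the built-in Profiler to see where most of the time is spent: about 28 of the 46 seconds go into actually reading the data from the files. The rest is spent synchronizing the data and writing the new file.
If I scale that time up to files that are gigabytes in size, processing will take hours. I will try to change the way I do the processing to make it faster, but is there a faster way to read large files?
One line of data looks like this: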

01/31/19 08:20:55.886;0.049107050;-0.158385641;9.457415342;-0.025256720;-0.017626805;-0.000096349;0.107;-0.112;0

The values are sensor measurements. The last number is the sync value.

Best answer

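I recommend reading in the whole files first and doing the processing afterwards. This has the huge advantage that all the appending, concatenating, etc. that happens while reading is done internally by optimized modules. The synchronization can be done afterwards.
For this purpose I strongly recommend pandas, which in my view is by far the best tool for working with time series data such as measurements.
Assuming that CSV is the correct format for your text files, they can be imported with: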

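import pandas as pd

# The sample timestamps (01/31/19 ...) are month-first, so dayfirst stays at its default (False).
# Recent pandas infers the datetime format by default, so infer_datetime_format is not passed.
df = pd.read_csv('DATA.txt', sep=';', header=None, index_col=0, parse_dates=True)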

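To reduce memory consumption, you can either specify chunksize to split the file reading into pieces, or specify low_memory=True to split the reading process internally (assuming the final DataFrame fits in your memory). Note that low_memory=True is already the read_csv default, so writing it out only makes the choice explicit: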

# The sample timestamps are month-first, so dayfirst stays at its default (False).
df = pd.read_csv(
    'DATA.txt', sep=';', header=None, index_col=0,
    parse_dates=True,
    low_memory=True)  # the parser works through the file in internal chunks
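
When the large file does not fit in memory at all, the chunksize option can be combined with the sync signal so that only one chunk plus one unfinished block is held at a time. The following is a minimal sketch under stated assumptions, not the answer's own code: the helper name iter_sync_blocks and the default chunk size are hypothetical, and the sync value is assumed to be the last column, as in the sample line from the question.

```python
import io
import pandas as pd

def iter_sync_blocks(source, chunksize=100_000):
    """Yield (sync_value, block_frame) for each run of rows sharing a sync value."""
    reader = pd.read_csv(source, sep=';', header=None, index_col=0,
                         parse_dates=True, chunksize=chunksize)
    carry = None                                  # unfinished block from the previous chunk
    for chunk in reader:
        if carry is not None:
            chunk = pd.concat([carry, chunk])
        sync = chunk.iloc[:, -1]                  # assumption: sync is the last column
        # Number the runs of equal sync values inside this chunk.
        block_id = sync.ne(sync.shift()).cumsum().to_numpy()
        for bid, block in chunk.groupby(block_id, sort=False):
            if bid == block_id[-1]:
                carry = block                     # may continue in the next chunk
            else:
                yield int(block.iloc[0, -1]), block
    if carry is not None:
        yield int(carry.iloc[0, -1]), carry       # last block of the file

# Tiny in-memory stand-in for a real file (values are made up for the demo).
rows = [f"01/31/19 08:20:55.{i:03d};0.1;0.2;0.3;0.4;0.5;0.6;0.7;0.8;{s}"
        for i, s in enumerate([1, 1, 1, 0, 0, 1, 1, 1, 1, 0])]
for sync_value, block in iter_sync_blocks(io.StringIO("\n".join(rows)), chunksize=4):
    print(sync_value, len(block))
```

A block that straddles a chunk boundary is carried over and completed with the next chunk, so the blocks produced do not depend on the chunk size chosen.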

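Your data is now stored in a DataFrame, which is well suited to time series. The index has already been converted to a DatetimeIndex, which allows for convenient plotting, resampling, and so on.
The sync state can now be accessed easily, much as in a numpy array (just add the iloc accessor):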

df.iloc[:, 8]  # for all sync states
df.iloc[0, 8]  # for the first synch state
df.iloc[1, 8]  # for the second synch state
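
Because the sync column is an ordinary pandas Series, the rows where it changes can be found without a Python loop. The sketch below compares each value with its predecessor and takes a running sum of the changes, which gives every row a block number. The helper name label_sync_blocks and the demo column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

def label_sync_blocks(df, sync_col=-1):
    """Return a 0-based block number per row; it increases at each sync change."""
    sync = df.iloc[:, sync_col]
    changed = sync.ne(sync.shift())    # True on the first row of every block
    return changed.cumsum() - 1

# Made-up demo data: the last column plays the role of the sync signal.
demo = pd.DataFrame({"value": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                     "sync": [1, 1, 0, 0, 0, 1]})
demo["block"] = label_sync_blocks(demo)
print(demo.groupby("block").size().tolist())  # rows per block: [2, 3, 1]
```

With a block number on every row, the blocks can then be handled with groupby rather than line by line.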

This is ideal for fast, vectorized synchronization of two or more files.
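
As an illustration of what such a vectorized synchronization could look like, the sketch below numbers the sync blocks in both frames and pairs the k-th block of one file with the k-th block of the other. The question does not say what the combined rows should contain, so averaging the second file per block is purely an assumption, as are the function and column names. It also assumes both recordings start in the same block and that neither file skips a transition.

```python
import pandas as pd

def block_ids(sync):
    """0-based number of the sync block each row belongs to."""
    return sync.ne(sync.shift()).cumsum() - 1

def attach_block_means(fast, slow, sync_col="sync"):
    """Add the per-block mean of `slow` to every row of `fast` (hypothetical helper)."""
    fast = fast.assign(block=block_ids(fast[sync_col]))
    slow = slow.assign(block=block_ids(slow[sync_col]))
    slow_means = (slow.drop(columns=[sync_col])
                      .groupby("block").mean()
                      .add_prefix("slow_"))
    # Each row of `fast` looks up the mean of the matching block in `slow`.
    return fast.join(slow_means, on="block")

# Made-up demo data standing in for file 1 (fast) and file 2 (slow).
fast = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0], "sync": [1, 1, 1, 0, 0]})
slow = pd.DataFrame({"value": [10.0, 20.0, 30.0], "sync": [1, 1, 0]})
merged = attach_block_means(fast, slow)
print(merged)
```

If a different combination is needed, for example keeping the first or last row of each block, only the aggregation step has to change.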
To read the file depending on the available memory:

try:
    # First attempt: parse the file in one pass (low_memory=False).
    df = pd.read_csv(
        'DATA.txt', sep=';', header=None, index_col=0,
        parse_dates=True, low_memory=False)
except MemoryError:
    # Fallback: let the parser work through the file in internal chunks.
    df = pd.read_csv(
        'DATA.txt', sep=';', header=None, index_col=0,
        parse_dates=True, low_memory=True)

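This try/except solution may not be elegant, since it takes some time before the MemoryError is raised, but it is failsafe. And since low_memory=True will most likely reduce file-reading performance, the try block should be faster in most cases. Keep in mind that low_memory=True is the read_csv default, so the two branches only differ when the first call passes low_memory=False, and the final DataFrame has to fit in memory either way.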

Regarding "python - Python: processing multiple files is very slow", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54570844/