

We've got a log analyzer that parses logs on the order of 100 GB (my test file is ~20 million lines, 1.8 GB). It's taking longer than we'd like (upwards of half a day), so I ran it under cProfile, and >75% of the time is being taken by strptime:

       1    0.253    0.253  560.629  560.629 <string>:1(<module>)
20000423  202.508    0.000  352.246    0.000 _strptime.py:299(_strptime)


The timestamps are parsed to calculate the durations between log entries, currently as:

ltime = datetime.strptime(split_line[time_col].strip(), "%Y-%m-%d %H:%M:%S")
lduration = (ltime - otime).total_seconds()

where otime is the timestamp from the previous line. The log entries look like this:


0000 | 774 | 475      | 2017-03-29 00:06:47 | M      |        63
0001 | 774 | 475      | 2017-03-29 01:09:03 | M      |        63
0000 | 774 | 475      | 2017-03-29 01:19:50 | M      |        63
0001 | 774 | 475      | 2017-03-29 09:42:57 | M      |        63
0000 | 775 | 475      | 2017-03-29 10:24:34 | M      |        63
0001 | 775 | 475      | 2017-03-29 10:33:46 | M      |        63

It takes almost 10 minutes to run it against the test file.
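
For reference, here is a minimal, self-contained sketch of the baseline loop described above, run against the six sample lines. The names SAMPLE, TIME_COL and durations are made up for illustration, and the column index is an assumption based on the sample layout, not the real analyzer's code:

```python
from datetime import datetime

# The six sample log lines from above, embedded so the sketch is self-contained.
SAMPLE = """\
0000 | 774 | 475      | 2017-03-29 00:06:47 | M      |        63
0001 | 774 | 475      | 2017-03-29 01:09:03 | M      |        63
0000 | 774 | 475      | 2017-03-29 01:19:50 | M      |        63
0001 | 774 | 475      | 2017-03-29 09:42:57 | M      |        63
0000 | 775 | 475      | 2017-03-29 10:24:34 | M      |        63
0001 | 775 | 475      | 2017-03-29 10:33:46 | M      |        63
"""

TIME_COL = 3  # assumed index of the timestamp column after splitting on "|"


def durations(lines, time_col=TIME_COL):
    """Yield the seconds elapsed between each log line and the previous one."""
    otime = None
    for line in lines:
        split_line = line.split("|")
        ltime = datetime.strptime(split_line[time_col].strip(), "%Y-%m-%d %H:%M:%S")
        if otime is not None:
            yield (ltime - otime).total_seconds()
        otime = ltime


for seconds in durations(SAMPLE.splitlines()):
    print(seconds)
```

Every line pays for one strptime call, which is why that call dominates the profile.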


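Replacing strptime() with manual slicing of the fixed-width timestamp:

def to_datetime(d):
    # Fixed-width slices matching "%Y-%m-%d %H:%M:%S"
    return datetime.datetime(int(d[:4]), int(d[5:7]), int(d[8:10]),
                             int(d[11:13]), int(d[14:16]), int(d[17:19]))
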
brings that down to just over 3 minutes.

cProfile again reports:

       1    0.265    0.265  194.538  194.538 <string>:1(<module>)
20000423   62.688    0.000   62.688    0.000 analyzer.py:88(to_datetime)

This conversion still takes about a third of the time needed to run the entire analyzer. Inlining it reduces the conversion's footprint by about 20%, but converting the timestamp to a datetime still accounts for roughly 25% of the time spent processing these lines (with total_seconds() consuming another ~5% on top of that).

I may end up just writing a custom timestamp-to-seconds conversion to bypass datetime entirely, unless someone has another bright idea?
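
A minimal sketch of what such a custom conversion could look like, assuming every timestamp has the fixed-width "YYYY-MM-DD HH:MM:SS" layout shown above and that only differences between timestamps matter (no time zones or DST, same as the naive datetime version). The names to_seconds and _day_cache are made up for illustration. Since consecutive log lines usually share a date, the date part is converted once and cached, and the time of day is plain integer arithmetic:

```python
from datetime import date

# Hypothetical cache: "YYYY-MM-DD" -> seconds at midnight of that day.
_day_cache = {}


def to_seconds(ts):
    """Map 'YYYY-MM-DD HH:MM:SS' to integer seconds on a consistent scale.

    The zero point is arbitrary (derived from the proleptic Gregorian
    ordinal), so only differences between two results are meaningful.
    """
    day = ts[:10]
    base = _day_cache.get(day)
    if base is None:
        # Only reached the first time a given date is seen.
        base = date(int(day[:4]), int(day[5:7]), int(day[8:10])).toordinal() * 86400
        _day_cache[day] = base
    return base + int(ts[11:13]) * 3600 + int(ts[14:16]) * 60 + int(ts[17:19])


otime = to_seconds("2017-03-29 00:06:47")
ltime = to_seconds("2017-03-29 01:09:03")
lduration = ltime - otime  # plain int subtraction, no total_seconds() call
print(lduration)
```

This skips both the datetime construction and the total_seconds() call. Whether it actually beats the alternatives would have to be confirmed by profiling on the real data.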



So I kept looking and I've found a module that does a fantastic job:

Enter ciso8601:

from ciso8601 import parse_datetime
ltime = parse_datetime(sline[time_col].strip())

Which, via cProfile:

       1    0.254    0.254  123.795  123.795 <string>:1(<module>)
20000423    4.188    0.000    4.188    0.000 {ciso8601.parse_datetime}

which is ~84x faster than the naive approach via datetime.strptime(). That is not surprising, given that they wrote a C module to do it.
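
To reproduce the comparison on a smaller scale, a rough timeit sketch along these lines may help. It assumes the same "YYYY-MM-DD HH:MM:SS" format, includes ciso8601 only if it happens to be installed, and uses illustrative names (via_strptime, candidates, results) that are not from the analyzer. The absolute numbers depend on the machine and Python version, so treat them as relative indicators only:

```python
import timeit
from datetime import datetime


def via_strptime(d):
    """Baseline: the generic format-string parser."""
    return datetime.strptime(d, "%Y-%m-%d %H:%M:%S")


def to_datetime(d):
    """Hand-rolled parser using fixed-width slices."""
    return datetime(int(d[:4]), int(d[5:7]), int(d[8:10]),
                    int(d[11:13]), int(d[14:16]), int(d[17:19]))


candidates = {"strptime": via_strptime, "slicing": to_datetime}
try:
    # Third-party C extension; skipped when it is not installed.
    from ciso8601 import parse_datetime
    candidates["ciso8601"] = parse_datetime
except ImportError:
    pass

stamp = "2017-03-29 00:06:47"
expected = via_strptime(stamp)
results = {}
for name, fn in candidates.items():
    # Check that every parser agrees with strptime before timing it.
    assert fn(stamp) == expected
    results[name] = timeit.timeit(lambda: fn(stamp), number=20_000)
    print(f"{name:10s} {results[name]:.4f} s for 20,000 calls")
```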


