Question
We've got a log analyzer which parses logs on the order of 100 GB (my test file is ~20 million lines, 1.8 GB). It's taking longer than we'd like (upwards of half a day), so I ran it under cProfile, and more than 75% of the time is being taken by strptime:
  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    0.253    0.253  560.629  560.629 <string>:1(<module>)
20000423  202.508    0.000  352.246    0.000 _strptime.py:299(_strptime)
to calculate the durations between log entries, currently as:
ltime = datetime.strptime(split_line[time_col].strip(), "%Y-%m-%d %H:%M:%S")
lduration = (ltime - otime).total_seconds()
where otime is the timestamp from the previous line.
The log files are formatted like this:
0000 | 774 | 475 | 2017-03-29 00:06:47 | M | 63
0001 | 774 | 475 | 2017-03-29 01:09:03 | M | 63
0000 | 774 | 475 | 2017-03-29 01:19:50 | M | 63
0001 | 774 | 475 | 2017-03-29 09:42:57 | M | 63
0000 | 775 | 475 | 2017-03-29 10:24:34 | M | 63
0001 | 775 | 475 | 2017-03-29 10:33:46 | M | 63
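For context, the per-line work looks roughly like this (the "|" delimiter and time_col = 3 below are illustrative, not the analyzer's exact code), so strptime gets hit once for every one of the ~20 million lines:

from datetime import datetime

time_col = 3  # index of the timestamp field after splitting (assumed)
otime = None

with open("test.log") as fh:  # "test.log" stands in for the real log file
    for line in fh:
        split_line = line.split("|")
        ltime = datetime.strptime(split_line[time_col].strip(), "%Y-%m-%d %H:%M:%S")
        if otime is not None:
            lduration = (ltime - otime).total_seconds()
            # ... feed lduration into the analysis ...
        otime = ltime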
It takes almost 10 minutes to run against the test file.
Replacing strptime() with this (from this question):
def to_datetime(d):
    # d is a 'YYYY-MM-DD HH:MM:SS' string; slice the fields out directly
    ltime = datetime.datetime(int(d[:4]),
                              int(d[5:7]),
                              int(d[8:10]),
                              int(d[11:13]),
                              int(d[14:16]),
                              int(d[17:19]))
    return ltime
brings that down to just over 3 minutes.
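The relative cost of the two parsers is easy to sanity-check in isolation with timeit (numbers vary by machine; this only reproduces the ratio, not the analyzer itself):

import timeit

setup = """
import datetime

ts = "2017-03-29 00:06:47"

def to_datetime(d):
    # same slicing approach as above, compacted
    return datetime.datetime(int(d[:4]), int(d[5:7]), int(d[8:10]),
                             int(d[11:13]), int(d[14:16]), int(d[17:19]))
"""

print(timeit.timeit('datetime.datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")',
                    setup=setup, number=100_000))
print(timeit.timeit("to_datetime(ts)", setup=setup, number=100_000))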
cProfile again reports:
  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    0.265    0.265  194.538  194.538 <string>:1(<module>)
20000423   62.688    0.000   62.688    0.000 analyzer.py:88(to_datetime)
This conversion still accounts for about a third of the total run time of the analyzer. In-lining reduces the conversion's footprint by about 20%, but we're still looking at roughly 25% of the per-line processing time going to converting the timestamp into datetime format (with total_seconds() consuming another ~5% on top of that).
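For clarity, by in-lining I mean building the datetime directly in the per-line loop instead of paying 20 million function calls, roughly:

# inside the per-line loop; d is the stripped timestamp field
d = split_line[time_col].strip()
ltime = datetime.datetime(int(d[:4]), int(d[5:7]), int(d[8:10]),
                          int(d[11:13]), int(d[14:16]), int(d[17:19]))
lduration = (ltime - otime).total_seconds()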
I may end up just writing a custom timestamp-to-seconds conversion to bypass datetime entirely, unless someone has another bright idea?
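For the record, a pure-arithmetic version of that idea could look something like the sketch below: it maps 'YYYY-MM-DD HH:MM:SS' straight to seconds since 1970-01-01 without touching datetime at all, which is enough here since only differences matter (this is a sketch I put together, not code from the analyzer):

# cumulative days before each month in a non-leap year (index 1..12)
_DAYS_BEFORE_MONTH = (0, 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334)

def to_seconds(ts):
    """Convert 'YYYY-MM-DD HH:MM:SS' to seconds since 1970-01-01 (no timezone handling)."""
    y = int(ts[:4])
    m = int(ts[5:7])
    d = int(ts[8:10])
    # whole days from the epoch to Jan 1 of year y, counting leap days
    days = (y - 1970) * 365 + (y - 1969) // 4 - (y - 1901) // 100 + (y - 1601) // 400
    days += _DAYS_BEFORE_MONTH[m] + d - 1
    if m > 2 and y % 4 == 0 and (y % 100 != 0 or y % 400 == 0):
        days += 1  # this year's leap day has already passed
    return ((days * 24 + int(ts[11:13])) * 60 + int(ts[14:16])) * 60 + int(ts[17:19])

# durations then become plain subtraction:
# lduration = to_seconds(split_line[time_col].strip()) - oseconds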
Accepted Answer
So I kept looking, and I found a module that does a fantastic job:
Introducing ciso8601:
from ciso8601 import parse_datetime
...
ltime = parse_datetime(sline[time_col].strip())
Which, via cProfile:
  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    0.254    0.254  123.795  123.795 <string>:1(<module>)
20000423    4.188    0.000    4.188    0.000 {ciso8601.parse_datetime}
is ~84x faster than the naive approach via datetime.strptime() ... which is not surprising, given they wrote a C module to do it.
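ciso8601 installs from PyPI (pip install ciso8601), and the speedup is easy to reproduce in isolation along these lines (illustrative only; the timestamp is one of the sample values above):

import timeit

setup = """
import datetime
import ciso8601

ts = "2017-03-29 00:06:47"
"""

print(timeit.timeit('datetime.datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")',
                    setup=setup, number=100_000))
print(timeit.timeit("ciso8601.parse_datetime(ts)", setup=setup, number=100_000))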