I have a file containing the following text:

$ more audit.log
2018-01-31 15:34:08 GMT:10.34.160.60(63788):agent3 @ pem:[31884] 00000:LOG: statement: DROP TABLE tmp_zombies
2018-01-31 15:58:52 GMT:127.0.0.1(45050):agent1 @ pem:[13182] 00000:LOG: statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)
2018-01-31 15:58:52 GMT:127.0.0.1(45050):agent1 @ pem:[13182] 00000:LOG: statement: DROP TABLE tmp_zombies
2018-01-31 16:24:00 GMT:10.34.160.55(57199):agent8 @ pem:[27888] 00000:LOG: statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)
2018-01-31 16:24:00 GMT:10.34.160.55(57199):agent8 @ pem:[27888] 00000:LOG: statement: DROP TABLE tmp_zombies
2018-01-31 21:08:47 GMT:[local]:pgsql @ p106:[26349] 00000:LOG: statement: create table global_pg_audit
        (
           rolename text not null,
           stmt_timestamp timestamp not null,
           source_ip text,
           target_ip text,
           dbname text,
           pid text,
           statement_type text,
           statement text
        );
2018-01-31 15:34:08 GMT:10.34.160.60(63788):agent3 @ pem:[31884] 00000:LOG: statement: DROP TABLE tmp_zombies


When I run this Python code:

    import re

    fullpathname = './audit.log'
    # Timestamp at the start of a line, then the rest of the entry
    regex_pattern = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(.*?)$', re.MULTILINE | re.DOTALL)
    with open(fullpathname, 'r') as f:
        log_entries = regex_pattern.findall(f.read())
    counter = 0
    for entry in log_entries:
        print('%d => [' % counter, entry, ']')
        counter = counter + 1


The output is as follows:

0 => [('2018-01-31 15:34:08','GMT:10.34.160.60(63788):agent3 @ pem:[31884] 00000:LOG: statement: DROP TABLE tmp_zombies')]
1 => [('2018-01-31 15:58:52','GMT:127.0.0.1(45050):agent1 @ pem:[13182] 00000:LOG: statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)')]
2 => [('2018-01-31 15:58:52','GMT:127.0.0.1(45050):agent1 @ pem:[13182] 00000:LOG: statement: DROP TABLE tmp_zombies')]
3 => [('2018-01-31 16:24:00','GMT:10.34.160.55(57199):agent8 @ pem:[27888] 00000:LOG: statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)')]
4 => [('2018-01-31 16:24:00','GMT:10.34.160.55(57199):agent8 @ pem:[27888] 00000:LOG: statement: DROP TABLE tmp_zombies')]
5 => [('2018-01-31 21:08:47','GMT:[local]:pgsql @ p106:[26349] 00000:LOG: statement: create table global_pg_audit')]
6 => [('2018-01-31 15:34:08','GMT:10.34.160.60(63788):agent3 @ pem:[31884] 00000:LOG: statement: DROP TABLE tmp_zombies')]
7 => [('2018-01-31 15:58:52','GMT:127.0.0.1(45050):agent1 @ pem:[13182] 00000:LOG: statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)')]



Note that in line 5 of the output, the code does not capture the entire statement, which should be:

    create table global_pg_audit
        (
           rolename text not null,
           stmt_timestamp timestamp not null,
           source_ip text,
           target_ip text,
           dbname text,
           pid text,
           statement_type text,
           statement text
        );


What is wrong with the code?

Thanks a lot!

Best answer

Your regex is anchored to the end of the line:

^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(.*?)$


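Because multiline mode is enabled, $ matches at every line break, not only at the end of the input. The lazy group (.*?) stops at the first position where $ can match, which is why the match ends right after global_pg_audit.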
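To see this behavior in isolation, here is a minimal sketch. The `sample` string is a shortened, hypothetical stand-in for the log in the question, not the real file; the pattern is the one from the question.

```python
import re

# Hypothetical two-entry sample; the second entry spans several lines.
sample = (
    "2018-01-31 16:24:00 GMT: DROP TABLE tmp_zombies\n"
    "2018-01-31 21:08:47 GMT: create table global_pg_audit\n"
    "    (\n"
    "       pid text\n"
    "    );\n"
)

# Pattern from the question: with re.MULTILINE, $ matches before each newline,
# so the lazy (.*?) never needs to consume a newline even though DOTALL allows it.
pattern = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(.*?)$',
    re.MULTILINE | re.DOTALL,
)

matches = pattern.findall(sample)
for timestamp, rest in matches:
    print(repr(timestamp), repr(rest))
```

The second tuple holds only the first line of the `create table` statement; the continuation lines do not start with a date, so they are skipped entirely.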



What you want is to match up to the next line that starts with a date. You can do that with a lookahead:

^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(.*?)(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)


The alternative |\Z lets the regex match the last entry as well, even though no date follows it.
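Putting the pieces together, a sketch of the lookahead approach might look like the following. The names `ENTRY_RE`, `parse_entries`, and `sample` are hypothetical, and the sample is a shortened stand-in for the real log. One deliberate deviation from the pattern above: the lookahead here is additionally anchored with `^`, on the assumption that a date appearing in the middle of a statement should not end an entry.

```python
import re

ENTRY_RE = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'        # group 1: timestamp at line start
    r'(.*?)'                                         # group 2: lazily, the rest of the entry
    r'(?=^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)',  # stop before the next dated line or at end of input
    re.MULTILINE | re.DOTALL,
)


def parse_entries(text):
    """Return (timestamp, body) tuples; body may span several lines."""
    # strip() drops the space after the timestamp and the trailing newline.
    return [(ts, rest.strip()) for ts, rest in ENTRY_RE.findall(text)]


# Hypothetical sample: the middle entry is a multi-line statement.
sample = (
    "2018-01-31 16:24:00 GMT: DROP TABLE tmp_zombies\n"
    "2018-01-31 21:08:47 GMT: create table global_pg_audit\n"
    "    (\n"
    "       pid text\n"
    "    );\n"
    "2018-01-31 15:34:08 GMT: DROP TABLE tmp_zombies\n"
)

entries = parse_entries(sample)
for counter, entry in enumerate(entries):
    print('%d => [' % counter, entry, ']')
```

For the real file, the same function could be called as `parse_entries(f.read())` inside `with open('./audit.log') as f:`. Reading the whole file at once is fine for small logs; for very large ones, grouping lines as they are read would likely be a better fit.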

See also the regex demo.

Source: "python - Parsing groups of related lines in a text file using Python regex", a question on Stack Overflow: https://stackoverflow.com/questions/48553106/
