Python: consecutive lines between matches, similar to awk

Problem description

Given:

A multiline string, string (already read from a file, file).
Two patterns, pattern1 and pattern2, each of which matches a substring of exactly one line in string. These lines will be called line1 and line2.

The patterns are regex-patterns, but I can change their format if that makes it easier.

I am looking for a way to get all the lines between line1 and line2 in Python (we can safely assume that line1 comes before line2).

Of course this could be done in a for loop with a flag set by pattern1 and a break when pattern2 matches. I am looking for a more compact solution here, though. This is a trivial one-liner in awk:

awk '/pattern1/,/pattern2/' file

Example:

File:

aaa aa a
bbb bb b
ccc cc c
ddd dd d
eee ee e
fff ff f

pattern1: b bb

pattern2: d dd

Desired result:

bbb bb b
ccc cc c
ddd dd d

Recommended answer

In awk, the /start/, /end/ range pattern prints every line from the one in which /start/ is found up to and including the one in which /end/ is found. It is a useful construct and has been copied by Perl, sed, Ruby and others.

To do a range operator in Python, write a class that remembers, between calls, whether the start pattern has matched and the end pattern has not yet matched. We can use a regex (as awk does), or the class can be trivially modified to accept anything that returns a True or False status for a line of data.

Given your example file, you can do:

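import re

class FlipFlop:
    ''' Class to imitate the behavior of the /start/, /end/ flip-flop in awk '''
    def __init__(self, start_pattern, end_pattern):
        self.patterns = start_pattern, end_pattern
        self.state = False  # True while inside a start..end range

    def __call__(self, st):
        ms = [e.search(st) for e in self.patterns]
        if all(ms):
            # The line matches both patterns: emit it and close the range.
            self.state = False
            return True
        was_inside = self.state
        # Outside a range, test the start pattern (index 0);
        # inside a range, test the end pattern (index 1).
        if ms[self.state]:
            self.state = not self.state
        # Emit the line that opens the range, every line inside it,
        # and the line that closes it.
        return self.state or was_inside

with open('/tmp/file') as f:
    ff = FlipFlop(re.compile('b bb'), re.compile('d dd'))
    # Each line keeps its own newline, so suppress the one print() adds.
    print(''.join(line for line in f if ff(line)), end='')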

Prints:

bbb bb b
ccc cc c
ddd dd d
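
As noted above, the range test does not have to be a regex. The following is a minimal sketch of a predicate-based variant, assuming any two callables that take a line and return a truthy or falsy value. The name PredicateFlipFlop and the sample data are hypothetical and not part of the original answer:

```python
class PredicateFlipFlop:
    """Range test driven by two arbitrary predicates (hypothetical helper)."""

    def __init__(self, is_start, is_end):
        self.is_start = is_start
        self.is_end = is_end
        self.inside = False  # True while a range is open

    def __call__(self, line):
        if not self.inside:
            if not self.is_start(line):
                return False
            # The opening line is emitted; the range stays open unless
            # the same line also satisfies the end predicate.
            self.inside = not self.is_end(line)
            return True
        # Inside the range: emit the line and close on the end predicate.
        if self.is_end(line):
            self.inside = False
        return True


lines = ["aaa aa a", "bbb bb b", "ccc cc c", "ddd dd d", "eee ee e"]
ff = PredicateFlipFlop(lambda l: l.startswith("bbb"), lambda l: "d dd" in l)
selected = [line for line in lines if ff(line)]
print(selected)
```

Because the predicates are plain callables, the same object could just as well test line length, a prefix, or a parsed field instead of a pattern.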

That retains a line-by-line file read with the flexibility of the /start/,/end/ range pattern seen in other languages. Of course, you can take the same approach for a multiline string (assumed to be named s):

''.join(line+"\n" if ff(line) else "" for line in s.splitlines())

Idiomatically, in awk, you can get the same result as a flipflop using a flag:

$ awk '/b bb/{flag=1} flag{print $0} /d dd/{flag=0}' file

You can replicate that in Python as well (with more words):

flag=False
with open('file') as f:
    for line in f:
        if re.search(r'b bb', line):
            flag=True
        if flag:
            print(line.rstrip())
        if re.search(r'd dd', line):
            flag=False

This approach also works with an in-memory string.
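
One way to reuse the flag loop for both an open file and an in-memory string is to wrap it in a generator. This is only a sketch: lines_between is a hypothetical name, and it assumes the two patterns are passed as regex strings.

```python
import re


def lines_between(lines, start, end):
    """Yield every line from one matching `start` through one matching `end`.

    `lines` is any iterable of strings; `start` and `end` are regex strings.
    Mirrors the awk flag idiom: set the flag, emit, then clear the flag.
    """
    start_re, end_re = re.compile(start), re.compile(end)
    flag = False
    for line in lines:
        if start_re.search(line):
            flag = True
        if flag:
            yield line
        if end_re.search(line):
            flag = False


s = "aaa aa a\nbbb bb b\nccc cc c\nddd dd d\neee ee e\nfff ff f\n"
result = list(lines_between(s.splitlines(), r"b bb", r"d dd"))
print("\n".join(result))
```

For a file, the open file object can be passed as lines directly; each yielded line then keeps its trailing newline.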

Or, you can use a multi-line regex:

with open('/tmp/file') as f:
    print(''.join(re.findall(r'^.*b bb[\s\S]*d dd.*$', f.read(), re.M)))

But that requires reading the entire file into memory. Since you state the string has been read into memory, that is probably easiest in this case:

''.join(re.findall(r'^.*b bb[\s\S]*d dd.*$', s, re.M))
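
One caveat, offered as an observation rather than as part of the original answer: [\s\S]* is greedy, so if the end pattern occurs on more than one line, the match runs to the last such line, whereas awk closes the range at the first one. A non-greedy [\s\S]*? is one possible adjustment. The sample string below is made up for illustration:

```python
import re

# Hypothetical input in which the end pattern "d dd" occurs on two lines.
s = "aaa aa a\nbbb bb b\nccc cc c\nddd dd d\neee ee e\nddd dd d\nfff ff f\n"

# Greedy: runs from the start line to the LAST line containing "d dd".
greedy = re.findall(r'^.*b bb[\s\S]*d dd.*$', s, re.M)

# Non-greedy: stops at the FIRST line containing "d dd", as awk would.
lazy = re.findall(r'^.*b bb[\s\S]*?d dd.*$', s, re.M)

print(greedy[0].splitlines())
print(lazy[0].splitlines())
```

On the example file from the question the two patterns give the same result, since the end pattern appears only once there.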
