Problem description
- A multiline string named string (already read from a file named file)
- Two patterns, pattern1 and pattern2, each of which matches a substring of exactly one line in string. These lines will be called line1 and line2.
The patterns are regular expressions, but I can change their format if that makes things easier.
I am looking for a way to get all the lines between line1 and line2 in Python (we can safely assume that line1 comes before line2).
Of course, this could be done in a for loop with a flag set by pattern1 and a break when pattern2 matches. I am looking for a more compact solution, though. In awk this is a trivial one-liner:
awk '/pattern1/,/pattern2/' file
Example:
File:
aaa aa a
bbb bb b
ccc cc c
ddd dd d
eee ee e
fff ff f
pattern1: b bb
pattern2: d dd
Desired result:
bbb bb b
ccc cc c
ddd dd d
Recommended answer
In awk, the /start/, /end/ range pattern prints every line from the one where /start/ is found up to and including the one where /end/ is found. It is a useful construct and has been copied by Perl, sed, Ruby and others.
To do a range operator in Python, write a class that keeps track of state between calls: whether a start match has been seen and no end match has followed it yet. We can use a regex (as awk does), or this can be trivially modified to use anything that returns a True or False status for a line of data.
Given your example file, you can do:
import re

class FlipFlop:
    ''' Class to imitate the behavior of the /start/, /end/ flip-flop in awk '''
    def __init__(self, start_pattern, end_pattern):
        self.patterns = start_pattern, end_pattern
        self.state = False

    def __call__(self, st):
        ms = [e.search(st) for e in self.patterns]
        # Both patterns match on the same line: a one-line range.
        if all(ms):
            self.state = False
            return True
        # Remember whether we were inside the range before this line.
        rtr = self.state
        # state indexes the pattern to test: False -> start, True -> end.
        if ms[self.state]:
            self.state = not self.state
        return self.state or rtr

with open('/tmp/file') as f:
    ff = FlipFlop(re.compile('b bb'), re.compile('d dd'))
    print(''.join(line for line in f if ff(line)), end='')
Prints:
bbb bb b
ccc cc c
ddd dd d
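As noted, the regex tests in the FlipFlop class above can be swapped for anything that returns a truthy or falsy value for a line. The following is only a sketch of that idea, not part of the original answer: the class name PredicateFlipFlop and its arguments is_start and is_end are hypothetical names chosen for illustration, and the sample text is the example file from the question held in a string.

```python
import re

class PredicateFlipFlop:
    """Range operator driven by two arbitrary predicates (hypothetical name)."""

    def __init__(self, is_start, is_end):
        self.is_start = is_start
        self.is_end = is_end
        self.active = False

    def __call__(self, line):
        if not self.active:
            if self.is_start(line):
                # The range opens here; it closes at once if the end test
                # also passes on this same line, as in awk.
                self.active = not self.is_end(line)
                return True
            return False
        # Inside the range: the closing line is still included.
        if self.is_end(line):
            self.active = False
        return True

# Example file contents from the question, held in memory.
s = "aaa aa a\nbbb bb b\nccc cc c\nddd dd d\neee ee e\nfff ff f\n"

# A plain substring test for the start, a compiled regex for the end.
ff = PredicateFlipFlop(lambda l: "b bb" in l, re.compile(r"d dd").search)
selected = [line for line in s.splitlines() if ff(line)]
print("\n".join(selected))
```

Any callable works for either side, so a substring test, a compiled pattern's search method, or a custom function can be mixed freely.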
That retains a line-by-line file read with the flexibility of the /start/,/end/ range seen in other languages. Of course, you can take the same approach with a multiline string (assumed to be named s):
''.join(line+"\n" if ff(line) else "" for line in s.splitlines())
Idiomatically, in awk, you can get the same result as a flip-flop using a flag:
$ awk '/b bb/{flag=1} flag{print $0} /d dd/{flag=0}' file
You can replicate that in Python as well (with more words):
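import re

flag = False
with open('file') as f:
    for line in f:
        # Open the range on a start match.
        if re.search(r'b bb', line):
            flag = True
        if flag:
            print(line.rstrip())
        # Close the range after printing, so the end line is included.
        if re.search(r'd dd', line):
            flag = False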
This can also be used with an in-memory string.
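For the in-memory case, one possible shape is a small generator that iterates over splitlines() instead of a file object. This is a sketch under assumptions rather than the answer's own code: the function name lines_between and its parameters are hypothetical, and s holds the example file from the question.

```python
import re

def lines_between(text, start_pattern, end_pattern):
    """Yield every line inside a start..end range, boundary lines included."""
    flag = False
    for line in text.splitlines():
        # Open the range on a start match.
        if re.search(start_pattern, line):
            flag = True
        if flag:
            yield line
        # Close the range after yielding, so the end line is kept.
        if re.search(end_pattern, line):
            flag = False

# Example file contents from the question, held in memory.
s = "aaa aa a\nbbb bb b\nccc cc c\nddd dd d\neee ee e\nfff ff f\n"

result = list(lines_between(s, r"b bb", r"d dd"))
print("\n".join(result))
```

Like the awk flag idiom, this reopens the range if the start pattern matches again later in the text; under the question's assumption of one match per pattern, that never happens.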
Or, you can use a multi-line regex:
with open('/tmp/file') as f:
    print(''.join(re.findall(r'^.*b bb[\s\S]*d dd.*$', f.read(), re.M)))
But that requires reading the entire file into memory. Since you state the string has already been read into memory, that is probably the easiest approach in this case:
''.join(re.findall(r'^.*b bb[\s\S]*d dd.*$', s, re.M))
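One caveat worth checking with this regex: [\s\S]* is greedy, so if the end pattern could match on more than one line, the match runs to the last such line rather than the first. The question rules this out (each pattern matches exactly one line), so the following is only an illustrative sketch. The sample text s2 is made up for the demonstration, and the non-greedy [\s\S]*? variant is a suggested adjustment, not part of the original answer.

```python
import re

# Hypothetical sample in which the end pattern 'd dd' occurs on two lines.
s2 = (
    "aaa aa a\n"
    "bbb bb b\n"
    "ccc cc c\n"
    "ddd dd d\n"
    "eee ee e\n"
    "ddd dd d again\n"
    "fff ff f\n"
)

# Greedy: runs through the LAST line containing 'd dd'.
greedy = re.search(r'^.*b bb[\s\S]*d dd.*$', s2, re.M).group(0)

# Non-greedy: stops at the FIRST line containing 'd dd', like awk's range.
lazy = re.search(r'^.*b bb[\s\S]*?d dd.*$', s2, re.M).group(0)

print(greedy)
print('---')
print(lazy)
```

With a single occurrence of each pattern, as in the question, both forms return the same three lines.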