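I have the following problem. I want to extract specific strings from multiple text files, and the text in those files follows a pattern. For example: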

example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"


Every file is very different, but in all of them I need the text between the words 'Pear' and 'Apple'. I solved that with the following code:

x = re.findall(r'Pear+\s(.*?)Apple', example_file ,re.DOTALL)


This returns:

['this should be included1 ', 'this should be included2 ']


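What I can't figure out is how to also get the end of the string, the 'this should be included3' part. So I was wondering whether there is a way to specify that in the regular expression, something like: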

 x = re.findall(r'Pear+\s(.*?)Apple OR EOF', example_file ,re.DOTALL)


So how can I match between the word 'Pear' and EOF (end of file)? Note that these are all text files (so not a single sentence).

Best answer

Match either Apple or $ (an anchor that matches the end of the string):

x = re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)


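| separates the two alternatives, and (?:...) is a non-capturing group, so the parser knows to pick either Apple or $ as the end of the match without adding a second captured group to the results.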
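As a small sketch of why the group is non-capturing, the snippet below compares it with a plain capturing group. The shortened `text` sample is made up for illustration; with two capturing groups, `re.findall` returns tuples rather than plain strings.

```python
import re

# Shortened, made-up sample in the same shape as example_file.
text = "Pear one Apple skip Pear two"

# Non-capturing group: findall returns only the single captured group.
non_capturing = re.findall(r'Pear\s+(.*?)(?:Apple|$)', text, re.DOTALL)

# Capturing group: findall returns one tuple per match, one item per group.
capturing = re.findall(r'Pear\s+(.*?)(Apple|$)', text, re.DOTALL)

print(non_capturing)  # ['one ', 'two']
print(capturing)      # [('one ', 'Apple'), ('two', '')]
```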

Note that I replaced Pear+\s with Pear\s+, because I suspect you want to match arbitrary whitespace rather than an arbitrary number of r characters.
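The difference between the two spellings can be checked directly. This is only an illustrative sketch; the input strings are invented for the comparison.

```python
import re

# Pear+\s : one or more "r" characters, then exactly one whitespace character.
many_r_old = bool(re.match(r'Pear+\s', 'Pearrr x'))
# Pear\s+ : exactly one "r", then one or more whitespace characters.
many_r_new = bool(re.match(r'Pear\s+', 'Pearrr x'))

# With several spaces after Pear, the old pattern leaves the extra
# spaces inside the captured text; the new one consumes them all.
old_capture = re.findall(r'Pear+\s(.*?)Apple', 'Pear   a Apple')
new_capture = re.findall(r'Pear\s+(.*?)Apple', 'Pear   a Apple')

print(many_r_old, many_r_new)  # True False
print(old_capture)             # ['  a ']
print(new_capture)             # ['a ']
```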

Demo:

>>> import re
>>> example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"
>>> re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)
['this should be included1 ', 'this should be included2 ', 'this should be included3']
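Since the question is about several text files, here is a minimal sketch of applying the same pattern per file. The helper names `extract_sections` and `extract_from_files` are hypothetical, and UTF-8 encoding is an assumption about the files. Note that without re.MULTILINE, $ also matches just before a single trailing newline, so the last chunk does not include the newline that usually ends a file.

```python
import re
from pathlib import Path

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(r'Pear\s+(.*?)(?:Apple|$)', re.DOTALL)

def extract_sections(text):
    """Return every chunk between 'Pear' and the next 'Apple' (or the end of text)."""
    return PATTERN.findall(text)

def extract_from_files(paths):
    """Map each path to the chunks found in that file (hypothetical helper).

    Assumes the files are UTF-8 encoded.
    """
    return {str(p): extract_sections(Path(p).read_text(encoding="utf-8")) for p in paths}

# Multi-line sample ending in a newline, as a file usually does.
sample = "intro Pear first\nchunk Apple skipped Pear last chunk\n"
print(extract_sections(sample))  # ['first\nchunk ', 'last chunk']
```

If the trailing newline should be kept in the last chunk, \Z (which matches only at the very end of the string) can be used in place of $.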
