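I have the following problem. I want to extract specific strings from several text files, and the text in those files follows a certain pattern. For example: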
example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"
Every file is very different, but in all of them I need the text between the words 'Pear' and 'Apple'. I solved that part with the following code:
x = re.findall(r'Pear+\s(.*?)Apple', example_file ,re.DOTALL)
This returns:
['this should be included1 ', 'this should be included2 ']
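What I can't figure out is how to also get the end of the string, the 'this should be included3' part. So I was wondering whether there is a way to specify something like this with a regular expression: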
x = re.findall(r'Pear+\s(.*?)Apple OR EOF', example_file ,re.DOTALL)
So how do I match everything between the word 'Pear' and EOF (end of file)? Note that these are all text files (so not a single sentence).
Best answer
Match either Apple or $ (an anchor that matches at the end of the string):
x = re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)
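The | specifies two alternatives, and (?:...) is a non-capturing group, so the regex engine knows to pick either Apple or $ as the terminator, and re.findall still returns only the one captured group. Note that I replaced Pear+\s with Pear\s+, because I suspect you wanted to match arbitrary whitespace, not an arbitrary number of r characters. Demo: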
>>> import re
>>> example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"
>>> re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)
['this should be included1 ', 'this should be included2 ', 'this should be included3']
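For text read from real files, the same pattern can be wrapped in a small helper. This is only a sketch: the names PEAR_SECTION and extract_pear_sections are made up for illustration, and stripping the surrounding whitespace from each result is an assumption about what you want. Without re.MULTILINE, $ also matches just before a single trailing newline, so a file that ends in a newline still yields its last section.

```python
import re

# Hypothetical names; the pattern itself is the one from the answer above.
PEAR_SECTION = re.compile(r'Pear\s+(.*?)(?:Apple|$)', re.DOTALL)

def extract_pear_sections(text):
    """Return the text after each 'Pear', up to the next 'Apple' or the end of the text."""
    # strip() removes the whitespace left in front of 'Apple' (an assumption
    # about the desired output, not part of the original answer).
    return [section.strip() for section in PEAR_SECTION.findall(text)]

# Multi-line sample that ends with a newline, as many text files do.
sample = "intro Pear first part\nstill first Apple skipped Pear last part\n"
print(extract_pear_sections(sample))
# ['first part\nstill first', 'last part']
```

To use it on a file, pass the file's full contents as one string, for example the result of open(path).read(), rather than iterating line by line, because a section may span several lines.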