python - Find a string between two substrings, and between a substring and the end of the file

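I have the following problem. I want to extract specific strings from several text files, and the files follow a certain pattern. For example: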

example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"

The files are all very different, but in every one of them I need the text between the words "Pear" and "Apple". I solved that part with the following code:

x = re.findall(r'Pear+\s(.*?)Apple', example_file ,re.DOTALL)

['this should be included1 ', 'this should be included2 ']

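What I can't figure out is how to also get the end of the string, the "this should be included3" part. So I'm wondering whether there is a way to specify something like this in a regular expression: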

 x = re.findall(r'Pear+\s(.*?)Apple OR EOF', example_file ,re.DOTALL)

So how do I match the text between the word 'Pear' and EOF (end of file)? Note that these are whole text files (so not just a single sentence).

Best answer

Match either Apple or $ (the anchor that matches at the end of the string):

x = re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)

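| separates the two alternatives, and (?:...) is a non-capturing group. The group limits the alternation to Apple or $, so the regex engine chooses between those two instead of splitting the whole pattern at the |. Because the group does not capture, re.findall still returns only the text from (.*?).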
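To see why the group matters, here is a small sketch comparing three variants of the pattern on a shortened input. The sample string and the variable names are made up for illustration; only the first pattern comes from the answer.

```python
import re

# Shortened, made-up sample with the same Pear/Apple structure.
text = "Pear a Apple b Pear c"

# Non-capturing group: findall returns only what (.*?) captured.
non_capturing = re.findall(r'Pear\s+(.*?)(?:Apple|$)', text, re.DOTALL)
print(non_capturing)  # ['a ', 'c']

# Capturing group: the pattern now has two groups, so findall returns tuples.
capturing = re.findall(r'Pear\s+(.*?)(Apple|$)', text, re.DOTALL)
print(capturing)  # [('a ', 'Apple'), ('c', '')]

# No group: | splits the whole pattern into "Pear\s+(.*?)Apple" OR "$",
# so the last Pear section is lost and the bare $ match yields ''.
ungrouped = re.findall(r'Pear\s+(.*?)Apple|$', text, re.DOTALL)
print(ungrouped)  # ['a ', '']
```

Only the non-capturing form gives a flat list that includes the final section.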

Note that I replaced Pear+\s with Pear\s+, because I suspect you wanted to match any amount of whitespace rather than any number of r characters.
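The difference between the two spellings shows up on inputs with repeated r characters or repeated spaces. The following sketch uses two made-up test strings; the variable names are illustrative only.

```python
import re

original = r'Pear+\s(.*?)Apple'   # "Pea", one or more "r", exactly one whitespace
corrected = r'Pear\s+(.*?)Apple'  # "Pear", then one or more whitespace

extra_r = "Pearrr x Apple"       # made-up input with repeated "r"
extra_spaces = "Pear   x Apple"  # made-up input with three spaces

# The original pattern accepts "Pearrr"; the corrected one does not.
print(re.findall(original, extra_r))   # ['x ']
print(re.findall(corrected, extra_r))  # []

# The original pattern consumes only one space, so the rest leaks into the capture.
print(re.findall(original, extra_spaces))   # ['  x ']
print(re.findall(corrected, extra_spaces))  # ['x ']
```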

Demo:

>>> import re
>>> example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"
>>> re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)
['this should be included1 ', 'this should be included2 ', 'this should be included3']
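Since the question is about whole text files, here is a minimal sketch of how the pattern might be applied to a file's contents. The helper names extract_sections and extract_from_file are hypothetical, and the UTF-8 encoding is an assumption about the files.

```python
import re
from pathlib import Path

# Pattern from the answer, compiled once for reuse across files.
PEAR_SECTION = re.compile(r'Pear\s+(.*?)(?:Apple|$)', re.DOTALL)

def extract_sections(text):
    """Return every section that follows 'Pear', up to 'Apple' or the end of the text."""
    return PEAR_SECTION.findall(text)

def extract_from_file(path):
    """Read a whole file into one string and extract its sections (UTF-8 is assumed)."""
    return extract_sections(Path(path).read_text(encoding="utf-8"))

# Text files usually end with a newline; $ also matches just before it.
print(extract_sections("Pear a Apple b Pear c\n"))  # ['a ', 'c']
```

Without re.MULTILINE, $ matches at the very end of the string and also just before a single trailing newline, so the last section comes back without that newline. Using \Z instead of $ would anchor strictly at the end of the string and keep the newline in the capture.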