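I'm trying to parse several sections out of a multi-line document. I want to capture all of the lines between each pair of keywords. Here is an example: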

Keyword 1: CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT

Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT

Keyword 4


I might also have:

Keyword 1: CAPTURE THIS TEXT
           CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT

Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT

Keyword 4


My code looks like this:

from pyparsing import *

EOL = LineEnd().suppress()
line = OneOrMore(Group(SkipTo(LineEnd()) + EOL))

KEYWORD_CAPTURE_AREA = Keyword("Keyword 1:").suppress() + line + Keyword("Keyword 2:").suppress() + line \
                    + Keyword("Keyword 3:").suppress() + line + Keyword("Keyword 4").suppress()


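My current approach returns no results when the captured text spans multiple lines. I assume there is a simple solution to this; I just haven't found it.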

Best answer

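The concept to learn with pyparsing is that each sub-expression runs on its own, with no knowledge of any containing or following expression. So when your line expression matches one or more "skip to the end of the current line", it has no idea that it should stop when it sees the next "Keyword" string, and it predictably reads all the way to the end of the input. Then, when the parser goes on to look for "Keyword 2", it is already far past that point, so it raises an exception.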
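To make the failure concrete, here is a minimal sketch of the question's grammar, cut down to two blocks. The sample string and the try_parse helper are made up for this illustration; only the line expression comes from the question. The parse should fail, because the unbounded repetition also consumes the "Keyword 2:" line.

```python
from pyparsing import Group, Keyword, LineEnd, OneOrMore, ParseException, SkipTo

EOL = LineEnd().suppress()

# The question's original definition: nothing tells OneOrMore where to stop.
greedy_line = OneOrMore(Group(SkipTo(LineEnd()) + EOL))

grammar = (
    Keyword("Keyword 1:").suppress() + greedy_line
    + Keyword("Keyword 2:").suppress() + greedy_line
    + Keyword("Keyword 3").suppress()
)

# Hypothetical two-block sample, invented for this sketch.
sample = "Keyword 1: first\nKeyword 2: second\nKeyword 3\n"


def try_parse(text):
    """Return the parsed list, or None if pyparsing raises ParseException."""
    try:
        return grammar.parseString(text).asList()
    except ParseException:
        return None


# greedy_line swallows every remaining line, so "Keyword 2:" is never found.
result = try_parse(sample)
print(result)
```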

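You need to tell OneOrMore that it should stop parsing if it finds "Keyword" at the start of a line, even though that line would normally match the repeated expression. A reasonable end-of-block marker is the word "Keyword" found at the start of a line. (You could make this more detailed and match "Keyword" + integer + ":" to make it really bulletproof.) Let's call it "start_of_block_marker":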

start_of_block_marker = LineStart() + "Keyword"
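The more detailed marker suggested in the parenthetical might be sketched as follows. This is an assumption about what it could look like, not code from the original answer: the names strict_marker, loose_marker, and matches_marker are invented here, and the colon is made optional because "Keyword 4" in the samples has none.

```python
from pyparsing import LineStart, Optional, ParseException, Word, nums

# The simple marker: any line that starts with "Keyword".
loose_marker = LineStart() + "Keyword"

# Stricter sketch: "Keyword", a number, and an optional colon.
strict_marker = LineStart() + "Keyword" + Word(nums) + Optional(":")


def matches_marker(marker, text):
    """True if `text` begins with something `marker` accepts."""
    try:
        marker.parseString(text)
        return True
    except ParseException:
        return False


captured = "Keywords appear in this captured line"
print(matches_marker(strict_marker, "Keyword 3:"))
print(matches_marker(strict_marker, "Keyword 4"))
print(matches_marker(loose_marker, captured))   # loose marker is fooled
print(matches_marker(strict_marker, captured))  # strict marker is not
```

The stricter form only matters if captured text can itself begin with the word "Keyword"; the rest of this answer keeps the simpler LineStart() + "Keyword" marker.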


To tell OneOrMore that this is the stop condition for its repetition, pass this expression as the stopOn argument:

line = OneOrMore(Group(SkipTo(LineEnd()) + EOL),
                 stopOn=LineStart() + "Keyword")


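This now parses all of your strings, but you are grouping inside the OneOrMore, when I think you really want all of the lines of a block collected into a single group. Also, the blank line between blocks 2 and 3 creates an extra empty line. Here is an improved version of line: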

line = Optional(EOL) + Group(OneOrMore(SkipTo(LineEnd()) + EOL,
                             stopOn=LineStart() + "Keyword"))
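The effect of moving Group outside the OneOrMore can be seen on a small input. The sketch below is illustrative only: the build helper, the two-block grammar, and the sample string are assumptions made for this comparison, while the two line definitions follow the ones given above.

```python
from pyparsing import Group, Keyword, LineEnd, LineStart, OneOrMore, Optional, SkipTo

EOL = LineEnd().suppress()
stop = LineStart() + "Keyword"

# Group inside the repetition: one group per captured line.
per_line = OneOrMore(Group(SkipTo(LineEnd()) + EOL), stopOn=stop)

# Group around the repetition: one group per keyword block.
per_block = Optional(EOL) + Group(OneOrMore(SkipTo(LineEnd()) + EOL, stopOn=stop))


def build(line_expr):
    """Hypothetical helper: a two-block grammar around the given line expression."""
    return (
        Keyword("Keyword 1:").suppress() + line_expr
        + Keyword("Keyword 2:").suppress() + line_expr
        + Keyword("Keyword 3").suppress()
    )


# Invented sample: block 1 spans two lines, block 2 spans one.
sample = "Keyword 1: A\n           B\nKeyword 2: C\nKeyword 3"

flat = build(per_line).parseString(sample).asList()
nested = build(per_block).parseString(sample).asList()
print(flat)    # [['A'], ['B'], ['C']]
print(nested)  # [['A', 'B'], ['C']]
```

With the per-line grouping there is no way to tell from the result which lines belonged to which keyword; the per-block grouping keeps that boundary.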


I put your two test strings in a list and used that as the argument to runTests():

text1 = """\
Keyword 1: CAPTURE THIS TEXT
           CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT

Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT

Keyword 4"""

text2 = """\
Keyword 1: CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT

Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT

Keyword 4
"""
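
# KEYWORD_CAPTURE_AREA must be rebuilt with the improved `line` before this runs.
tests = [text1, text2]
KEYWORD_CAPTURE_AREA.runTests(tests)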


which prints (echoing each test, then printing the parsed results):

Keyword 1: CAPTURE THIS TEXT
           CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT

Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT

Keyword 4
[['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT'], ['CAPTURE THIS TEXT'], ['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']]
[0]:
  ['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']
[1]:
  ['CAPTURE THIS TEXT']
[2]:
  ['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']


Keyword 1: CAPTURE THIS TEXT
Keyword 2: CAPTURE THIS TEXT

Keyword 3:
CAPTURE THIS TEXT
CAPTURE THIS TEXT
CAPTURE THIS TEXT

Keyword 4

[['CAPTURE THIS TEXT'], ['CAPTURE THIS TEXT'], ['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']]
[0]:
  ['CAPTURE THIS TEXT']
[1]:
  ['CAPTURE THIS TEXT']
[2]:
  ['CAPTURE THIS TEXT', 'CAPTURE THIS TEXT', 'CAPTURE THIS TEXT']


If a test fails to parse, runTests() will show the offending line and position, and give the pyparsing error message.
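runTests() also returns a (success, results) pair, which is handy when the tests run unattended. The sketch below assumes a cut-down two-block grammar and two invented inputs, one of which is missing its second block; printResults=False only suppresses the printed report.

```python
from pyparsing import Group, Keyword, LineEnd, LineStart, OneOrMore, Optional, SkipTo

EOL = LineEnd().suppress()
line = Optional(EOL) + Group(
    OneOrMore(SkipTo(LineEnd()) + EOL, stopOn=LineStart() + "Keyword")
)
grammar = (
    Keyword("Keyword 1:").suppress() + line
    + Keyword("Keyword 2:").suppress() + line
    + Keyword("Keyword 3").suppress()
)

good = "Keyword 1: A\nKeyword 2: B\nKeyword 3"
bad = "Keyword 1: A\nKeyword 3"  # hypothetical input with block 2 missing

# The first element of the returned tuple is False if any test failed.
all_ok, _ = grammar.runTests([good], printResults=False)
mixed_ok, report = grammar.runTests([good, bad], printResults=False)
print(all_ok, mixed_ok)
```

Dropping printResults=False prints the usual report, including the error location for the failing input.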

For "python - Capturing multi-line blocks with pyparsing", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55909620/
