Problem description
I have written the following regex pattern to tag certain phrases:
pattern = """
P2: {<JJ>+ <RB>? <JJ>* <NN>+ <VB>* <JJ>*}
P1: {<JJ>? <NN>+ <CC>? <NN>* <VB>? <RB>* <JJ>+}
P3: {<NP1><IN><NP2>}
P4: {<NP2><IN><NP1>}
"""
This pattern would correctly tag a phrase such as:
a = 'The pizza was good but pasta was bad'
and give the desired output with 2 phrases:
- pizza was good
- pasta was bad
However, if my sentence is something like:
a = 'The pizza was awesome and brilliant'
it matches only the phrase:
'pizza was awesome'
instead of the desired:
'pizza was awesome and brilliant'
How do I incorporate the regex pattern for my second example as well?
Firstly, let's take a look at the POS tags that NLTK gives:
>>> from nltk import pos_tag
>>> sent = 'The pizza was awesome and brilliant'.split()
>>> pos_tag(sent)
[('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
>>> sent = 'The pizza was good but pasta was bad'.split()
>>> pos_tag(sent)
[('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'), ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]
(Note: the above is the output of pos_tag in NLTK v3.1; older versions might differ.)
What you want to capture is essentially:
- NN VBD JJ CC JJ
- NN VBD JJ
So let's catch them with these patterns:
>>> from nltk import RegexpParser
>>> sent1 = ['The', 'pizza', 'was', 'awesome', 'and', 'brilliant']
>>> sent2 = ['The', 'pizza', 'was', 'good', 'but', 'pasta', 'was', 'bad']
>>> patterns = """
... P: {<NN><VBD><JJ><CC><JJ>}
... {<NN><VBD><JJ>}
... """
>>> PChunker = RegexpParser(patterns)
>>> PChunker.parse(pos_tag(sent1))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
>>> PChunker.parse(pos_tag(sent2))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
That works, but it is "cheating" by hardcoding every pattern.
Let's go back to the POS patterns:
- NN VBD JJ CC JJ
- NN VBD JJ
These can be simplified to:
- NN VBD JJ (CC JJ)?
So you can use the optional operator (?) on a group in the chunk pattern, e.g.:
>>> patterns = """
... P: {<NN><VBD><JJ>(<CC><JJ>)?}
... """
>>> PChunker = RegexpParser(patterns)
>>> PChunker.parse(pos_tag(sent1))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
>>> PChunker.parse(pos_tag(sent2))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
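To see why the optional group covers both sentences, the same idea can be tried with Python's standard re module on a plain string of tags. This is only an illustration of the optional-group semantics, not how NLTK is implemented; the tag strings below are written by hand from the pos_tag output above.

```python
import re

# Hand-written tag sequences copied from the pos_tag output above (assumption:
# your tagger produces the same tags).
tags1 = "<DT><NN><VBD><JJ><CC><JJ>"           # The pizza was awesome and brilliant
tags2 = "<DT><NN><VBD><JJ><CC><NN><VBD><JJ>"  # The pizza was good but pasta was bad

# Same shape as the chunk rule: NN VBD JJ, optionally followed by CC JJ.
rule = re.compile(r"<NN><VBD><JJ>(?:<CC><JJ>)?")

matches1 = rule.findall(tags1)
matches2 = rule.findall(tags2)

print(matches1)  # one long match that includes the CC JJ tail
print(matches2)  # two short matches; the CC between them is left out
```

In the second sentence the CC is followed by NN rather than JJ, so the optional group is skipped and "but" stays outside both chunks.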
Most probably you're using an older tagger, which is why your patterns are different, but I hope the example above shows how to capture the phrases you need.
The steps are:
- First, check what the POS patterns are using pos_tag
- Then generalize the patterns and simplify them
- Then put them into the RegexpParser
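The last step can be sketched as follows to get the phrases back as plain strings, which is what the question asks for. This is a minimal sketch: the tokens are pre-tagged by hand from the pos_tag output above so that no tagger model is needed, and extract_phrases is a hypothetical helper name, not part of NLTK.

```python
from nltk import RegexpParser

# Pre-tagged tokens copied from the pos_tag output above (assumption: your
# tagger gives the same tags; otherwise call pos_tag on the token list).
tagged1 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'),
           ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
tagged2 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'),
           ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]

# The generalized rule: NN VBD JJ, optionally followed by CC JJ.
chunker = RegexpParser("P: {<NN><VBD><JJ>(<CC><JJ>)?}")

def extract_phrases(tagged):
    """Return every P chunk as a space-joined string (hypothetical helper)."""
    tree = chunker.parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "P")]

phrases1 = extract_phrases(tagged1)
phrases2 = extract_phrases(tagged2)
print(phrases1)
print(phrases2)
```

Filtering subtrees by label keeps only the P chunks and skips the top-level S node, so unchunked words such as "The" and "but" are dropped.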