Question
I'm fairly new to NLTK and Python. I've been creating sentence parses using the toy grammars given in the examples, but I would like to know whether it's possible to use a grammar learned from a portion of the Penn Treebank, say, instead of writing my own or using the toy grammars. (I'm using Python 2.7 on a Mac.) Many thanks.
Recommended answer
If you want a grammar that precisely captures the Penn Treebank sample that comes with NLTK, you can do this, assuming you've downloaded the Treebank data for NLTK (for example with nltk.download('treebank')):
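import nltk
from nltk.corpus import treebank
# Note: in NLTK 3 the ContextFreeGrammar class is named CFG.
from nltk.grammar import ContextFreeGrammar, Nonterminal

# Collect every distinct production used in the parsed Treebank sample.
tbank_productions = set(production
                        for sent in treebank.parsed_sents()
                        for production in sent.productions())
# Build a grammar with S as the start symbol.
tbank_grammar = ContextFreeGrammar(Nonterminal('S'), list(tbank_productions))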
However, this will probably not give you anything useful. Since NLTK only supports parsing with grammars in which all the terminals are specified, you will only be able to parse sentences whose words all occur in the Treebank sample.
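The terminal-coverage limitation can be checked before parsing. The following is a minimal sketch, assuming NLTK 3 (where the grammar class is named CFG and provides a check_coverage method). The toy tree and the is_covered helper are hypothetical stand-ins so that the example runs without the corpus data; a grammar built from the Treebank sample could be passed in the same way.

```python
from nltk import CFG, Nonterminal, Tree

# Hypothetical stand-in for a Treebank tree, so the sketch runs
# without downloading any corpus data.
toy_tree = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
toy_grammar = CFG(Nonterminal("S"), toy_tree.productions())


def is_covered(grammar, tokens):
    """Return True if every token appears as a terminal in the grammar."""
    try:
        # check_coverage raises ValueError when some token is missing.
        grammar.check_coverage(tokens)
        return True
    except ValueError:
        return False


# Every word occurs in the tree the grammar was read from.
print(is_covered(toy_grammar, ["the", "dog", "barked"]))
# "cat" never occurs as a terminal, so this sentence cannot be parsed.
print(is_covered(toy_grammar, ["the", "cat", "barked"]))
```

A sentence that fails this check cannot be parsed with the grammar, whatever its structure.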
Also, because of the flat structure of many phrases in the Treebank, this grammar will generalize very poorly to sentences that weren't included in training. This is why NLP applications that have tried to parse the Treebank have not simply learned CFG rules from it. The closest technique would be Rens Bod's Data-Oriented Parsing approach, but it is much more sophisticated.
Finally, parsing with the full grammar will be so unbelievably slow that it's useless. So if you want to see this approach in action on the grammar from a single sentence, just to prove that it works, try the following code (after the imports above):
mini_grammar = ContextFreeGrammar(Nonterminal('S'),
                                  treebank.parsed_sents()[0].productions())
parser = nltk.parse.EarleyChartParser(mini_grammar)
print parser.parse(treebank.sents()[0])
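For readers on Python 3, here is a rough equivalent of the single-sentence experiment, assuming NLTK 3. In that version the grammar class is named CFG, and parse() returns an iterator over trees instead of a single tree. The hand-written tree is a hypothetical stand-in for treebank.parsed_sents()[0], so the sketch runs without the corpus data.

```python
from nltk import CFG, Nonterminal, Tree
from nltk.parse import EarleyChartParser

# Hypothetical stand-in for treebank.parsed_sents()[0].
tree = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBD chased) (NP (DT the) (NN cat))))"
)

# Drop repeated productions (e.g. NP -> DT NN occurs twice), keeping order.
productions = list(dict.fromkeys(tree.productions()))
mini_grammar = CFG(Nonterminal("S"), productions)

parser = EarleyChartParser(mini_grammar)

# parse() yields trees lazily, so collect them into a list.
parses = list(parser.parse(tree.leaves()))
for parse in parses:
    print(parse)
```

With the Treebank data installed, replacing the stand-in with treebank.parsed_sents()[0] and parsing treebank.sents()[0] should reproduce the original experiment.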