In the Stanford Dependency Manual they mention "Stanford typed dependencies", in particular the type "neg" - the negation modifier. It is also available when using the Stanford Enhanced++ parser on the website. For example, for the sentence:
"Barack Obama was not born in Hawaii"
the parser indeed finds neg(born, not).
But when I use the stanfordnlp python library, the only dependency parser I can get parses the sentence as follows:
('Barack', '5', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '5', 'aux:pass')
('not', '5', 'advmod')
('born', '0', 'root')
('in', '7', 'case')
('Hawaii', '5', 'obl')
And the code that generates it:
import stanfordnlp

stanfordnlp.download('en')    # download the English neural models
nlp = stanfordnlp.Pipeline()  # build the default neural pipeline
doc = nlp("Barack Obama was not born in Hawaii")
a = doc.sentences[0]
a.print_dependencies()
Is there a way to get results similar to the enhanced dependency parser, or any other Stanford parser that produces typed dependencies, and would thus give me the negation modifier?
Best answer
It is worth noting that the python library stanfordnlp is not just a python wrapper for StanfordCoreNLP.
1. Difference between stanfordnlp and CoreNLP
As said on the stanfordnlp Github repo:
The Stanford NLP Group's official Python NLP library. It contains packages for running our latest fully neural pipeline from the CoNLL 2018 Shared Task and for accessing the Java Stanford CoreNLP server.
stanfordnlp contains a new set of neural network models, trained on the CoNLL 2018 shared task. The online parser, on the other hand, is based on the CoreNLP 3.9.2 java library, as explained here. Those are two different pipelines with two different sets of models.
Your code only accesses their neural pipeline trained on the CoNLL 2018 data, which explains the differences you saw compared to the online version. Those are basically two different models.
What adds to the confusion is that both repositories belong to the user named stanfordnlp (which is also the team name). Don't be fooled between the java stanfordnlp/CoreNLP and the python stanfordnlp/stanfordnlp.
Concerning your "neg" issue: it seems that in the python library stanfordnlp they decided to annotate negation with "advmod" altogether. At least that is what I ran into for a few example sentences.
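Since the neural pipeline labels negation as "advmod", one workaround is to filter advmod dependents whose text is a negation word. Here is a minimal sketch (my own heuristic, not an official API; the NEG_WORDS list is an assumption you may want to adjust):

import stanfordnlp

nlp = stanfordnlp.Pipeline()  # assumes the English models are already downloaded
doc = nlp("Barack Obama was not born in Hawaii")

NEG_WORDS = {"not", "n't", "never", "no"}  # heuristic list of negation words
sentence = doc.sentences[0]
for word in sentence.words:
    if word.dependency_relation == 'advmod' and word.text.lower() in NEG_WORDS:
        # word.governor is the 1-based index of the head word
        governor = sentence.words[int(word.governor) - 1].text
        print('neg({}, {})'.format(governor, word.text))
# expected output: neg(born, not)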
2. Using CoreNLP via the stanfordnlp package
However, you can still access CoreNLP through the stanfordnlp package. It requires a few more steps, though. Citing the Github repo:
There are a few initial setup steps.
Download Stanford CoreNLP and the models for the language you wish to use (you can download CoreNLP and the language models here).
Put the model jars in the distribution folder.
Tell the python code where Stanford CoreNLP is located: export CORENLP_HOME=/path/to/stanford-corenlp-full-2018-10-05
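Before starting a client, it does not hurt to verify from python that the environment variable is actually set (a small sanity check of my own, not part of the repo instructions):

import os

# CORENLP_HOME must point to the unzipped CoreNLP distribution folder
corenlp_home = os.environ.get('CORENLP_HOME')
if not corenlp_home or not os.path.isdir(corenlp_home):
    raise RuntimeError('Set CORENLP_HOME, e.g. '
                       'export CORENLP_HOME=/path/to/stanford-corenlp-full-2018-10-05')
print('Using CoreNLP at', corenlp_home)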
Once that is done, you can start a client with the code found in the demo:
from stanfordnlp.server import CoreNLPClient

text = "Barack Obama was not born in Hawaii."  # the demo assumes a `text` variable is defined

with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)
    # get the first sentence
    sentence = ann.sentence[0]
    # get the dependency parse of the first sentence
    print('---')
    print('dependency parse of first sentence')
    dependency_parse = sentence.basicDependencies
    print(dependency_parse)
    # get the tokens of the first sentence
    # note that 1 token is 1 node in the parse tree, nodes start at 1
    print('---')
    print('Tokens of first sentence')
    for token in sentence.token:
        print(token)
So your sentence will be parsed if you specify the 'depparse' annotator (as well as its prerequisite annotators tokenize, ssplit and pos).
Reading the demo, it feels like we can only access basicDependencies. I have not managed to make Enhanced++ dependencies work via stanfordnlp.
But the negation still shows up if you use basicDependencies!
Here is the output I obtained with stanfordnlp and your example sentence. It is a DependencyGraph object, nothing pretty, but unfortunately that is always the case when we use the very deep CoreNLP tools. You will see that between nodes 4 and 5 ("not" and "born") there is an edge "neg".
node {
  sentenceIndex: 0
  index: 1
}
node {
  sentenceIndex: 0
  index: 2
}
node {
  sentenceIndex: 0
  index: 3
}
node {
  sentenceIndex: 0
  index: 4
}
node {
  sentenceIndex: 0
  index: 5
}
node {
  sentenceIndex: 0
  index: 6
}
node {
  sentenceIndex: 0
  index: 7
}
node {
  sentenceIndex: 0
  index: 8
}
edge {
  source: 2
  target: 1
  dep: "compound"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 2
  dep: "nsubjpass"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 3
  dep: "auxpass"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 4
  dep: "neg"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 7
  dep: "nmod"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 8
  dep: "punct"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 7
  target: 6
  dep: "case"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
root: 5
---
Tokens of first sentence
word: "Barack"
pos: "NNP"
value: "Barack"
before: ""
after: " "
originalText: "Barack"
beginChar: 0
endChar: 6
tokenBeginIndex: 0
tokenEndIndex: 1
hasXmlContext: false
isNewline: false
word: "Obama"
pos: "NNP"
value: "Obama"
before: " "
after: " "
originalText: "Obama"
beginChar: 7
endChar: 12
tokenBeginIndex: 1
tokenEndIndex: 2
hasXmlContext: false
isNewline: false
word: "was"
pos: "VBD"
value: "was"
before: " "
after: " "
originalText: "was"
beginChar: 13
endChar: 16
tokenBeginIndex: 2
tokenEndIndex: 3
hasXmlContext: false
isNewline: false
word: "not"
pos: "RB"
value: "not"
before: " "
after: " "
originalText: "not"
beginChar: 17
endChar: 20
tokenBeginIndex: 3
tokenEndIndex: 4
hasXmlContext: false
isNewline: false
word: "born"
pos: "VBN"
value: "born"
before: " "
after: " "
originalText: "born"
beginChar: 21
endChar: 25
tokenBeginIndex: 4
tokenEndIndex: 5
hasXmlContext: false
isNewline: false
word: "in"
pos: "IN"
value: "in"
before: " "
after: " "
originalText: "in"
beginChar: 26
endChar: 28
tokenBeginIndex: 5
tokenEndIndex: 6
hasXmlContext: false
isNewline: false
word: "Hawaii"
pos: "NNP"
value: "Hawaii"
before: " "
after: ""
originalText: "Hawaii"
beginChar: 29
endChar: 35
tokenBeginIndex: 6
tokenEndIndex: 7
hasXmlContext: false
isNewline: false
word: "."
pos: "."
value: "."
before: ""
after: ""
originalText: "."
beginChar: 35
endChar: 36
tokenBeginIndex: 7
tokenEndIndex: 8
hasXmlContext: false
isNewline: false
3. Using CoreNLP via the nltk package
I will not go into details on this one, but there is also a solution to access the CoreNLP server via the NLTK library, if all else fails. It does output the negations, but requires a little more work to start the servers. Details on this page.
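For what it is worth, here is a minimal sketch of that route (assuming NLTK is installed and a CoreNLP server is already running on localhost:9000; the startup command in the comment must be run separately from the CoreNLP folder):

from nltk.parse.corenlp import CoreNLPDependencyParser

# assumes a server was started separately, e.g.:
# java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
parser = CoreNLPDependencyParser(url='http://localhost:9000')
parse, = parser.raw_parse("Barack Obama was not born in Hawaii.")
for governor, dep, dependent in parse.triples():
    # governor and dependent are (word, POS tag) pairs
    print(dep, governor, dependent)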
EDIT
I thought I could also share with you the code to get the DependencyGraph into a nice list of "dependency, argument1, argument2", in a shape similar to what stanfordnlp outputs.
from stanfordnlp.server import CoreNLPClient

text = "Barack Obama was not born in Hawaii."

# set up the client
with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)
    # get the first sentence
    sentence = ann.sentence[0]
    # get the dependency parse of the first sentence
    dependency_parse = sentence.basicDependencies

    #print(dir(sentence.token[0])) #to find all the attributes and methods of a Token object
    #print(dir(dependency_parse)) #to find all the attributes and methods of a DependencyGraph object
    #print(dir(dependency_parse.edge))

    # get a dictionary associating each token/node with its label
    token_dict = {}
    for i in range(0, len(sentence.token)):
        token_dict[sentence.token[i].tokenEndIndex] = sentence.token[i].word

    # get a list of the dependencies with the words they connect
    list_dep = []
    for i in range(0, len(dependency_parse.edge)):
        source_node = dependency_parse.edge[i].source
        source_name = token_dict[source_node]
        target_node = dependency_parse.edge[i].target
        target_name = token_dict[target_node]
        dep = dependency_parse.edge[i].dep
        list_dep.append((dep,
                         str(source_node) + '-' + source_name,
                         str(target_node) + '-' + target_name))
    print(list_dep)
It outputs the following:
[('compound', '2-Obama', '1-Barack'), ('nsubjpass', '5-born', '2-Obama'), ('auxpass', '5-born', '3-was'), ('neg', '5-born', '4-not'), ('nmod', '5-born', '7-Hawaii'), ('punct', '5-born', '8-.'), ('case', '7-Hawaii', '6-in')]
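From there, extracting the negation is just a filter over list_dep:

# keep only the negation edges
negations = [d for d in list_dep if d[0] == 'neg']
print(negations)  # [('neg', '5-born', '4-not')]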