python - nltk StanfordNERTagger : How to get proper nouns without capitalization

I am trying to use StanfordNERTagger and nltk to extract keywords from a piece of text.

import re
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

docText="John Donk works for POI. Brian Jones wants to meet with Xyz Corp. for measuring POI's Short Term performance Metrics."

words = re.split(r"\W+",docText)

stops = set(stopwords.words("english"))

# remove stop words from the list
words = [w for w in words if w not in stops and len(w) > 2]

str = " ".join(words)
print str
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
stanfordPosTagList=[word for word,pos in stp.tag(str.split()) if pos == 'NNP']

print "Stanford POS Tagged"
print stanfordPosTagList
tagged = stn.tag(stanfordPosTagList)
print tagged

This gives me

John Donk works POI Brian Jones wants meet Xyz Corp measuring POI Short Term performance Metrics
Stanford POS Tagged
[u'John', u'Donk', u'POI', u'Brian', u'Jones', u'Xyz', u'Corp', u'POI', u'Short', u'Term']
[(u'John', u'PERSON'), (u'Donk', u'PERSON'), (u'POI', u'ORGANIZATION'), (u'Brian', u'ORGANIZATION'), (u'Jones', u'ORGANIZATION'), (u'Xyz', u'ORGANIZATION'), (u'Corp', u'ORGANIZATION'), (u'POI', u'O'), (u'Short', u'O'), (u'Term', u'O')]

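Clearly, things like Short and Term are being tagged as NNP. The data I have contains many such instances where non-NNP words are capitalized. This may be due to typos, or because they are headings. I don't have much control over this.

How can I parse or clean up the data so that a non-NNP term is detected as such even when it is capitalized? I don't want terms like Short and Term to be classified as NNP.

Also, I'm not sure why John Donk was captured as a person while Brian Jones was not. Could it be due to the other capitalized non-NNPs in my data? Does that affect how the StanfordNERTagger treats everything else?

Update: one possible solution

Here is what I plan to do: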

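Take each word and convert it to lowercase

Tag the lowercased word

If the tag is NNP, then we know the original word must be an NNP too

If not, then the original word was miscapitalized

Here is what I tried: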

str = " ".join(words)
print str
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
for word in str.split():
    wl = word.lower()
    print wl
    w,pos = stp.tag(wl)
    print pos
    if pos=="NNP":
        print "Got NNP"
        print w

But this gives me an error:

John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics
john
Traceback (most recent call last):
  File "X:\crp.py", line 37, in <module>
    w,pos = stp.tag(wl)
ValueError: too many values to unpack

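I tried several approaches, but I always get some error. How do I tag a single word?

I don't want to convert the whole string to lowercase and then tag it. If I do that, StanfordPOSTagger returns an empty string.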
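
A note on the traceback above, offered as a hedged sketch rather than a tested fix: NLTK taggers take a list of tokens in `tag()`, so passing the bare string `wl` appears to make the tagger treat each character as its own token and return one `(token, tag)` pair per character, which cannot be unpacked into two names. `StubTagger` below is a hypothetical stand-in that only mimics that interface (with a dummy tag) so the example runs without Java or the Stanford models; with the real `StanfordPOSTagger` the call shape would be the same.

```python
class StubTagger:
    """Hypothetical stand-in mimicking the tag(tokens) interface of NLTK taggers."""

    def tag(self, tokens):
        # Iterate over whatever is passed in and return one (token, tag)
        # pair per item. The "NN" tag is a dummy, not a real prediction.
        return [(token, "NN") for token in tokens]


tagger = StubTagger()

# Passing a bare string: the string is iterated character by character.
per_char = tagger.tag("john")
print(per_char)  # one pair per character, so two-name unpacking fails

# Passing a one-element list: exactly one pair comes back.
word, pos = tagger.tag(["john"])[0]
print(word, pos)
```

Wrapping the word in a list and taking the first result fixes the unpacking error, but it does not make tagging isolated lowercased words reliable, since the tagger then sees no sentence context.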

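Best answer

First, see this other question for how to set up Stanford CoreNLP so it can be called from the command line or from Python: nltk : How to prevent stemming of proper nouns.

For the properly cased sentence, we see that NER works as expected: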

>>> from corenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ('John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics. '
... 'john donk works poi jones wants meet xyz corp measuring poi short term performance metrics')
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner',  'outputFormat': 'json'})
>>> annotated_sent0 = output['sentences'][0]
>>> annotated_sent1 = output['sentences'][1]
>>> for token in annotated_sent0['tokens']:
...     print token['word'], token['lemma'], token['pos'], token['ner']
...
John John NNP PERSON
Donk Donk NNP PERSON
works work VBZ O
POI POI NNP ORGANIZATION
Jones Jones NNP ORGANIZATION
wants want VBZ O
meet meet VB O
Xyz Xyz NNP ORGANIZATION
Corp Corp NNP ORGANIZATION
measuring measure VBG O
POI poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
. . . O

For the lowercased sentence, you get neither NNP as a POS tag nor any NER tags:

>>> for token in annotated_sent1['tokens']:
...     print token['word'], token['lemma'], token['pos'], token['ner']
...
john john NN O
donk donk JJ O
works work NNS O
poi poi VBP O
jones jone NNS O
wants want VBZ O
meet meet VB O
xyz xyz NN O
corp corp NN O
measuring measure VBG O
poi poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
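
To make the contrast between the two outputs easier to see, here is a minimal sketch that merges consecutive tokens sharing the same non-`O` NER tag into entity spans. The helper name `entity_spans` is hypothetical, and the token dicts are a hand-copied subset of the fields printed above, not a live server response.

```python
from itertools import groupby


def entity_spans(tokens):
    """Merge consecutive tokens sharing a non-'O' NER tag into (text, tag) spans."""
    spans = []
    for tag, group in groupby(tokens, key=lambda tok: tok["ner"]):
        if tag != "O":
            spans.append((" ".join(tok["word"] for tok in group), tag))
    return spans


# Hand-copied subset of the token fields shown in the outputs above.
cased = [
    {"word": "John", "pos": "NNP", "ner": "PERSON"},
    {"word": "Donk", "pos": "NNP", "ner": "PERSON"},
    {"word": "works", "pos": "VBZ", "ner": "O"},
    {"word": "POI", "pos": "NNP", "ner": "ORGANIZATION"},
    {"word": "Jones", "pos": "NNP", "ner": "ORGANIZATION"},
    {"word": "wants", "pos": "VBZ", "ner": "O"},
]
lowercased = [
    {"word": "john", "pos": "NN", "ner": "O"},
    {"word": "donk", "pos": "JJ", "ner": "O"},
    {"word": "works", "pos": "NNS", "ner": "O"},
]

print(entity_spans(cased))
print(entity_spans(lowercased))
```

Grouping by adjacency is deliberately naive: it glues "POI" and "Jones" into one ORGANIZATION span because the tagger gave both the same label, which also shows how the missing stop words and punctuation blur entity boundaries. The lowercased tokens yield no spans at all.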

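So the questions you should be asking are:

What is the ultimate goal of your NLP application?

Why is your input lowercased? Is that your own doing, or is that how the data was provided?

After answering those questions, you can move on to deciding what you really want to do with the NER tags, i.e.
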

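If the input is lowercased because of how you structured your NLP tool chain, then

DO NOT do that!!! Perform NER on the normal text, without the distortions you have introduced. The NER model was trained on normal text, so it won't really work outside the context of normal text.

Also try not to mix NLP tools from different suites. They usually don't play well together, especially towards the end of your NLP tool chain.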

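If the input is lowercased because that is how the original data came, then:

Annotate a small portion of the data, or find annotated data that is lowercased, and then retrain a model.

Work around it: train a truecaser on normal text, then apply the truecasing model to the lowercased text. See https://www.cs.cmu.edu/~llita/papers/lita.truecasing-acl2003.pdf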

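If the input has erroneous casing, e.g. some words are capitalized and some are not, but not all of the capitalized ones are proper nouns, then

Try the truecasing solution too.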
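
To give a rough idea of what truecasing means in practice, here is a minimal sketch of a unigram baseline: for each word, restore the casing it most often had in normally cased training text. This is a swapped-in simplification, much simpler than the approach in the linked paper, and the function names `train_truecaser` and `truecase` as well as the three training sentences are made up for illustration.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Count surface forms per lowercased word, skipping sentence-initial tokens."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # The first token is capitalized by convention, so it says little
        # about the word's own casing; skip it.
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return counts


def truecase(tokens, counts):
    """Replace each token by its most frequent casing seen in training."""
    restored = []
    for token in tokens:
        forms = counts.get(token.lower())
        # Words never seen in training are left exactly as they came in.
        restored.append(forms.most_common(1)[0][0] if forms else token)
    return restored


# Toy, invented training text standing in for a large normally cased corpus.
training = [
    "Yesterday John Donk met Brian Jones",
    "The short term performance metrics improved",
    "Analysts praised Brian Jones for the short term results",
]
counts = train_truecaser(training)

noisy = "brian jones reviews Short Term performance Metrics".split()
print(truecase(noisy, counts))
```

On this toy data the miscapitalized "Short", "Term" and "Metrics" are lowercased and "brian jones" is capitalized, after which the text can be passed to the POS and NER taggers. A real truecaser would need far more training text and some use of context, since many words are legitimately cased both ways.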

Regarding python - nltk StanfordNERTagger : How to get proper nouns without capitalization, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34439208/