I ran NER with NLTK and spaCy on the following sentence; here are the results:
"Zoni I want to find a pencil, a eraser and a sharpener"
I ran the following code on Google Colab.
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
ex = "Zoni I want to find a pencil, a eraser and a sharpener"
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent
sent = preprocess(ex)
sent
#Output:
[('Zoni', 'NNP'),
('I', 'PRP'),
('want', 'VBP'),
('to', 'TO'),
('find', 'VB'),
('a', 'DT'),
('pencil', 'NN'),
(',', ','),
('a', 'DT'),
('eraser', 'NN'),
('and', 'CC'),
('a', 'DT'),
('sharpener', 'NN')]
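(The code above is only tokenization plus POS tagging; for the actual NER step in NLTK I additionally run ne_chunk on the tagged tokens, roughly as sketched below, assuming the maxent_ne_chunker and words resources are also downloaded.)
nltk.download('maxent_ne_chunker')
nltk.download('words')
# ne_chunk groups the POS-tagged tokens into named-entity chunks
# (PERSON, ORGANIZATION, GPE, ...)
tree = nltk.ne_chunk(sent)
print(tree)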
But when I use spaCy on the same text, it returns nothing.
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
text = "Zoni I want to find a pencil, a eraser and a sharpener"
doc = nlp(text)
doc.ents
#Output:
()
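As a sanity check (just a sketch; the exact component names depend on the spaCy version), I can confirm that the ner component is actually part of the loaded pipeline:
print(nlp.pipe_names)
#Output (roughly): [..., 'ner']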
It only works for some sentences.
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
# text = "Zoni I want to find a pencil, a eraser and a sharpener"
text = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'
doc = nlp(text)
doc.ents
#Output:
(European, Google, $5.1 billion, Wednesday)
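For completeness, the predicted labels can be printed next to the entities (sketch; the output shown is what en_core_web_sm typically gives for this sentence, and exact labels may differ between model versions):
for ent in doc.ents:
    print(ent.text, ' | ', ent.label_)
#Output:
#European  |  NORP
#Google  |  ORG
#$5.1 billion  |  MONEY
#Wednesday  |  DATE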
Please tell me if I am doing something wrong.
Best answer
spaCy's models are statistical, so the named entities they can recognize depend on the datasets the models were trained on.
According to the spaCy documentation, a named entity is a "real-world object" that is assigned a name, for example a person, a country, a product, or a book title.
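You can list exactly which entity types a given pretrained model was trained to predict (a quick sketch using the standard get_pipe API; the label set depends on the model and its version):
import spacy
nlp = spacy.load('en_core_web_lg')
print(nlp.get_pipe('ner').labels)
#Output (for the English OntoNotes-trained models), e.g.:
#('CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC',
# 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT',
# 'QUANTITY', 'TIME', 'WORK_OF_ART')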
For example, the name Zoni is not common, so the model fails to recognize it as a named entity (a person). If I swap Zoni for William while keeping the rest of your sentence unchanged, the model does recognize William as a person.
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp('William I want to find a pencil, a eraser and a sharpener')
for entity in doc.ents:
    print(entity.label_, ' | ', entity.text)
#output
PERSON | William
One would assume that a pencil, an eraser, and a sharpener are objects, so they could potentially be classified as PRODUCT, given that the spaCy documentation counts "objects" among products. However, that does not appear to happen for the three objects in your sentence.
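If you really need those three words to be tagged, one option is to layer a rule-based EntityRuler on top of the statistical pipeline. This is a manual workaround rather than something the pretrained model does by itself, and the sketch below assumes the spaCy v3 add_pipe API:
import spacy
nlp = spacy.load('en_core_web_sm')
# Add a rule-based matcher before the statistical ner component so that
# its matches take precedence over the model's predictions.
ruler = nlp.add_pipe('entity_ruler', before='ner')
ruler.add_patterns([
    {'label': 'PRODUCT', 'pattern': 'pencil'},
    {'label': 'PRODUCT', 'pattern': 'eraser'},
    {'label': 'PRODUCT', 'pattern': 'sharpener'},
])
doc = nlp('Zoni I want to find a pencil, a eraser and a sharpener')
for entity in doc.ents:
    print(entity.label_, ' | ', entity.text)
#Output:
#PRODUCT  |  pencil
#PRODUCT  |  eraser
#PRODUCT  |  sharpener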
I also noticed that if no named entity is found in the input text, the output is simply empty.
import spacy
nlp = spacy.load("en_core_web_lg")
doc = nlp('Zoni I want to find a pencil, a eraser and a sharpener')
if not doc.ents:
    print('No named entities were recognized in the input text.')
else:
    for entity in doc.ents:
        print(entity.label_, ' | ', entity.text)