Question
I have been playing with the NLTK toolkit. I run into this problem a lot and have searched for a solution online, but I have not found a satisfying answer anywhere, so I am asking here.
Many times the NER doesn't tag consecutive NNPs as one NE. I think editing the NER to also use RegexpTagger could improve it.
Example:
Input:
Output:
and
Input:
Output:
Here Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP) is correctly extracted.
So I think that if nltk.ne_chunk is used first and two consecutive trees are then both NNP, there is a high chance that both refer to one entity.
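To make the "consecutive NNPs" idea concrete, here is a minimal sketch that joins each run of adjacent NNP tokens into one candidate entity. It works on already-tagged (word, tag) pairs, so it needs no NLTK models. The function name merge_nnp_runs and the hand-tagged sample are assumptions made for illustration; the sample is not real pos_tag output.

```python
from itertools import groupby


def merge_nnp_runs(tagged_tokens):
    """Join each run of consecutive NNP tokens into one candidate entity."""
    candidates = []
    # groupby splits the sequence into maximal runs of NNP / non-NNP tokens.
    for is_nnp, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            candidates.append(" ".join(word for word, _tag in run))
    return candidates


# Hand-tagged tokens (hypothetical sample, not produced by a tagger).
tagged = [
    ("Vice", "NNP"), ("President", "NNP"), ("Dick", "NNP"), ("Cheney", "NNP"),
    ("spoke", "VBD"), ("in", "IN"), ("Washington", "NNP"), (".", "."),
]
print(merge_nnp_runs(tagged))
```

This treats every NNP run as an entity, so it will also pick up capitalized words that are not names; it is only a baseline to compare against the chunker's output.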
Any suggestions will be really appreciated. I am looking for flaws in my approach.
Thanks.
Recommended answer
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    # Tokenize, POS-tag, and NE-chunk the text.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []  # merged entities, in order of first appearance
    current_chunk = []     # adjacent NE subtrees collected so far
    for node in chunked:
        if isinstance(node, Tree):
            # An NE subtree: add its words to the current run.
            current_chunk.append(" ".join(token for token, pos in node.leaves()))
        elif current_chunk:
            # A non-entity token ends the run: store it once.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
    # Flush a run that reaches the end of the text.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk

txt = "Barack Obama is a great person."
print(get_continuous_chunks(txt))
[out]:
['Barack Obama']
But do note that if adjacent chunks are not supposed to be a single NE, this will combine multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure it happens. If the entities are not adjacent, the script above works fine:
>>> txt = "Barack Obama is the husband of Michelle Obama."
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']