regex - Named entity recognition with regular expressions: NLTK

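I have been using the NLTK toolkit. I run into this problem often and have searched online for a solution, but I have not found a satisfactory answer, so I am posting my question here.

Many times the NER does not tag consecutive NNPs as one NE. I think editing the NER to use RegexpTagger could also improve it.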

Example:

Input:

Output:

However,

Input:

Output:

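Here Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP) is extracted correctly.

So I think that if nltk.ne_chunk is used first, and two consecutive trees are then both NNP, there is a high chance that they refer to a single entity.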
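
A rough sketch of that idea, working on POS tags alone rather than on the ne_chunk trees. The helper name merge_consecutive_nnp is hypothetical, and the tagged list below is hand-written in the (token, tag) shape that pos_tag returns, not actual tagger output:

```python
from itertools import groupby


def merge_consecutive_nnp(tagged_tokens):
    """Join each run of adjacent NNP/NNPS tokens into one candidate entity."""
    entities = []
    # groupby splits the sentence into alternating runs of proper nouns
    # and everything else; only the proper-noun runs are kept.
    for is_proper, run in groupby(tagged_tokens,
                                  key=lambda pair: pair[1] in ("NNP", "NNPS")):
        if is_proper:
            entities.append(" ".join(token for token, _ in run))
    return entities


# Hand-written tags for illustration only (an assumption, not pos_tag output).
tagged = [("Vice", "NNP"), ("President", "NNP"), ("Dick", "NNP"),
          ("Cheney", "NNP"), ("spoke", "VBD"), ("today", "NN"), (".", ".")]
print(merge_consecutive_nnp(tagged))  # ['Vice President Dick Cheney']
```

This also shows one possible flaw: a title and a name that sit next to each other come out as a single candidate entity.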

Any suggestions would be much appreciated. I am looking for flaws in my approach.

Thanks.

Best answer

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    # Tokenize, POS-tag, then run NLTK's default named-entity chunker.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if isinstance(i, Tree):
            # An NE subtree: add its tokens to the current run.
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            # A non-NE token ends the run: store the merged entity once.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Always reset, even when the entity was already seen.
            current_chunk = []

    # Flush a run that reaches the end of the sentence.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = "Barack Obama is a great person."
print(get_continuous_chunks(txt))

[out]:

['Barack Obama']

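Note, however, that if consecutive chunks are not supposed to be a single NE, this will merge multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure it can happen. If the chunks are not contiguous, though, the script above works fine: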

>>> txt = "Barack Obama is the husband of Michelle Obama."
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
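
One way to keep an eye on the over-merging caveat is to carry the chunk labels along with the merged string, so that mixed-label merges can be inspected afterwards. The following is only a sketch under assumptions: merge_with_labels is a hypothetical helper, the input is a flattened list of (text, label) pairs with None for ordinary tokens (something one could build from the ne_chunk output), and the labels shown are made up for illustration:

```python
from itertools import groupby


def merge_with_labels(chunks):
    """Merge adjacent labeled chunks, keeping the label of every part."""
    merged = []
    # A label of None marks an ordinary token, which ends the current run.
    for is_entity, run in groupby(chunks, key=lambda c: c[1] is not None):
        if is_entity:
            run = list(run)
            text = " ".join(part for part, _ in run)
            labels = [label for _, label in run]
            merged.append((text, labels))
    return merged


# Illustrative labels only (an assumption, not real chunker output).
chunks = [("Barack", "PERSON"), ("Obama", "ORGANIZATION"), ("is", None),
          ("a", None), ("great", None), ("person", None), (".", None)]
for text, labels in merge_with_labels(chunks):
    # More than one distinct label means the merge deserves a second look.
    print(text, labels, "mixed" if len(set(labels)) > 1 else "uniform")
```

Whether a mixed-label merge is a tagging error or two genuinely separate entities still has to be decided case by case; the sketch only makes those cases visible.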