I am working on a lemmatizer using Python, NLTK and the WordNetLemmatizer.
Here is a small example that outputs what I expected:

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lem = WordNetLemmatizer()
lem.lemmatize('worse', pos=wordnet.ADJ)  # here, we are specifying that 'worse' is an adjective

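Output: 'bad'
lem.lemmatize('worse', pos=wordnet.ADV)  # here, we are specifying that 'worse' is an adverb

Output: 'worse'
So far, everything is fine. The behaviour is the same for other adjectives such as 'better' (an irregular form) or 'older'. (Note that the same test with 'elder' never outputs 'old', but I guess WordNet is not an exhaustive list of all existing English words.)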

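My question comes up when trying the word 'further':
lem.lemmatize('further', pos=wordnet.ADJ)  # as an adjective

Output: 'further'
lem.lemmatize('further', pos=wordnet.ADV)  # as an adverb

Output: 'far'
This is the exact opposite of the behaviour for the word 'worse'!

Can anybody explain why? Is it a bug in the WordNet synset data, or does it come from my misunderstanding of English grammar?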

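Please excuse me if this question has already been answered. I searched Google and SO, but with the keyword "further" I only find unrelated noise, because the word is so common...

Thank you in advance,
Romain G.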

Best answer

WordNetLemmatizer uses the ._morphy function to look up a word's candidate lemmas and returns the shortest one (from http://www.nltk.org/_modules/nltk/stem/wordnet.html):

def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word
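
To see why 'worse' becomes 'bad' once both candidates are available, here is a minimal standalone sketch of that selection step. The helper name pick_lemma and the candidate lists are made up for illustration; they are not part of NLTK.

```python
def pick_lemma(word, candidates):
    """Return the shortest candidate, or the word itself if there are none."""
    # Same selection logic as the lemmatize() body quoted above.
    return min(candidates, key=len) if candidates else word


# Hypothetical candidate lists, as _morphy might return them.
shortest = pick_lemma("worse", ["worse", "bad"])  # both forms are known
fallback = pick_lemma("worse", [])                # nothing found: word is returned unchanged
print(shortest, fallback)
```

Note that min() returns the first of several equally short candidates, so the order of the list matters when lengths tie.
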
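The ._morphy function applies rules iteratively to get a lemma; each rule shortens the word by replacing a suffix according to MORPHOLOGICAL_SUBSTITUTIONS. After each pass it keeps only the candidates that exist in the WordNet database for the given part of speech: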
def _morphy(self, form, pos):
    # from jordanbg:
    # Given an original string x
    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    # 2. Return all that are in the database
    # 3. If there are no matches, keep applying rules until you either
    #    find a match or you can't go any further

    exceptions = self._exception_map[pos]
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

    def apply_rules(forms):
        return [form[:-len(old)] + new
                for form in forms
                for old, new in substitutions
                if form.endswith(old)]

    def filter_forms(forms):
        result = []
        seen = set()
        for form in forms:
            if form in self._lemma_pos_offset_map:
                if pos in self._lemma_pos_offset_map[form]:
                    if form not in seen:
                        result.append(form)
                        seen.add(form)
        return result

    # 0. Check the exception lists
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    forms = apply_rules([form])

    # 2. Return all that are in the database (and check the original too)
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. If there are no matches, keep applying rules until we find a match
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results

    # Return an empty list if we can't find anything
    return []
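
The method above depends on WordNet data, so here is a simplified, self-contained sketch of the same control flow that can be run without NLTK. The word set, the suffix rules and the exception table are toy values I made up for illustration; the real ones live in NLTK and the WordNet files.

```python
# Toy stand-ins (assumptions, not the real WordNet data).
KNOWN_ADJECTIVES = {"old", "bad", "worse", "nice"}
ADJ_SUBSTITUTIONS = [("er", ""), ("est", ""), ("er", "e"), ("est", "e")]
ADJ_EXCEPTIONS = {"worse": ["bad"]}


def toy_morphy(form):
    """Mimic the _morphy control flow for adjectives on the toy data."""

    def apply_rules(forms):
        # Replace every matching suffix with its substitution.
        return [f[:-len(old)] + new
                for f in forms
                for old, new in ADJ_SUBSTITUTIONS
                if f.endswith(old)]

    def filter_forms(forms):
        # Keep known words only, without duplicates, preserving order.
        result, seen = [], set()
        for f in forms:
            if f in KNOWN_ADJECTIVES and f not in seen:
                result.append(f)
                seen.add(f)
        return result

    # 0. The exception table short-circuits the suffix rules.
    if form in ADJ_EXCEPTIONS:
        return filter_forms([form] + ADJ_EXCEPTIONS[form])

    # 1-2. Apply the rules once and check the original form too.
    forms = apply_rules([form])
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. Keep applying rules until something matches or nothing is left.
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results
    return []


print(toy_morphy("older"))  # found through the suffix rules
print(toy_morphy("worse"))  # found through the exception table
```

With this toy data, 'older' is resolved by stripping 'er', while 'worse' never reaches the rules because it is listed as an exception.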

However, if the word is in the exception list, ._morphy returns the fixed value stored in exceptions; see _load_exception_map in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html :
def _load_exception_map(self):
    # load the exception file data into memory
    for pos, suffix in self._FILEMAP.items():
        self._exception_map[pos] = {}
        for line in self.open('%s.exc' % suffix):
            terms = line.split()
            self._exception_map[pos][terms[0]] = terms[1:]
    self._exception_map[ADJ_SAT] = self._exception_map[ADJ]
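
The parsing step is easy to try in isolation. This is a small sketch, assuming each line of a .exc file has the form "inflected lemma [lemma ...]"; the function name parse_exception_lines is hypothetical, and it reads from a list of strings instead of the real files.

```python
def parse_exception_lines(lines):
    """Map the first term of each line to the list of remaining terms."""
    table = {}
    for line in lines:
        terms = line.split()
        if not terms:
            continue  # skip blank lines (a guard the quoted code does not have)
        table[terms[0]] = terms[1:]
    return table


# Sample lines in the same shape as the .exc entries.
adv_table = parse_exception_lines(["further far", "better well", ""])
print(adv_table)
```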

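Going back to your example, worse -> bad and further -> far CANNOT be produced by the rules, so they have to come from the exception list. Since it is an exception list, there are bound to be inconsistencies.

The exception lists are kept in ~/nltk_data/corpora/wordnet/adj.exc and ~/nltk_data/corpora/wordnet/adv.exc.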

adv.exc :
best well
better well
deeper deeply
farther far
further far
harder hard
hardest hard

adj.exc :
...
worldliest worldly
wormier wormy
wormiest wormy
worse bad
worst bad
worthier worthy
worthiest worthy
wrier wry
...
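
To make the asymmetry concrete, here is a small sketch that looks a word up in two toy tables built only from the lines quoted above. The real files contain many more entries (the adj.exc excerpt is elided), so a None here only means "not in this excerpt". The names ADJ_EXC, ADV_EXC and exception_lookup are mine, not NLTK's.

```python
# Toy tables copied from the excerpts above (incomplete by construction).
ADJ_EXC = {"worse": ["bad"], "worst": ["bad"]}
ADV_EXC = {"further": ["far"], "farther": ["far"],
           "better": ["well"], "best": ["well"]}


def exception_lookup(word):
    """Return the exception entry for the word in each toy table, or None."""
    return {"adj": ADJ_EXC.get(word), "adv": ADV_EXC.get(word)}


worse_entry = exception_lookup("worse")
further_entry = exception_lookup("further")
print(worse_entry)
print(further_entry)
```

In these excerpts 'worse' has an entry only on the adjective side and 'further' only on the adverb side, which mirrors the outputs reported in the question.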

Source: the original Stack Overflow question about lemmatizing the word 'further' with Python NLTK and WordNet: https://stackoverflow.com/questions/22999273/
