Word Sense Disambiguation: An NLP Project Experiment

This project mainly uses the pywsd library (https://github.com/alvations/pywsd) to perform word sense disambiguation.

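Part of this library has since been ported to NLTK. For better WSD performance, use the pywsd library rather than the NLTK module; in general, simple_lesk() from pywsd performs better than NLTK's lesk. I will try to update the NLTK module when I have time. This document mainly covers how to use the original pywsd library.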

1. Techniques used:

Lesk algorithms
- Original Lesk (Lesk, 1986)
- Adapted/Extended Lesk (Banerjee and Pedersen, 2002/2003)
- Simple Lesk (with definition, example(s) and hyper+hyponyms)
- Cosine Lesk (use cosines to calculate overlaps instead of using raw counts)
Maximizing similarity (see also Pedersen et al. (2003))
- Path similarity (Wu-Palmer, 1994; Leacock and Chodorow, 1998)
- Information Content (Resnik, 1995; Jiang and Conrath, 1997; Lin, 1998)
Baselines
- Random sense
- First NLTK sense
- Highest lemma counts
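
To make the difference between raw-count overlap and cosine overlap in the Lesk family above more concrete, here is a minimal sketch. It does not use pywsd: the two glosses in SENSES, the stop word set, and the function names are all made up for illustration, whereas a real system would take its glosses from WordNet.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy sense inventory (illustrative glosses, not WordNet entries).
SENSES = {
    "bank.financial": "an institution that accepts deposits of money and makes loans",
    "bank.river": "sloping land beside a river or lake",
}
# Tiny illustrative stop word set.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "that", "or", "i", "my"}


def tokens(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def overlap_score(context, gloss):
    """Raw-count overlap: number of distinct words shared by context and gloss."""
    return len(set(tokens(context)) & set(tokens(gloss)))


def cosine_score(context, gloss):
    """Cosine similarity between the bag-of-words vectors of context and gloss."""
    a, b = Counter(tokens(context)), Counter(tokens(gloss))
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def lesk(context, senses, score=overlap_score):
    """Return the sense key whose gloss scores highest against the context."""
    return max(senses, key=lambda s: score(context, senses[s]))


sentence = "I went to the bank to deposit my money"
print(lesk(sentence, SENSES))                      # raw-count overlap
print(lesk(sentence, SENSES, score=cosine_score))  # cosine overlap
```

Because there is no stemming, "deposit" and "deposits" do not match here; only "money" is shared with the financial gloss. The extended variants listed above enlarge the gloss with examples and related synsets to reduce this kind of sparsity.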

2. Usage:

Installation:

pip install -U nltk
python -m nltk.downloader 'popular'
pip install -U pywsd

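Usage:

from pywsd.lesk import simple_lesk   # import simple_lesk from pywsd
sent = 'I went to the bank to deposit my money'  # sentence containing the ambiguous word
ambiguous = 'bank'              # the ambiguous word
answer = simple_lesk(sent, ambiguous, pos='n')   # disambiguate the word in this sentence, as a noun
print(answer.definition())         # print the definition of the chosen sense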

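3. How it works

Word sense disambiguation (WSD) is the task of deciding which sense of an ambiguous word is meant in a given context. The Lesk algorithm is one of the principal algorithms for WSD.

The Lesk variant implemented in this section is a frequency-based discriminator weighted by TF-IDF. Its main steps are: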

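Remove stop words.
For every remaining word in the sentence other than the target word, compute its TF-IDF value against the corpus of each sense.
Sum these values per sense and compare the sums: the sense with the largest sum is taken as the meaning of the word in the sentence.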
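
Written as a formula, the score that these steps compute can be sketched as follows. The notation is introduced here for illustration and is read off the example script in this section: C is the set of context words left after removing stop words and the target word, tf(w, s) is the number of times w occurs in the corpus of sense s, df(w) is the number of corpus sentences (over all senses) that contain w, and N is the total number of corpus sentences.

```latex
\[
\mathrm{score}(s) = \sum_{w \in C} \mathrm{tf}(w, s)\,
    \log_2 \frac{N}{1 + \mathrm{df}(w)},
\qquad
\hat{s} = \arg\max_{s}\, \mathrm{score}(s)
\]
```

The 1 in the denominator keeps the ratio finite for a word that appears in no corpus sentence.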

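The following uses the NBA's Rockets (火箭, which also means "rocket") as an example to sketch a simple implementation of this Lesk-style algorithm: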

import os
import jieba
from math import log2

# Read the corpus for one sense (one sentence per line)
def read_file(path):
    with open(path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

# Segment the example sentence
sent = '赛季初的时候，火箭是众望所归的西部决赛球队。'
wsd_word = '火箭'

jieba.add_word(wsd_word)
sent_words = list(jieba.cut(sent, cut_all=False))

# Remove stop words (the target word itself is treated as a stop word)
stopwords = [wsd_word, '我', '你', '它', '他', '她', '了', '是', '的', '啊', '谁', '什么', '都',
             '很', '个', '之', '人', '在', '上', '下', '左', '右', '。', '，', '！', '？']

sent_cut = [word for word in sent_words if word not in stopwords]

print(sent_cut)


# Load the corpus of each sense: every file in the current directory whose
# name contains the target word, e.g. 火箭_NBA球队名.txt
wsd_dict = {}
for file in os.listdir('.'):
    if wsd_word in file:
        wsd_dict[file.replace('.txt', '')] = read_file(file)

# Count how often each context word occurs in the corpus of each sense (TF)
tf_dict = {}
for meaning, sents in wsd_dict.items():
    tf_dict[meaning] = []
    for word in sent_cut:
        word_count = 0
        for line in sents:
            example = list(jieba.cut(line, cut_all=False))
            word_count += example.count(word)

        if word_count:
            tf_dict[meaning].append((word, word_count))

# Count the corpus sentences, over all senses, that contain each context word
# (document frequency; note this is a substring match on the raw sentence)
idf_dict = {}
for word in sent_cut:
    document_count = 0
    for meaning, sents in wsd_dict.items():
        for line in sents:
            if word in line:
                document_count += 1

    idf_dict[word] = document_count

# Total number of sentences across all sense corpora
total_document = 0
for meaning, sents in wsd_dict.items():
    total_document += len(sents)

# Compute the TF-IDF values and sum them per sense
mean_tf_idf = []
for k, v in tf_dict.items():
    print(k+':')
    tf_idf_sum = 0
    for word, tf in v:
        tf_idf = tf*log2(total_document/(1+idf_dict[word]))
        tf_idf_sum += tf_idf
        # Prints "<word>, frequency: <tf>, TF-IDF value: <tf_idf>"
        print('%s, 频数为: %s, TF-IDF值为: %s'% (word, tf, tf_idf))

    mean_tf_idf.append((k, tf_idf_sum))

# The sense with the largest TF-IDF sum wins; file names look like <word>_<sense>
sort_array = sorted(mean_tf_idf, key=lambda x:x[1], reverse=True)
true_meaning = sort_array[0][0].split('_')[1]
# Prints "After disambiguation, <word> in this sentence means <sense>."
print('\n经过词义消岐，%s在该句子中的意思为 %s .' % (wsd_word, true_meaning))

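The output is as follows (频数 is the frequency, TF-IDF值 is the TF-IDF value; the two senses are 燃气推进装置, "gas-propelled device", and NBA球队名, "NBA team name"; the last line reads "After disambiguation, 火箭 in this sentence means NBA team name"):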

['赛季', '初', '时候', '众望所归', '西部', '决赛', '球队']
火箭_燃气推进装置:
初, 频数为: 2, TF-IDF值为: 12.49585502688717
火箭_NBA球队名:
赛季, 频数为: 63, TF-IDF值为: 204.6194333469459
初, 频数为: 1, TF-IDF值为: 6.247927513443585
时候, 频数为: 1, TF-IDF值为: 8.055282435501189
西部, 频数为: 16, TF-IDF值为: 80.88451896801904
决赛, 频数为: 7, TF-IDF值为: 33.13348038429679
球队, 频数为: 40, TF-IDF值为: 158.712783770034


经过词义消岐，火箭在该句子中的意思为 NBA球队名 .

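Another example:

Input sentence: 三十多年前，战士们在戈壁滩白手起家，建起了我国的火箭发射基地。 ("More than thirty years ago, soldiers started from scratch in the Gobi Desert and built our country's rocket launch base.")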

['三十多年', '前', '战士', '们', '戈壁滩', '白手起家', '建起', '我国', '发射', '基地']
火箭_燃气推进装置:
前, 频数为: 2, TF-IDF值为: 9.063440958888354
们, 频数为: 1, TF-IDF值为: 6.05528243550119
我国, 频数为: 3, TF-IDF值为: 22.410959804340102
发射, 频数为: 89, TF-IDF值为: 253.27878721862933
基地, 频数为: 7, TF-IDF值为: 42.38697704850833
火箭_NBA球队名:
前, 频数为: 3, TF-IDF值为: 13.59516143833253
们, 频数为: 1, TF-IDF值为: 6.05528243550119

经过词义消岐，火箭在该句子中的意思为 燃气推进装置 .

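Overview: the input passage or sentence is first segmented into words, and the stop words and the target word are dropped. For each sense of the target word, the remaining words are counted in that sense's corpus and weighted by TF-IDF. The sense whose TF-IDF sum is largest is taken as the best fit for the sentence.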
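
The overview can be condensed into one function. The sketch below is not the script above: it assumes the corpora are already tokenized (so it needs no jieba), it counts document frequency on tokens rather than by substring match, and the function name score_senses and the toy English corpora are invented for illustration.

```python
from math import log2


def score_senses(context_words, corpora):
    """Sum TF-IDF weights of the context words for each sense.

    corpora maps a sense name to a list of tokenized sentences.
    """
    total_sentences = sum(len(sents) for sents in corpora.values())
    # Document frequency: sentences, over all senses, that contain the word.
    doc_freq = {
        w: sum(1 for sents in corpora.values() for s in sents if w in s)
        for w in context_words
    }
    scores = {}
    for sense, sents in corpora.items():
        total = 0.0
        for w in context_words:
            tf = sum(s.count(w) for s in sents)  # occurrences in this sense's corpus
            total += tf * log2(total_sentences / (1 + doc_freq[w]))
        scores[sense] = total
    return scores


# Hypothetical toy corpora, four tokenized sentences per sense.
toy = {
    "rocket_vehicle": [["rocket", "launch", "base"], ["launch", "engine", "fuel"],
                       ["rocket", "orbit"], ["fuel", "tank"]],
    "rocket_team": [["team", "season", "playoffs"], ["season", "coach"],
                    ["team", "win"], ["coach", "trade"]],
}
scores = score_senses(["launch", "base"], toy)
best = max(scores, key=scores.get)
print(best, scores)
```

With very small corpora the logarithm can become zero or negative for a word that occurs in almost every sentence, so the weights are only meaningful when each sense has a reasonably large corpus.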

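4. Improvements

The code itself leaves room for only small optimizations; improvements to the algorithm can bring much larger gains, for example the improved Lesk algorithm described in http://www.doc88.com/p-9959426974439.html.

One weakness of this Lesk approach is that the choice of sense is easily thrown off when two senses receive the same TF-IDF sum, that is, when the weights tie.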
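
One simple way to at least surface the tie problem is to detect it instead of silently taking the first sense after sorting. The helper below is a sketch under that assumption; pick_sense and its tolerance parameter are hypothetical names, and returning None is just one possible policy (a caller could instead fall back to a baseline such as the most frequent sense).

```python
def pick_sense(scores, tolerance=1e-9):
    """Return the single best-scoring sense, or None if the top score is tied.

    scores maps a sense name to its summed TF-IDF value.
    """
    if not scores:
        return None
    best = max(scores.values())
    winners = [s for s, v in scores.items() if abs(v - best) <= tolerance]
    return winners[0] if len(winners) == 1 else None


print(pick_sense({"sense_a": 3.0, "sense_b": 1.0}))  # clear winner
print(pick_sense({"sense_a": 2.0, "sense_b": 2.0}))  # tie, no decision
```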