Why does the number of stems in the NLTK Stemmer output differ from the expected output?

Problem description

I have to perform Stemming on a text. The questions are as follows :

Tokenize all the words given in tc. A word should contain letters, digits, or underscores. Store the tokenized list of words in tw
Convert all the words into lowercase. Store the result into the variable tw
Remove all the stop words from the unique set of tw. Store the result into the variable fw
Stem each word present in fw with PorterStemmer, and store the result in the list psw

Below is my code:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# tc is the input text supplied by the exercise
pattern = r'\w+'  # letters, digits, or underscores
tw = nltk.regexp_tokenize(tc, pattern)
tw = [word.lower() for word in tw]

# drop stop words (duplicates in tw are kept)
stop_word = set(stopwords.words('english'))
fw = [w for w in tw if w not in stop_word]
# print(sorted(fw))

# stem every remaining word
porter = PorterStemmer()
psw = [porter.stem(word) for word in fw]
print(sorted(psw))

My code works perfectly with all the provided test cases in the hands-on exercise, but it fails only for the test case below, where

My output is:

The expected output is:

Looking for help to resolve the issue.

tc

Recommended answer
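
A likely cause, judging from the task wording: step 3 asks for stop words to be removed from the unique set of tw, while the code in the question filters the full token list. Any word that occurs more than once in tc is then stemmed more than once, so psw ends up longer than expected. The sketch below applies the deduplication before filtering. It is only an illustration under assumptions: the function name stem_unique_words, the sample sentence, and the small STOP_WORDS set are hypothetical, and STOP_WORDS merely stands in for stopwords.words('english') so the example runs without downloading the NLTK stop-word corpus.

```python
from nltk import regexp_tokenize
from nltk.stem import PorterStemmer

# Hypothetical stand-in for set(stopwords.words('english'));
# used here only so the sketch runs without the NLTK corpus download.
STOP_WORDS = {"the", "is", "and", "a", "of"}


def stem_unique_words(tc, stop_words=STOP_WORDS):
    """Tokenize, lowercase, deduplicate, drop stop words, then stem."""
    # Steps 1 and 2: tokens made of letters, digits, or underscores, lowercased
    tw = regexp_tokenize(tc, r'\w+')
    tw = [word.lower() for word in tw]

    # Step 3: filter the *unique set* of tw, not the full token list
    fw = set(tw) - set(stop_words)

    # Step 4: stem each remaining word and collect the results in a list
    porter = PorterStemmer()
    psw = [porter.stem(word) for word in fw]
    return sorted(psw)


# Assumed sample input: "dog" occurs twice in the text
sample = "The dog chased the cat and the dog barked"
result = stem_unique_words(sample)
print(len(result), result)
```

With this sample, "dog" appears once in the result instead of twice. Note that psw can still contain repeated stems when two different words reduce to the same stem; whether those should be kept depends on what the exercise expects, which the task text does not state.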