I added lemmatization to my CountVectorizer, as described on this Sklearn page.

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer,
                       strip_accents = 'unicode',
                       stop_words = 'english',
                       lowercase = True,
                       token_pattern = r'\b[a-zA-Z]{3,}\b', # keeps words of 3 or more characters
                       max_df = 0.5,
                       min_df = 10)

However, when creating the dtm with fit_transform, I get the following error, which I can't make sense of. Before I added lemmatization to my vectorizer, the dtm code always worked. I dug deeper into the documentation and experimented with the code, but couldn't find a solution.
dtm_tf = tf_vectorizer.fit_transform(articles)

Update:

After following @MaxU's advice, the code runs without errors, but my output does not ignore digits and punctuation. I ran individual tests to see which of the other arguments still work alongside LemmaTokenizer(). The results:
strip_accents = 'unicode', # works
stop_words = 'english', # works
lowercase = True, # works
token_pattern = r'\b[a-zA-Z]{3,}\b', # does not work
max_df = 0.5, # works
min_df = 10 # works

Apparently it is only token_pattern that stops having any effect; CountVectorizer ignores token_pattern when a custom tokenizer is supplied. Here is the updated, working code without token_pattern (I just had to download the 'punkt' and 'wordnet' NLTK packages first):
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
                                strip_accents = 'unicode', # works
                                stop_words = 'english', # works
                                lowercase = True, # works
                                max_df = 0.5, # works
                                min_df = 10) # works
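
With the instance passed in, the original call from above now runs as intended (a usage sketch; articles is the same corpus variable as before):

dtm_tf = tf_vectorizer.fit_transform(articles)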

For those who want to remove digits, punctuation, and words of fewer than 3 characters (but don't know how), here is an approach that worked for me when working from a Pandas dataframe:
# when working from Pandas dataframe

df['TEXT'] = df['TEXT'].str.replace(r'\d+', '', regex=True)          # remove digits
df['TEXT'] = df['TEXT'].str.replace(r'\b\w{1,2}\b', '', regex=True)  # remove words of 1-2 characters
df['TEXT'] = df['TEXT'].str.replace(r'[^\w\s]', '', regex=True)      # remove punctuation
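
Since the 3-plus-character filter can no longer be expressed via token_pattern, an alternative to cleaning the dataframe beforehand is to do the filtering inside the tokenizer itself. A minimal sketch of my own (FilteringLemmaTokenizer is a hypothetical name, not from the original post):

import re
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

class FilteringLemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        # keep only alphabetic tokens of 3 or more characters,
        # mirroring the intent of token_pattern = r'\b[a-zA-Z]{3,}\b'
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)
                if re.fullmatch(r'[a-zA-Z]{3,}', t)]

This keeps all token-level filtering in one place, so the vectorizer itself needs no token_pattern at all.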

Best Answer

It should be:

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
# NOTE:                        ---------------------->  ^^

instead of:
tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer,
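
The difference matters because CountVectorizer calls tokenizer(doc) once per document. With an instance, that call goes to __call__ as intended; with the class itself, each document is handed to __init__, which takes no such argument. A quick check (the sample sentence is my own):

tok = LemmaTokenizer()
tok("The cats sat on the mats")              # returns a token list, e.g. 'cats' -> 'cat'
LemmaTokenizer("The cats sat on the mats")   # TypeError: the document ends up in __init__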

A similar question about python - Sklearn: adding lemmatizer to CountVectorizer can be found on Stack Overflow: https://stackoverflow.com/questions/47423854/
