python - Understanding text feature extraction with TfidfVectorizer in Python scikit-learn

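Reading the scikit-learn documentation on text feature extraction, I am not sure how the different parameters available for TfidfVectorizer (and possibly the other vectorizers) affect the result.

Here are the arguments whose behavior I am unsure about: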

TfidfVectorizer(stop_words='english',  ngram_range=(1, 2), max_df=0.5, min_df=20, use_idf=True)

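Regarding the use of stop_words/max_df, the documentation is clear (both have a similar effect, and one can perhaps be used in place of the other). However, I am not sure whether these options should be used together with ngrams. Which one is applied first, ngrams or stop_words? And why? Based on my experiments, stop words are removed first, but the purpose of ngrams is to extract phrases, etc. I am not sure about the effect of this sequence (stop words removed, then ngrams built).

Second, does it make sense to use the max_df/min_df arguments together with the use_idf argument? Aren't their purposes very similar?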

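Best answer

I see several questions in this post.

You really have to use it a lot to develop an intuition for it (that has been my experience, anyway).

TfidfVectorizer is a bag-of-words approach. In NLP, sequences of words and their windows matter, and a bag of words destroys some of that context.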

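How do I control which tokens get output?

Set ngram_range to (1, 1) to output only one-word tokens, (1, 2) for one-word and two-word tokens, (2, 3) for two-word and three-word tokens, and so on.
ngram_range works hand in hand with analyzer. Set analyzer to "word" to output words and phrases, or set it to "char" to output character ngrams.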

If you want your output to have both "word" and "char" features, use sklearn's FeatureUnion. The original answer links to an example.
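
As a rough illustration of the FeatureUnion idea (this is not the linked example, just a sketch under the assumption that the two vectorizers are simply concatenated side by side), the step names "word" and "char" below are arbitrary labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

docs = ["what is tfidf", "what does tfidf stand for"]

# Two vectorizers over the same raw text; FeatureUnion stacks their columns.
union = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
])
combined = union.fit_transform(docs)

# Look up the fitted steps by name to see how many columns each contributes.
steps = dict(union.transformer_list)
n_word = len(steps["word"].vocabulary_)
n_char = len(steps["char"].vocabulary_)

print(combined.shape, n_word, n_char)
```

The combined matrix has one row per document and as many columns as the word and character vocabularies together.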

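How do I remove unwanted stuff?

Use stop_words to remove less meaningful English words.

The stop word list that sklearn uses can be found at:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

The logic behind removing stop words is that these words do not carry much meaning and appear a lot in most text: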

[('the', 79808),
 ('of', 40024),
 ('and', 38311),
 ('to', 28765),
 ('in', 22020),
 ('a', 21124),
 ('that', 12512),
 ('he', 12401),
 ('was', 11410),
 ('it', 10681),
 ('his', 10034),
 ('is', 9773),
 ('with', 9739),
 ('as', 8064),
 ('i', 7679),
 ('had', 7383),
 ('for', 6938),
 ('at', 6789),
 ('by', 6735),
 ('on', 6639)]

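Since stop words generally have a high frequency, it might make sense to set max_df to a float such as 0.95, which drops terms that appear in more than 95% of the documents. But then you are assuming that everything that common is a stop word, which might not be the case. It really depends on your text data. In my line of work, it is very common that the top words or phrases are NOT stop words, because I work with dense text (search query data) on very specific topics.

Use min_df as an integer to remove rare words. If they only occur once or twice, they will not add much value and are usually really obscure. Furthermore, there are generally a lot of them, so ignoring them with, say, min_df=5 can greatly reduce your memory consumption and data size.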

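How do I include stuff that is being stripped out?

By default, token_pattern uses the regex pattern \b\w\w+\b, which means tokens have to be at least 2 characters long, so words like "I" and "a" are removed, and so are single-digit numbers such as 0-9. You will also notice that it drops apostrophes, because an apostrophe is not a word character.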
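
The pattern can be tried on its own with the standard re module. This sketch only imitates the tokenization step (the vectorizer also lowercases the text first by default), and the sample sentence is made up:

```python
import re

# Default token_pattern of the scikit-learn text vectorizers.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = "why don't i use tfidf, 1 in 10 people do"
tokens = re.findall(DEFAULT_TOKEN_PATTERN, text)

# Single-character tokens ("i", "1", and the "t" left over from "don't") never match.
print(tokens)
```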

Let's do a little test.

import numpy as np
import pandas as pd

# ENGLISH_STOP_WORDS is imported from sklearn.feature_extraction.text;
# the old sklearn.feature_extraction.stop_words module has been removed.
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

docs = np.array(['what is tfidf',
        'what does tfidf stand for',
        'what is tfidf and what does it stand for',
        'tfidf is what',
        "why don't I use tfidf",
        '1 in 10 people use tfidf'])

# use_idf=False and norm=None leave the raw term counts untouched
tfidf = TfidfVectorizer(use_idf=False, norm=None, ngram_range=(1, 1))
matrix = tfidf.fit_transform(docs).toarray()

# get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn releases
df = pd.DataFrame(matrix, index=docs, columns=tfidf.get_feature_names_out())

# Show each document with stop words removed, using a plain whitespace split
for doc in docs:
    print(' '.join(word for word in doc.split() if word not in ENGLISH_STOP_WORDS))

This prints out:

tfidf
does tfidf stand
tfidf does stand
tfidf
don't I use tfidf
1 10 people use tfidf

Now let's print df:

                                           10  and  does  don  for   in   is  \
what is tfidf                             0.0  0.0   0.0  0.0  0.0  0.0  1.0
what does tfidf stand for                 0.0  0.0   1.0  0.0  1.0  0.0  0.0
what is tfidf and what does it stand for  0.0  1.0   1.0  0.0  1.0  0.0  1.0
tfidf is what                             0.0  0.0   0.0  0.0  0.0  0.0  1.0
why don't I use tfidf                     0.0  0.0   0.0  1.0  0.0  0.0  0.0
1 in 10 people use tfidf                  1.0  0.0   0.0  0.0  0.0  1.0  0.0

                                           it  people  stand  tfidf  use  \
what is tfidf                             0.0     0.0    0.0    1.0  0.0
what does tfidf stand for                 0.0     0.0    1.0    1.0  0.0
what is tfidf and what does it stand for  1.0     0.0    1.0    1.0  0.0
tfidf is what                             0.0     0.0    0.0    1.0  0.0
why don't I use tfidf                     0.0     0.0    0.0    1.0  1.0
1 in 10 people use tfidf                  0.0     1.0    0.0    1.0  1.0

                                          what  why
what is tfidf                              1.0  0.0
what does tfidf stand for                  1.0  0.0
what is tfidf and what does it stand for   2.0  0.0
tfidf is what                              1.0  0.0
why don't I use tfidf                      0.0  1.0
1 in 10 people use tfidf                   0.0  0.0

Notes:

With use_idf=False, norm=None set, this is equivalent to using sklearn's CountVectorizer. It will just return counts.
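
The equivalence is easy to check directly. A minimal sketch on two made-up documents, with both vectorizers otherwise left at their defaults:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["what is tfidf", "what is tfidf and what does it stand for"]

counts = CountVectorizer().fit_transform(docs).toarray()
raw_tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs).toarray()

# Same values; only the dtype differs (integer counts vs. floats).
same = np.array_equal(counts, raw_tf)
print(same)
```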

Notice that the word "don't" was converted to "don". This is where you would change token_pattern to something like token_pattern=r"\b\w[\w']+\b" to include apostrophes.

We see a lot of stop words.

Let's remove the stop words and look at df again:

tfidf = TfidfVectorizer(use_idf=False, norm=None, stop_words='english', ngram_range=(1, 2))

Outputs:

                                           10  10 people  does  does stand  \
what is tfidf                             0.0        0.0   0.0         0.0
what does tfidf stand for                 0.0        0.0   1.0         0.0
what is tfidf and what does it stand for  0.0        0.0   1.0         1.0
tfidf is what                             0.0        0.0   0.0         0.0
why don't I use tfidf                     0.0        0.0   0.0         0.0
1 in 10 people use tfidf                  1.0        1.0   0.0         0.0

                                          does tfidf  don  don use  people  \
what is tfidf                                    0.0  0.0      0.0     0.0
what does tfidf stand for                        1.0  0.0      0.0     0.0
what is tfidf and what does it stand for         0.0  0.0      0.0     0.0
tfidf is what                                    0.0  0.0      0.0     0.0
why don't I use tfidf                            0.0  1.0      1.0     0.0
1 in 10 people use tfidf                         0.0  0.0      0.0     1.0

                                          people use  stand  tfidf  \
what is tfidf                                    0.0    0.0    1.0
what does tfidf stand for                        0.0    1.0    1.0
what is tfidf and what does it stand for         0.0    1.0    1.0
tfidf is what                                    0.0    0.0    1.0
why don't I use tfidf                            0.0    0.0    1.0
1 in 10 people use tfidf                         1.0    0.0    1.0

                                          tfidf does  tfidf stand  use  \
what is tfidf                                    0.0          0.0  0.0
what does tfidf stand for                        0.0          1.0  0.0
what is tfidf and what does it stand for         1.0          0.0  0.0
tfidf is what                                    0.0          0.0  0.0
why don't I use tfidf                            0.0          0.0  1.0
1 in 10 people use tfidf                         0.0          0.0  1.0

                                          use tfidf
what is tfidf                                   0.0
what does tfidf stand for                       0.0
what is tfidf and what does it stand for        0.0
tfidf is what                                   0.0
why don't I use tfidf                           1.0
1 in 10 people use tfidf                        1.0

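Takeaways:

The token "don use" appeared because in don't I use the 't was stripped off and I, being shorter than two characters, was removed, so the remaining words were joined into don use... which was not actually the original structure and could change the meaning a bit!

Answer: short tokens and stop words are removed first, and only then are the ngrams generated, which can return unexpected results.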
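
One way to observe this ordering without building a matrix is to call the vectorizer's own analyzer on a single string. A minimal sketch, using build_analyzer() with the same stop_words and ngram_range settings as above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable that does preprocessing, tokenization,
# stop word filtering, and ngram generation for one document.
analyzer = CountVectorizer(stop_words="english", ngram_range=(1, 2)).build_analyzer()
tokens = analyzer("why don't I use tfidf")

print(tokens)
```

The bigram "don use" shows up even though those two words were never adjacent in the input, because the ngrams are built from whatever tokens survive the earlier filtering.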

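I think the whole purpose of term frequency-inverse document frequency is to allow re-weighting of the frequent words (the words that would appear at the top of a sorted frequency list). This re-weighting takes the highest-frequency ngrams and moves them down the list to a lower position. Therefore, it is supposed to handle the max_df scenarios.

Whether you want to move them down the list ("re-weight"/de-prioritize them) or remove them completely may be more of a personal choice.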

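I use min_df a lot, and it makes sense to use min_df if you are working with a huge dataset, because rare words will not add value and will just cause a lot of processing issues. I do not use max_df much, but I am sure there are scenarios, such as working with data like all of Wikipedia, where it may make sense to remove the top x%.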

Regarding "python - Understanding text feature extraction with TfidfVectorizer in Python scikit-learn", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47557417/