I use

singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6),
                              stop_words=my_stop_words, max_features=50).fit([text])

and I want to know why my features (for example " chat room") contain spaces.

How can I avoid this? Do I need to tokenize and preprocess the text myself?

Best answer

Use analyzer='word'.

When you use char_wb, the vectorizer pads with spaces because it does not tokenize the text into words; it builds character n-grams within word boundaries instead.

From the documentation:


  analyzer : string, {'word', 'char', 'char_wb'} or callable

  Whether the feature should be made of word or character n-grams. Option 'char_wb'
  creates character n-grams only from text inside word boundaries;
  n-grams at the edges of words are padded with space.
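That padding is easy to see in a simplified re-implementation of the behaviour the documentation describes (the real logic lives inside scikit-learn's CountVectorizer; this sketch also ignores the special case of words shorter than n):

```python
def char_wb_ngrams(text, min_n, max_n):
    """Character n-grams within word boundaries, words padded with spaces."""
    ngrams = []
    for w in text.split():
        w = ' ' + w + ' '  # pad each word with a space on both sides
        for n in range(min_n, max_n + 1):
            # slide a window of size n over the padded word
            for i in range(len(w) - n + 1):
                ngrams.append(w[i:i + n])
    return ngrams

print(char_wb_ngrams('chat', 3, 3))
# → [' ch', 'cha', 'hat', 'at ']
```

The n-grams at the start and end of each word carry the padding space, which is exactly where the spaces in your features come from.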


Have a look at the following example to see this in action:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6))
X = vectorizer.fit_transform(corpus)
# on scikit-learn >= 1.0 use get_feature_names_out() instead
print([(len(w), w) for w in vectorizer.get_feature_names()])


Output (each tuple is the feature's length followed by the feature itself; note the padding spaces):


  [(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'),
  (4, ' fir'), (5, ' firs'), (6, ' first'), (4, ' is '), (4, ' one'),
  (5, ' one.'), (6, ' one. '), (4, ' sec'), (5, ' seco'), (6, ' secon'),
  (4, ' the'), (5, ' the '), (4, ' thi'), (5, ' thir'), (6, ' third'),
  (5, ' this'), (6, ' this '), (4, 'and '), (4, 'cond'), (5, 'cond '),
  (4, 'cume'), (5, 'cumen'), (6, 'cument'), (4, 'docu'), (5, 'docum'),
  (6, 'docume'), (4, 'econ'), (5, 'econd'), (6, 'econd '), (4, 'ent '),
  (4, 'ent.'), (5, 'ent. '), (4, 'ent?'), (5, 'ent? '), (4, 'firs'),
  (5, 'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '),
  (4, 'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '),
  (5, 'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '),
  (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'),
  (4, 'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'),
  (5, 'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'),
  (6, 'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'),
  (6, 'ument '), (6, 'ument.'), (6, 'ument?')]

Source: "python - TF-IDF vectorizer includes spaces in feature tokens with char_wb?" on Stack Overflow: https://stackoverflow.com/questions/54308898/
