I use

singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6),
                              stop_words=my_stop_words, max_features=50).fit([text])

and I want to know why my features (for example " chat room") contain spaces.

How can I avoid this? Do I need to tokenize and preprocess the text myself?

Best answer

Use analyzer='word'.

When you use char_wb, the vectorizer pads with spaces because it does not tokenize the text into words; it builds character n-grams within word boundaries instead.

From the documentation:


  analyzer : string, {'word', 'char', 'char_wb'} or callable

  Whether the feature should be made of word or character n-grams. Option 'char_wb'
  creates character n-grams only from text inside word boundaries;
  n-grams at the edges of words are padded with space.
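That padding is easy to see in a simplified re-implementation of the behaviour the documentation describes (the real logic lives inside scikit-learn's CountVectorizer; this sketch also ignores the special case of words shorter than n):

```python
def char_wb_ngrams(text, min_n, max_n):
    """Character n-grams within word boundaries, words padded with spaces."""
    ngrams = []
    for w in text.split():
        w = ' ' + w + ' '  # pad each word with a space on both sides
        for n in range(min_n, max_n + 1):
            # slide a window of size n over the padded word
            for i in range(len(w) - n + 1):
                ngrams.append(w[i:i + n])
    return ngrams

print(char_wb_ngrams('chat', 3, 3))
# → [' ch', 'cha', 'hat', 'at ']
```

The n-grams at the start and end of each word carry the padding space, which is exactly where the spaces in your features come from.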


Have a look at the following example to see this in action:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6))
X = vectorizer.fit_transform(corpus)
# on scikit-learn >= 1.0 use get_feature_names_out() instead
print([(len(w), w) for w in vectorizer.get_feature_names()])


Output (each tuple is the feature's length followed by the feature itself; note the padding spaces):


  [(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'),
  (4, ' fir'), (5, ' firs'), (6, ' first'), (4, ' is '), (4, ' one'),
  (5, ' one.'), (6, ' one. '), (4, ' sec'), (5, ' seco'), (6, ' secon'),
  (4, ' the'), (5, ' the '), (4, ' thi'), (5, ' thir'), (6, ' third'),
  (5, ' this'), (6, ' this '), (4, 'and '), (4, 'cond'), (5, 'cond '),
  (4, 'cume'), (5, 'cumen'), (6, 'cument'), (4, 'docu'), (5, 'docum'),
  (6, 'docume'), (4, 'econ'), (5, 'econd'), (6, 'econd '), (4, 'ent '),
  (4, 'ent.'), (5, 'ent. '), (4, 'ent?'), (5, 'ent? '), (4, 'firs'),
  (5, 'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '),
  (4, 'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '),
  (5, 'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '),
  (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'),
  (4, 'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'),
  (5, 'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'),
  (6, 'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'),
  (6, 'ument '), (6, 'ument.'), (6, 'ument?')]

Source: "python - TF-IDF vectorizer includes spaces in feature tokens with char_wb?" on Stack Overflow: https://stackoverflow.com/questions/54308898/
