I am using

singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6),
                              stop_words=my_stop_words, max_features=50).fit([text])
and I want to know why my features (e.g. 'chat room') contain spaces.
How can I avoid this? Do I need to tokenize and preprocess the text myself?
Best Answer
Use analyzer='word'.

When analyzer='char_wb' is used, the vectorizer pads the features with spaces because it does not tokenize the text into word features; it builds character n-grams within word boundaries.
According to the documentation:

analyzer : string, {'word', 'char', 'char_wb'} or callable
    Whether the feature should be made of word or character n-grams.
    Option 'char_wb' creates character n-grams only from text inside
    word boundaries; n-grams at the edges of words are padded with space.
See the following example to understand the behavior:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6))
X = vectorizer.fit_transform(corpus)
# Note: get_feature_names() was renamed to get_feature_names_out() in scikit-learn 1.0
print([(len(w), w) for w in vectorizer.get_feature_names()])
Output:
[(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'),
 (4, ' fir'), (5, ' firs'), (6, ' first'), (4, ' is '), (4, ' one'),
 (5, ' one.'), (6, ' one. '), (4, ' sec'), (5, ' seco'), (6, ' secon'),
 (4, ' the'), (5, ' the '), (4, ' thi'), (5, ' thir'), (6, ' third'),
 (5, ' this'), (6, ' this '), (4, 'and '), (4, 'cond'), (5, 'cond '),
 (4, 'cume'), (5, 'cumen'), (6, 'cument'), (4, 'docu'), (5, 'docum'),
 (6, 'docume'), (4, 'econ'), (5, 'econd'), (6, 'econd '), (4, 'ent '),
 (4, 'ent.'), (5, 'ent. '), (4, 'ent?'), (5, 'ent? '), (4, 'firs'),
 (5, 'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '),
 (4, 'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '),
 (5, 'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '),
 (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'),
 (4, 'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'),
 (5, 'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'),
 (6, 'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'),
 (6, 'ument '), (6, 'ument.'), (6, 'ument?')]
Regarding "python - TF-IDF Vectorizer includes spaces in feature words with char_wb?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54308898/