When I ran tfidf over a set of documents, it returned a tfidf matrix that looks like this:
(1, 12) 0.656240233446
(1, 11) 0.754552023393
(2, 6) 1.0
(3, 13) 1.0
(4, 2) 1.0
(7, 9) 1.0
(9, 4) 0.742540927053
(9, 5) 0.66980069547
(11, 19) 0.735138466738
(11, 7) 0.677916982176
(12, 18) 1.0
(13, 14) 0.697455191865
(13, 11) 0.716628394177
(14, 5) 1.0
(15, 8) 1.0
(16, 17) 1.0
(18, 1) 1.0
(19, 17) 1.0
(22, 13) 1.0
(23, 3) 1.0
(25, 6) 1.0
(26, 19) 0.476648253537
(26, 7) 0.879094103268
(28, 10) 0.532672175403
(28, 7) 0.523456282204
I want to know what this is, and I cannot understand how it is produced.
When I stepped through in debug mode I came across indices, indptr and data... these things seem to correlate with the data given. What are they?
The numbers confuse me a lot: if I say the first element in the parentheses is the document, going by my guess, then I don't see the 0th, 5th or 6th documents.
Please help me figure out how it works here. I do know the general working of tfidf from the wiki, taking the log of the inverse document frequency and so on. I just want to know what these three different kinds of numbers are and what they refer to.
The source code is:
import re

import joblib
import nltk
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# This contains the list of file names
_filenames = []
# This contains the list of contents/text of the files
_contents = []
# This is a dict of filename:content
_file_contents = {}
# The snippet in the question did not show how _stemmer, _path, _report or the
# FileAccess helper are defined; a Snowball stemmer is assumed here.
# nltk.sent_tokenize/word_tokenize also need the punkt data (nltk.download('punkt')).
_stemmer = SnowballStemmer('english')
_totalvocab_stemmed = []
_totalvocab_tokenized = []
class KmeansClustering():
    def kmeansClusters(self):
        global _report
        self.num_clusters = 5
        km = KMeans(n_clusters=self.num_clusters)
        vocab_frame = TokenizingAndPanda().createPandaVocabFrame()
        self.tfidf_matrix, self.terms, self.dist = TfidfProcessing().getTfidFPropertyData()
        km.fit(self.tfidf_matrix)
        self.clusters = km.labels_.tolist()
        joblib.dump(km, 'doc_cluster2.pkl')
        km = joblib.load('doc_cluster2.pkl')
class TokenizingAndPanda():
    def tokenize_only(self, text):
        '''
        This function tokenizes the text
        :param text: the text that you want to tokenize
        :return: the filtered tokens
        '''
        # first tokenize by sentence, then by word, to ensure that punctuation is caught as its own token
        tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        return filtered_tokens
    def tokenize_and_stem(self, text):
        # first tokenize by sentence, then by word, to ensure that punctuation is caught as its own token
        tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        stems = [_stemmer.stem(t) for t in filtered_tokens]
        return stems
    def getFilnames(self):
        '''
        :return: nothing; fills the global _filenames list
        '''
        global _path
        global _filenames
        path = _path
        _filenames = FileAccess().read_all_file_names(path)

    def getContentsForFilenames(self):
        global _contents
        global _file_contents
        for filename in _filenames:
            content = FileAccess().read_the_contents_from_files(_path, filename)
            _contents.append(content)
            _file_contents[filename] = content
    def createPandaVocabFrame(self):
        global _totalvocab_stemmed
        global _totalvocab_tokenized
        # Enable this if you want to load the filenames and contents from a file structure.
        # self.getFilnames()
        # self.getContentsForFilenames()
        # for name, i in _file_contents.items():
        #     print(name)
        #     print(i)
        for i in _contents:
            allwords_stemmed = self.tokenize_and_stem(i)
            _totalvocab_stemmed.extend(allwords_stemmed)
            allwords_tokenized = self.tokenize_only(i)
            _totalvocab_tokenized.extend(allwords_tokenized)
        vocab_frame = pd.DataFrame({'words': _totalvocab_tokenized}, index=_totalvocab_stemmed)
        print(vocab_frame)
        return vocab_frame
class TfidfProcessing():
    def getTfidFPropertyData(self):
        tfidf_vectorizer = TfidfVectorizer(max_df=0.4, max_features=200000,
                                           min_df=0.02, stop_words='english',
                                           use_idf=True,
                                           tokenizer=TokenizingAndPanda().tokenize_and_stem,
                                           ngram_range=(1, 1))
        # print(_contents)
        tfidf_matrix = tfidf_vectorizer.fit_transform(_contents)
        terms = tfidf_vectorizer.get_feature_names()
        dist = 1 - cosine_similarity(tfidf_matrix)
        return tfidf_matrix, terms, dist
Best answer (from the similar question on Stack Overflow: https://stackoverflow.com/questions/42489589/)
The result of applying tfidf to your data is usually a 2D matrix A, where A_ij is the normalized frequency of the j-th term (word) in the i-th document. What you see in your output is a sparse representation of this matrix; in other words, only the non-zero elements are printed out, so:
(1, 12) 0.656240233446
means that the word with index 12 (according to some vocabulary built by sklearn) has a normalized frequency of 0.656240233446 in the document with index 1. The "missing" entries are zero, meaning for example that the word with index 3 cannot be found in that document (since there is no (1, 3) entry), and so on. Note that both indices count from 0, so row 0 is the first document. The fact that some documents are missing entirely is a result of your particular code/data: maybe you set the vocabulary by hand, or limited the maximum number of features considered? There are many parameters in TfidfVectorizer that could cause it, but without the exact code (and some exemplary data) nothing more can be said.
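To make this concrete, here is a minimal sketch on a made-up toy corpus (the corpus and variable names are illustrative, not taken from the question's code). fit_transform returns a scipy CSR sparse matrix, and the indices, indptr and data you saw in the debugger are simply that matrix's raw storage arrays:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up purely for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat ate the dog food",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # a scipy.sparse CSR matrix

# Printing a sparse matrix lists only its non-zero cells,
# exactly like the "(doc, word)  value" output in the question.
print(tfidf)

# Densify to see the zeros as well: rows are documents, columns are words.
print(tfidf.toarray())

# The attributes seen in the debugger are the raw CSR storage:
print(tfidf.data)     # the non-zero values, stored row by row
print(tfidf.indices)  # the column (word) index of each value in data
print(tfidf.indptr)   # data[indptr[i]:indptr[i+1]] holds the values of row i

For a small corpus, toarray() is the easiest way to inspect the full document-term matrix; for a real corpus the dense form can be far too large, which is why sklearn keeps it sparse.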
For example, setting min_df could cause it (since it drops very rare words), and similarly max_features (same effect). A short sketch of the min_df case follows.
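This is a hedged sketch of that min_df effect, again on a made-up corpus: requiring each kept word to appear in at least two documents throws away every word of a document whose words occur nowhere else, so that document's row becomes all zeros and its index never appears in the sparse printout. That is one way whole documents can "disappear", like documents 0, 5 and 6 in your output.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "apple banana",
    "apple banana cherry",
    "xylophone quasar",  # every word here occurs in this document only
]

# min_df=2: keep only words that appear in at least 2 documents.
vectorizer = TfidfVectorizer(min_df=2)
tfidf = vectorizer.fit_transform(corpus)

print(sorted(vectorizer.vocabulary_))  # ['apple', 'banana'] - the rare words are gone
print(tfidf)  # no "(2, ...)" lines: document 2's row is entirely zero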