Problem description
I would like to find the most relevant words across a set of documents.
I would like to run a TF-IDF algorithm over 3 documents and get back a CSV file containing each word and its frequency.
After that, I will keep only the words with a high value and use those.
I found this implementation, which does what I need: https://github.com/mccurdyc/tf-idf/.
I call that jar using the subprocess library. But there is a huge problem with that code: it makes a lot of mistakes when analyzing words. It merges some words together, and it has problems with ' and - (I think). I am running it over the text of 3 books (Harry Potter) and, for example, I am getting words such as hermiones, hermionell, riddlehermione, thinghermione instead of just hermione in the CSV file.
Am I doing something wrong? Can you give me a working implementation of the TF-IDF algorithm? Is there a Python library that does that?
Recommended answer
Here is an implementation of the TF-IDF algorithm using scikit-learn. Before applying it, you can word_tokenize() and stem your words.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

# word_tokenize may need NLTK tokenizer data, e.g. nltk.download('punkt')
stemmer = PorterStemmer()

def tokenize(text):
    # split the text into tokens, then reduce each token to its stem
    tokens = word_tokenize(text)
    return [stemmer.stem(item) for item in tokens]

# your corpus
text = ["This is your first text book", "This is the third text for analysis", "This is another text"]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).toarray()
# transform the matrix to a pandas df
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names_out())
# sum each word's scores over all documents (axis=0), highest first
top_words = matrix.sum(axis=0).sort_values(ascending=False)
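The question also asks for a CSV file and mentions trouble with ' and -. As a rough illustration only, here is a dependency-free sketch of the textbook weighting tf * ln(N / df), summed over documents and written out as CSV. The names simple_tokenize, tf_idf_totals and write_top_words, the sample corpus, and the choice to treat apostrophes and hyphens as separators are all assumptions made for this example. scikit-learn's TfidfVectorizer uses a smoothed, normalized variant by default, so its numbers will differ from these.

```python
import csv
import io
import math
import re
from collections import Counter


def simple_tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of 2+ ASCII letters.
    # Apostrophes and hyphens act as separators, so "Hermione's" -> "hermione".
    return re.findall(r"[a-z]{2,}", text.lower())


def tf_idf_totals(documents):
    """Return {word: tf-idf summed over all documents}, using tf * ln(N / df)."""
    tokenized = [simple_tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    totals = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            totals[word] += tf * idf
    return dict(totals)


def write_top_words(scores, stream, min_score=0.0):
    """Write words scoring above min_score to a text stream as CSV, highest first."""
    rows = sorted(
        ((word, score) for word, score in scores.items() if score > min_score),
        key=lambda pair: (-pair[1], pair[0]),
    )
    writer = csv.writer(stream)
    writer.writerow(["word", "score"])
    writer.writerows(rows)
    return rows


# Made-up sample corpus, standing in for the three books.
corpus = [
    "Hermione's book was on the table",
    "Harry saw the half-blood prince",
    "The book belonged to Harry",
]
scores = tf_idf_totals(corpus)

# An in-memory stream keeps the sketch side-effect free; to write a real file,
# pass open("top_words.csv", "w", newline="", encoding="utf-8") instead.
output = io.StringIO()
top_rows = write_top_words(scores, output)
print(output.getvalue())
```

With this weighting, a word that occurs in every document (here "the") scores zero and is filtered out by the default min_score, which is one way to keep only the words with a high value.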