Problem description
I want to calculate the TF (Term Frequency) and the IDF (Inverse Document Frequency) of documents that are stored in HBase.
I also want to save the calculated TF in one HBase table and the calculated IDF in another HBase table.
Can you guide me through this?
I have looked at BayesTfIdfDriver from Mahout 0.4, but it has not given me a head start.
The outline of a solution is pretty straightforward:
- Do a word count over your HBase tables, storing both term frequency and document frequency for each word.
- In your reduce phase, aggregate the term frequency and document frequency for each word.
- Given a count of your documents, scan through your aggregated results one more time and calculate the IDF based on the document frequency.
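The three steps above can be sketched in memory as follows. This is only an illustration of the data flow, not a MapReduce job: the `documents` dict is a hypothetical stand-in for rows scanned from HBase, the function names are made up for this example, and writing the results to two HBase tables is left out. It also assumes whitespace tokenization, raw counts for TF, and an unsmoothed natural-log IDF.

```python
import math
from collections import Counter, defaultdict


def map_document(doc_id, text):
    """Map step: emit (term, (doc_id, count)) for each distinct term in one document."""
    counts = Counter(text.lower().split())  # assumption: naive whitespace tokenization
    for term, count in counts.items():
        yield term, (doc_id, count)


def reduce_terms(mapped):
    """Reduce step: group by term, keeping per-document TF and the document frequency."""
    tf = defaultdict(dict)  # term -> {doc_id: raw count in that document}
    for term, (doc_id, count) in mapped:
        tf[term][doc_id] = count
    # Document frequency = number of distinct documents containing the term.
    df = {term: len(postings) for term, postings in tf.items()}
    return dict(tf), df


def compute_idf(df, num_docs):
    """Final pass: turn document frequencies into IDF values."""
    return {term: math.log(num_docs / n) for term, n in df.items()}


# Hypothetical sample data standing in for documents stored in HBase.
documents = {
    "doc1": "hbase stores rows",
    "doc2": "hbase scans rows quickly",
    "doc3": "mahout builds models",
}

mapped = [pair for doc_id, text in documents.items()
          for pair in map_document(doc_id, text)]
tf_table, df = reduce_terms(mapped)           # would go to the TF table
idf_table = compute_idf(df, len(documents))   # would go to the IDF table
```

In a real job, `tf_table` and `idf_table` would each be written to its own HBase table, for example keyed by term, but that schema is a design choice the answer does not specify.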
The Wikipedia page on TF-IDF is a good reference for the details of the formula: http://en.wikipedia.org/wiki/Tf*idf
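For quick reference, one common variant of the formula is shown below. It assumes raw counts for TF and an unsmoothed logarithmic IDF; other weightings exist, so treat this as one option rather than the definition a particular library uses. Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the total number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$.

```latex
\mathrm{tf}(t, d) = f_{t,d}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

For example, with $N = 3$ documents and a term that appears in 2 of them, $\mathrm{idf}(t) = \log(3/2) \approx 0.405$ using the natural logarithm.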