Problem Description
I'm trying to use Apache Spark for document classification.
For example, I have two classes, C and J.
Training data:
C, Chinese Beijing Chinese
C, Chinese Chinese Shanghai
C, Chinese Macao
J, Tokyo Japan Chinese
And the test data is:
Chinese Chinese Chinese Tokyo Japan // is this J or C?
How can I train on and predict for data like the above? I've done Naive Bayes text classification with Apache Mahout, but not with Apache Spark.
How can I do this with Apache Spark?
Recommended Answer
Yes, it doesn't look like there is a simple tool for this in Spark yet. But you can do it manually: first create a dictionary of terms, then compute the IDF of each term, and finally convert each document into a vector of its TF-IDF scores.
There is a post on http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ that explains how to do it (with some code as well).
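Based on that description, here is a minimal sketch of the manual pipeline (term dictionary, smoothed IDF, TF-IDF vectors, then MLlib's `NaiveBayes`) using the question's data and the RDD-based MLlib API. This is not the code from the linked post: the label encoding (C = 0.0, J = 1.0), the smoothed IDF formula, and all names are illustrative choices of my own.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object TfIdfNaiveBayesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tfidf-naive-bayes").setMaster("local[*]"))

    // Training documents from the question; class "C" encoded as 0.0, "J" as 1.0.
    val train = Seq(
      (0.0, "Chinese Beijing Chinese".split(" ")),
      (0.0, "Chinese Chinese Shanghai".split(" ")),
      (0.0, "Chinese Macao".split(" ")),
      (1.0, "Tokyo Japan Chinese".split(" "))
    )

    // 1. Dictionary: every distinct term gets a fixed vector index.
    val dict: Map[String, Int] = train.flatMap(_._2).distinct.zipWithIndex.toMap

    // 2. Smoothed IDF per term (assumed variant: log((N+1)/(df+1)) + 1, so a
    //    term appearing in every document still keeps a small positive weight).
    val numDocs = train.size.toDouble
    val docFreq = train.flatMap(_._2.distinct).groupBy(identity).mapValues(_.size.toDouble)
    val idf = docFreq.map { case (term, df) =>
      term -> (math.log((numDocs + 1.0) / (df + 1.0)) + 1.0)
    }

    // 3. Turn a token sequence into a dense TF-IDF vector over the dictionary;
    //    adding the IDF once per occurrence accumulates tf * idf.
    def tfIdfVector(tokens: Seq[String]) = {
      val v = new Array[Double](dict.size)
      tokens.foreach(t => dict.get(t).foreach(i => v(i) += idf.getOrElse(t, 0.0)))
      Vectors.dense(v)
    }

    // 4. Train multinomial Naive Bayes with Laplace smoothing (lambda = 1.0).
    val labeled = sc.parallelize(train.map { case (label, tokens) =>
      LabeledPoint(label, tfIdfVector(tokens))
    })
    val model = NaiveBayes.train(labeled, lambda = 1.0)

    // 5. Score the test document; 0.0 maps back to "C", 1.0 to "J".
    val test = "Chinese Chinese Chinese Tokyo Japan".split(" ")
    val predicted = if (model.predict(tfIdfVector(test)) == 0.0) "C" else "J"
    println(s"predicted class: $predicted")

    sc.stop()
  }
}
```

Note that MLlib's Naive Bayes requires nonnegative feature values, which TF-IDF scores satisfy; for larger vocabularies you would typically use sparse vectors (`Vectors.sparse`) instead of dense ones.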