Problem Description
I'm trying to use Apache Spark for document classification.
For example, I have two classes, C and J.
Training data:
C, Chinese Beijing Chinese
C, Chinese Chinese Shanghai
C, Chinese Macao
J, Tokyo Japan Chinese
And the test data is:
Chinese Chinese Chinese Tokyo Japan // is this J or C?
How can I train on and predict for data like the above? I've done Naive Bayes text classification with Apache Mahout, but not with Apache Spark.
How can I do this with Apache Spark?
Recommended Answer
Yes, it doesn't look like there is a simple tool for this in Spark yet. But you can do it manually: first create a dictionary of terms, then compute the IDF of each term, and finally convert each document into a vector of its TF-IDF scores.
There is a post on http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ that explains how to do it (with some code as well).
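Based on that description, here is a minimal sketch of the manual pipeline (term dictionary, smoothed IDF, TF-IDF vectors, then MLlib's `NaiveBayes`) using the question's data and the RDD-based MLlib API. This is not the code from the linked post: the label encoding (C = 0.0, J = 1.0), the smoothed IDF formula, and all names are illustrative choices of my own.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object TfIdfNaiveBayesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tfidf-naive-bayes").setMaster("local[*]"))

    // Training documents from the question; class "C" encoded as 0.0, "J" as 1.0.
    val train = Seq(
      (0.0, "Chinese Beijing Chinese".split(" ")),
      (0.0, "Chinese Chinese Shanghai".split(" ")),
      (0.0, "Chinese Macao".split(" ")),
      (1.0, "Tokyo Japan Chinese".split(" "))
    )

    // 1. Dictionary: every distinct term gets a fixed vector index.
    val dict: Map[String, Int] = train.flatMap(_._2).distinct.zipWithIndex.toMap

    // 2. Smoothed IDF per term (assumed variant: log((N+1)/(df+1)) + 1, so a
    //    term appearing in every document still keeps a small positive weight).
    val numDocs = train.size.toDouble
    val docFreq = train.flatMap(_._2.distinct).groupBy(identity).mapValues(_.size.toDouble)
    val idf = docFreq.map { case (term, df) =>
      term -> (math.log((numDocs + 1.0) / (df + 1.0)) + 1.0)
    }

    // 3. Turn a token sequence into a dense TF-IDF vector over the dictionary;
    //    adding the IDF once per occurrence accumulates tf * idf.
    def tfIdfVector(tokens: Seq[String]) = {
      val v = new Array[Double](dict.size)
      tokens.foreach(t => dict.get(t).foreach(i => v(i) += idf.getOrElse(t, 0.0)))
      Vectors.dense(v)
    }

    // 4. Train multinomial Naive Bayes with Laplace smoothing (lambda = 1.0).
    val labeled = sc.parallelize(train.map { case (label, tokens) =>
      LabeledPoint(label, tfIdfVector(tokens))
    })
    val model = NaiveBayes.train(labeled, lambda = 1.0)

    // 5. Score the test document; 0.0 maps back to "C", 1.0 to "J".
    val test = "Chinese Chinese Chinese Tokyo Japan".split(" ")
    val predicted = if (model.predict(tfIdfVector(test)) == 0.0) "C" else "J"
    println(s"predicted class: $predicted")

    sc.stop()
  }
}
```

Note that MLlib's Naive Bayes requires nonnegative feature values, which TF-IDF scores satisfy; for larger vocabularies you would typically use sparse vectors (`Vectors.sparse`) instead of dense ones.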