I am trying to implement a naive Bayesian approach to find the topic of a given document or stream of words. Is there a naive Bayesian approach that I might be able to look up for this?
Also, I am trying to improve my dictionary as I go along. Initially, I have a bunch of words that map to topics (hard-coded). Depending on the occurrences of words other than the ones that are already mapped, I want to add those words to the mappings, thereby learning new words that map to a topic and also updating the word probabilities.
How should I go about doing this? Is my approach the right one?
Which programming language would be best suited for the implementation?
You would probably be better off just using one of the existing packages that support document classification using naive Bayes, e.g.:
Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
Perl - Perl has the Algorithm::NaiveBayes module, complete with a sample usage snippet in the package synopsis.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J. You can see a training and scoring code snippet here.
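If it helps to see what these packages compute under the hood, here is a minimal sketch of a multinomial naive Bayes classifier in plain Python. It is not the API of any of the libraries above; the function names `train` and `classify` and the toy training data are made up for illustration, and it assumes documents are already split into lists of words.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count documents per topic and word occurrences per topic."""
    topic_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, topic in labeled_docs:
        topic_counts[topic] += 1
        word_counts[topic].update(words)
        vocab.update(words)
    return topic_counts, word_counts, vocab


def classify(words, model):
    """Return the topic with the highest log-posterior for the given words."""
    topic_counts, word_counts, vocab = model
    total_docs = sum(topic_counts.values())
    best_topic, best_score = None, float("-inf")
    for topic, n_docs in topic_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the topic
        total_words = sum(word_counts[topic].values())
        for word in words:
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Laplace (add-one) smoothing avoids log(0) for unseen word/topic pairs
            score += math.log(
                (word_counts[topic][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Toy training data, invented for illustration only
training = [
    ("goal match team".split(), "sports"),
    ("election vote party".split(), "politics"),
]
model = train(training)
print(classify("team goal".split(), model))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents get longer than a few words.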
It sounds like you want to start with a set of keywords that are known to cue for certain topics and then use those keywords to bootstrap a classifier.
This is a reasonably clever idea. Take a look at the paper Text Classification by Bootstrapping with Keywords, EM and Shrinkage by McCallum and Nigam (1999). By following this approach, they were able to improve classification accuracy from the 45% they got by using hard-coded keywords alone to 66% using a bootstrapped Naive Bayes classifier. For their data, the latter is close to human levels of agreement, as people agreed with each other about document labels 72% of the time.
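To make the bootstrapping idea concrete, here is a rough sketch in Python. It is a simplified, hard-assignment variant (label with keywords, train naive Bayes, relabel everything, retrain), not the paper's full method, which uses soft EM and shrinkage. The seed keywords, function names, and sample documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hard-coded seed keywords per topic
SEED_KEYWORDS = {
    "sports": {"goal", "team"},
    "politics": {"vote", "election"},
}


def keyword_label(words, seeds):
    """Label a document by seed-keyword hits; None if no hits or a tie."""
    hits = {topic: sum(w in kws for w in words) for topic, kws in seeds.items()}
    best = max(hits, key=hits.get)
    tied = [t for t, n in hits.items() if n == hits[best]]
    return best if hits[best] > 0 and len(tied) == 1 else None


def train_nb(labeled_docs):
    """Collect per-topic document and word counts."""
    topic_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for words, topic in labeled_docs:
        topic_counts[topic] += 1
        word_counts[topic].update(words)
        vocab.update(words)
    return topic_counts, word_counts, vocab


def predict(words, model):
    """Most probable topic under naive Bayes with add-one smoothing."""
    topic_counts, word_counts, vocab = model
    total_docs = sum(topic_counts.values())
    best_topic, best_score = None, float("-inf")
    for topic, n_docs in topic_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[topic].values())
        for w in words:
            if w in vocab:  # skip words the model has never seen
                score += math.log(
                    (word_counts[topic][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


def bootstrap(docs, seeds, rounds=3):
    """Seed labels from keywords, then alternate retraining and relabeling."""
    labels = [keyword_label(d, seeds) for d in docs]
    model = train_nb([])
    for _ in range(rounds):
        labeled = [(d, l) for d, l in zip(docs, labels) if l is not None]
        model = train_nb(labeled)
        labels = [predict(d, model) for d in docs]
    return model, labels


# Invented sample documents; the last two contain no seed keywords
docs = [
    ["team", "scored", "goal", "striker"],
    ["election", "vote", "ballot", "senator"],
    ["striker", "scored", "penalty"],
    ["senator", "ballot", "debate"],
]
model, labels = bootstrap(docs, SEED_KEYWORDS)
print(labels)
```

After the first round the classifier has picked up words such as "striker" and "ballot" from the keyword-labeled documents, so it can label documents that contain no seed keywords at all. This is roughly the "improve the dictionary as I go" behaviour asked about, with the per-topic word counts playing the role of the growing dictionary.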