Problem description
First up, sorry for my not-so-perfect English... I am from Germany ;)
So, for a research project of mine (Bachelor thesis) I need to analyze the sentiment of tweets about certain companies and brands. For this purpose I will need to script my own program or use some sort of modified open source code (no APIs - I need to understand what is happening).
Below you will find a list of some of the NLP applications I found. My question now is: which one, and which approach, would you recommend? And which one does not require long nights adjusting the code?
For example: when I screen Twitter for the music player >iPod< and someone writes: "It's a terrible day but at least my iPod makes me happy" or even harder: "It's a terrible day but at least my iPod makes up for it"
Which software is smart enough to understand that the focus is on the iPod and not the weather?
Also, which software is scalable / resource efficient (I want to analyze several tweets and don't want to spend thousands of dollars)?
Machine Learning and Data Mining
Weka - A collection of machine learning algorithms for data mining. It is one of the most popular text classification frameworks. It contains implementations of a wide variety of algorithms including Naive Bayes and Support Vector Machines (SVM, listed under SMO) [Note: Other commonly used non-Java SVM implementations are SVM-Light, LibSVM, and SVMTorch]. A related project is Kea (Keyphrase Extraction Algorithm), an algorithm for extracting keyphrases from text documents.
Apache Lucene Mahout - An incubator project to create highly scalable distributed implementations of common machine learning algorithms on top of the Hadoop map-reduce framework.
NLP Tools
LingPipe - (not technically 'open-source', see below) Alias-I's LingPipe is a suite of Java tools for linguistic processing of text including entity extraction, part-of-speech (POS) tagging, clustering, classification, etc. It is one of the most mature and widely used NLP toolkits with available source code in industry. It is known for its speed, stability, and scalability. One of its best features is the extensive collection of well-written tutorials to help you get started. They have a list of links to competition, both academic and industrial tools. Be sure to check out their blog. LingPipe is released under a royalty-free commercial license that includes the source code, but it's not technically 'open-source'.
OpenNLP - Hosts a variety of Java-based NLP tools which perform sentence detection, tokenization, part-of-speech tagging, chunking and parsing, named-entity detection, and co-reference analysis using the Maxent machine learning package.
Stanford Parser and Part-of-Speech (POS) Tagger - Java packages for sentence parsing and part-of-speech tagging from the Stanford NLP group. It has implementations of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. It is released under the full GNU GPL license.
OpenFST - A package for manipulating weighted finite state automata. These are often used to represent a probabilistic model. They are used to model text for speech recognition, OCR error correction, machine translation, and a variety of other tasks. The library was developed by contributors from Google Research and NYU. It is a C++ library that is meant to be fast and scalable.
NLTK - The Natural Language Toolkit is a tool for teaching and researching classification, clustering, part-of-speech tagging and parsing, and more. It contains a set of tutorials and data sets for experimentation. It is written by Steven Bird, from the University of Melbourne.
OpinionFinder - A system that performs subjectivity analysis, automatically identifying when opinions, sentiments, speculations and other private states are present in text. Specifically, OpinionFinder aims to identify subjective sentences and to mark various aspects of the subjectivity in these sentences, including the source (holder) of the subjectivity and words that are included in phrases expressing positive or negative sentiments.
Tawlk/osae - A Python library for sentiment classification on social text. The end goal is to have a simple library that "just works". It should have a low barrier to entry and be thoroughly documented. We have achieved best accuracy using stopwords filtering with tweets collected on negwords.txt and poswords.txt.
GATE - GATE is over 15 years old and is in active use for all types of computational task involving human language. GATE excels at text analysis of all shapes and sizes. From large corporations to small startups, from multi-million-euro research consortia to undergraduate projects, our user community is the largest and most diverse of any system of this type, and is spread across all but one of the continents.
textir - A suite of tools for text and sentiment mining. This includes the 'mnlm' function, for sparse multinomial logistic regression, 'pls', a concise partial least squares routine, and the 'topics' function, for efficient estimation and dimension selection in latent topic models.
NLP Toolsuite - The JULIE Lab offers a comprehensive NLP tool suite for the application purposes of semantic search, information extraction and text mining. Most of our continuously expanding tool suite is based on machine learning methods and thus is domain- and language-independent.
...
On a side note: would you recommend the Twitter streaming or the GET API?
As for me, I am a fan of Python and Java ;)
Thank you very much for your help!
Recommended answer
I'm not sure how much I can help, but I have worked with hand-rolled NLP before. A couple of issues come to mind. Not all products are language agnostic (human language that is, not computer language). If you're planning on analysing German tweets, it's going to be important that your selected product is able to handle the German language. Obvious I know, but easy to forget. Then there's the fact that it's Twitter, where contractions and acronyms abound, and the language structure is constrained by the character limit, which means that the grammar won't always match the expected structure of the language.
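To illustrate the point about contractions and acronyms, here is a minimal Python sketch of a normalisation pass that could run before any tagger or classifier sees the tweet. The lookup tables and the function name normalise_tweet are made up for illustration and are far too small for real use; the expansions are also guesses (for example, "it's" can mean "it has"), so treat this as a starting point rather than a finished method.

```python
import re

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration only).
CONTRACTIONS = {"it's": "it is", "i'm": "i am", "don't": "do not", "can't": "cannot"}
ABBREVIATIONS = {"idk": "i do not know", "gr8": "great", "u": "you"}

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
TOKEN = re.compile(r"[a-z0-9']+")


def normalise_tweet(text):
    """Lower-case a tweet, drop URLs and @mentions, expand known short forms."""
    text = URL.sub(" ", text)
    text = MENTION.sub(" ", text)
    words = []
    for token in TOKEN.findall(text.lower()):
        token = token.strip("'")  # remove stray quote marks around a word
        if not token:
            continue
        expanded = CONTRACTIONS.get(token) or ABBREVIATIONS.get(token) or token
        words.extend(expanded.split())
    return words


print(normalise_tweet("@bob idk, it's gr8! http://t.co/x #iPod"))
```

Hashtag markers are dropped by the tokeniser, so "#iPod" comes out as the plain word "ipod". For German tweets the tables and the token pattern (umlauts, ß) would need to change.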
In English, pulling nouns from a sentence can be simplified somewhat if you ever have to write code of your own. Proper nouns have initial capitals, and a string of such words (possibly including "of") is an example of a noun phrase. A word preceded by "a/an/my/his/hers/the/this/these/those" is going to be either an adjective or a noun. It gets harder after that, unfortunately.
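The two heuristics above can be sketched in a few lines of Python. This is only a rough illustration under simple assumptions: the function names and the token pattern are my own, and no library is involved.

```python
import re

DETERMINERS = {"a", "an", "my", "his", "hers", "the", "this", "these", "those"}
TOKEN = re.compile(r"[A-Za-z][A-Za-z']*")


def proper_noun_phrases(text):
    """Collect runs of capitalised words, allowing 'of' inside a run."""
    phrases, current = [], []

    def flush():
        # A run must not end in a dangling 'of'.
        while current and current[-1] == "of":
            current.pop()
        if current:
            phrases.append(" ".join(current))
        current.clear()

    for token in TOKEN.findall(text):
        if token[0].isupper():
            current.append(token)
        elif token == "of" and current:
            current.append(token)
        else:
            flush()
    flush()
    return phrases


def words_after_determiners(text):
    """Return each word that directly follows a determiner (adjective or noun candidates)."""
    tokens = TOKEN.findall(text)
    return [nxt for prev, nxt in zip(tokens, tokens[1:]) if prev.lower() in DETERMINERS]


print(proper_noun_phrases("the library was written by Steven Bird at the University of Melbourne"))
print(words_after_determiners("It's a terrible day but at least my iPod makes me happy"))
```

Note the weaknesses: the capitalisation rule also picks up the first word of every sentence (here "It's" would count as a proper noun) and misses brand names such as "iPod" that start with a lower-case letter, which is exactly the kind of exception that makes hand-rolled rules hard on Twitter text.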
There are rules which help identify plurals, but there are also lots of exceptions. I'm talking about English here of course; my very poor spoken German doesn't help me understand that grammar, I'm afraid.
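As a rough sketch of what "rules plus exceptions" looks like in practice for English plurals, the Python snippet below checks small exception tables first and only then falls back to suffix rules. The word lists and function names are assumptions chosen for illustration and are nowhere near complete.

```python
# Hypothetical exception tables; real English needs many more entries.
IRREGULAR = {"children": "child", "men": "man", "women": "woman",
             "mice": "mouse", "feet": "foot", "people": "person"}
UNCHANGED = {"news", "series", "species", "sheep"}


def singularize(word):
    """Return a guessed singular form of an English noun, in lower case."""
    lower = word.lower()
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    if lower in UNCHANGED:
        return lower
    if lower.endswith("ies") and len(lower) > 4:
        return lower[:-3] + "y"          # companies -> company
    if lower.endswith(("ches", "shes", "sses", "xes")):
        return lower[:-2]                # boxes -> box
    if lower.endswith("s") and not lower.endswith(("ss", "us", "is")):
        return lower[:-1]                # tweets -> tweet
    return lower


def is_plural(word):
    """Guess whether a word is plural: true if singularising changes it."""
    return singularize(word) != word.lower()


for w in ["companies", "boxes", "tweets", "children", "glass", "series"]:
    print(w, "->", singularize(w), is_plural(w))
```

Even this short version already gets some words wrong (for example, "buses" becomes "buse"), which shows how quickly the exceptions pile up.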