Problem description
I want to start developing an NLP project, but I don't know much about the available tools. After about a month of googling, I concluded that OpenNLP could be my solution.
Unfortunately, I haven't found a complete tutorial on using the API; all of them skip some basic steps. I need a tutorial that starts from the ground up. I have seen a lot of downloads on the site but don't know how to use them. Do I need to train something? Here is what I want to know:
How do I install and set up something that can:
- parse an English sentence into words
- identify the different parts of speech
Recommended answer
You say that you need to 'parse' each sentence. You probably already know this, but to be explicit: in NLP, the term 'parse' usually means recovering some hierarchical syntactic structure. The most common types are constituent structure (e.g., via a context-free grammar) and dependency structure.
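To make the two kinds of structure concrete, here is a minimal, hand-built sketch in plain Java. It does not call any parser; the class names (`ParseStructures`, `Node`, `Arc`), the example sentence, and the Penn Treebank style labels are illustrative assumptions, not part of OpenNLP or any other library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParseStructures {

    /** A node in a constituent (phrase-structure) tree. Hypothetical type for illustration. */
    static final class Node {
        final String label;                       // phrase label or POS tag, e.g. "NP" or "NN"
        final String word;                        // non-null only for leaf nodes
        final List<Node> children = new ArrayList<>();

        Node(String label, String word) {
            this.label = label;
            this.word = word;
        }

        static Node leaf(String tag, String word) {
            return new Node(tag, word);
        }

        static Node phrase(String label, Node... kids) {
            Node n = new Node(label, null);
            n.children.addAll(Arrays.asList(kids));
            return n;
        }

        /** Renders the tree in bracketed notation. */
        String bracketed() {
            if (word != null) {
                return "(" + label + " " + word + ")";
            }
            StringBuilder sb = new StringBuilder("(" + label);
            for (Node c : children) {
                sb.append(' ').append(c.bracketed());
            }
            return sb.append(')').toString();
        }

        /** Collects word/tag pairs from the leaves, left to right. */
        void collectTags(List<String> out) {
            if (word != null) {
                out.add(word + "/" + label);
                return;
            }
            for (Node c : children) {
                c.collectTags(out);
            }
        }
    }

    /** One dependency arc from a head word to a dependent word. Hypothetical type. */
    static final class Arc {
        final String head;
        final String relation;
        final String dependent;

        Arc(String head, String relation, String dependent) {
            this.head = head;
            this.relation = relation;
            this.dependent = dependent;
        }

        @Override
        public String toString() {
            return head + " -" + relation + "-> " + dependent;
        }
    }

    /** Hand-built constituent tree for the example sentence "The dog barks". */
    static Node exampleTree() {
        return Node.phrase("S",
                Node.phrase("NP", Node.leaf("DT", "The"), Node.leaf("NN", "dog")),
                Node.phrase("VP", Node.leaf("VBZ", "barks")));
    }

    /** Hand-built dependency arcs for the same sentence; "barks" is the root. */
    static List<Arc> exampleDependencies() {
        return Arrays.asList(
                new Arc("barks", "nsubj", "dog"),
                new Arc("dog", "det", "The"));
    }

    public static void main(String[] args) {
        Node tree = exampleTree();
        System.out.println(tree.bracketed());

        List<String> tags = new ArrayList<>();
        tree.collectTags(tags);
        System.out.println(tags);

        for (Arc arc : exampleDependencies()) {
            System.out.println(arc);
        }
    }
}
```

The constituent tree groups words into nested phrases, while the dependency structure links each word directly to its head. Note that the leaves of the constituent tree already carry part-of-speech tags, so a parse gives you the tags as a by-product.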
If you need hierarchical structure, I'd recommend you consider starting directly with a parser. Most parsers I'm aware of perform POS tagging as part of parsing, and may provide more accurate tags than finite-state POS taggers. (Caveat: I'm much more familiar with constituent parsers than with dependency parsers. It's possible that some or most dependency parsers require POS tags as input.)
The big downside to parsing is the time complexity. Finite-state POS taggers often run at thousands of words per second. Even greedy dependency parsers are considerably slower, and constituent parsers generally run at 1-5 sentences per second. So if you don't need hierarchical structure, you probably want to stick with a finite-state POS tagger for efficiency.
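As a rough back-of-the-envelope illustration of that gap, the sketch below estimates processing time for a corpus. The corpus size, the average sentence length, and the exact rates are assumptions chosen to fall within the ranges quoted above; the class and method names are hypothetical.

```java
public class ThroughputEstimate {

    /** Seconds needed to process the given number of units at a fixed rate. */
    static double secondsFor(long units, double unitsPerSecond) {
        if (unitsPerSecond <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        return units / unitsPerSecond;
    }

    public static void main(String[] args) {
        long sentences = 100_000;            // assumed corpus size
        int wordsPerSentence = 20;           // assumed average sentence length
        double taggerWordsPerSec = 2_000;    // assumed value for "thousands of words per second"
        double parserSentencesPerSec = 3;    // assumed value within "1-5 sentences per second"

        double taggerSeconds = secondsFor(sentences * wordsPerSentence, taggerWordsPerSec);
        double parserSeconds = secondsFor(sentences, parserSentencesPerSec);

        System.out.printf("POS tagger:         %.0f s (%.1f min)%n", taggerSeconds, taggerSeconds / 60);
        System.out.printf("Constituent parser: %.0f s (%.1f h)%n", parserSeconds, parserSeconds / 3600);
    }
}
```

Under these assumed numbers the tagger needs roughly 17 minutes and the constituent parser a little over 9 hours, which is why it matters whether you actually need the hierarchical structure.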
If you do decide you need parse structure, a few recommendations:
I think the Stanford parser suggested by @aab includes both a constituent parser and a dependency parser.
The Berkeley Parser ( http://code.google.com/p/berkeleyparser/ ) is a fairly well-known PCFG constituent parser. It achieves state-of-the-art accuracy (equal or superior to the Stanford parser, I believe) and is reasonably efficient (~3-5 sentences per second).
The BUBS Parser ( http://code.google.com/p/bubs-parser/ ) can also run with the high-accuracy Berkeley grammar, and improves efficiency to around 15-20 sentences per second. Full disclosure: I'm one of the primary researchers working on this parser.
Warning: both of these parsers are research code, with all the problems that entails. But I'd love to see people actually using BUBS, so if it's of use to you, give it a try and contact me with problems, comments, suggestions, etc.
And a couple of Wikipedia references for background, if needed:
Dependency grammar: http://en.wikipedia.org/wiki/Dependency_grammar