This article covers getting started with simple natural language processing in Java.

Question

I would like to start developing an NLP project, but I don't know much about the available tools. After about a month of googling, I concluded that OpenNLP could be my solution.

Unfortunately, I can't find a complete tutorial on using the API; every one I have seen skips some general steps. I need a tutorial that starts from the ground up. I have seen a lot of downloads on the site but don't know how to use them. Do I need to train something? Here is what I want to know:

How do I install and set up an NLP system that can:


  1. Parse English sentences

  2. Identify the different parts of speech


Recommended answer

You say that you need to 'parse' each sentence. You probably already know this, but to be explicit: in NLP, the term 'parse' usually means recovering some hierarchical syntactic structure. The most common types are constituent structure (e.g., via a context-free grammar) and dependency structure.
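To make the two kinds of structure concrete, here is a minimal sketch in plain Java (no NLP library). The sentence, the class names, and the hand-written analysis are all illustrative assumptions; the tags and relation names only imitate common treebank conventions and are not the output of any parser mentioned in this answer.

```java
import java.util.List;

// Toy data structures for the two parse representations described above.
// The sentence, labels, and class names are illustrative assumptions.
final class ParseStructures {

    // Constituent (phrase-structure) node: a label over a word or over child nodes.
    static final class Node {
        final String label;
        final String word;          // non-null only for leaves
        final List<Node> children;  // empty for leaves

        Node(String label, String word, List<Node> children) {
            this.label = label;
            this.word = word;
            this.children = children;
        }

        static Node leaf(String tag, String word) {
            return new Node(tag, word, List.of());
        }

        static Node phrase(String label, Node... kids) {
            return new Node(label, null, List.of(kids));
        }

        // Render the subtree in bracketed form, e.g. (NP (DT The) (NN dog)).
        String bracketed() {
            if (children.isEmpty()) {
                return "(" + label + " " + word + ")";
            }
            StringBuilder sb = new StringBuilder("(" + label);
            for (Node child : children) {
                sb.append(' ').append(child.bracketed());
            }
            return sb.append(')').toString();
        }
    }

    // Dependency arc: a head word governs a dependent word through a named relation.
    static final class Arc {
        final String head;
        final String relation;
        final String dependent;

        Arc(String head, String relation, String dependent) {
            this.head = head;
            this.relation = relation;
            this.dependent = dependent;
        }

        @Override
        public String toString() {
            return relation + "(" + head + ", " + dependent + ")";
        }
    }

    // Hand-built constituent tree for "The dog barks".
    static Node constituentTree() {
        return Node.phrase("S",
            Node.phrase("NP", Node.leaf("DT", "The"), Node.leaf("NN", "dog")),
            Node.phrase("VP", Node.leaf("VBZ", "barks")));
    }

    // Hand-built dependency analysis of the same sentence.
    static List<Arc> dependencyArcs() {
        return List.of(
            new Arc("ROOT", "root", "barks"),
            new Arc("barks", "nsubj", "dog"),
            new Arc("dog", "det", "The"));
    }

    public static void main(String[] args) {
        System.out.println(constituentTree().bracketed());
        dependencyArcs().forEach(System.out::println);
    }
}
```

The constituent tree groups words into nested phrases, while the dependency analysis links each word directly to its head; both describe the same three-word sentence.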

If you need hierarchical structure, I'd recommend you consider just starting with a parser. Most parsers I'm aware of perform POS tagging during parsing, and may tag more accurately than finite-state POS taggers. (Caveat: I'm much more familiar with constituent parsers than with dependency parsers. It's possible that some or most dependency parsers require POS tags as input.)
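As a rough illustration of how POS tags can fall out of parsing, the sketch below runs a tiny, non-probabilistic CKY-style chart parser over a made-up lexicon and grammar. Everything in it (the class name `ToyCky`, the rules, the example sentence) is an assumption for illustration; the parsers named in this answer use much richer probabilistic grammars and are not implemented this way in detail.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy CKY-style chart parser over a hand-written grammar (illustrative only).
final class ToyCky {

    // Assumed toy lexicon: each word maps to its possible POS tags.
    // "saw" is ambiguous between a verb (VBD) and a noun (NN).
    static final Map<String, List<String>> LEXICON = Map.of(
        "the", List.of("DT"),
        "dog", List.of("NN"),
        "saw", List.of("VBD", "NN"),
        "cat", List.of("NN"));

    // Assumed binary rules: "left right" -> parent label.
    static final Map<String, String> RULES = Map.of(
        "DT NN", "NP",
        "VBD NP", "VP",
        "NP VP", "S");

    // Returns a bracketed tree rooted in S, or null if the words do not parse.
    static String parse(String... words) {
        int n = words.length;
        if (n == 0) {
            return null;
        }
        // chart key "i,j" -> (label -> bracketed subtree covering words i..j-1)
        Map<String, Map<String, String>> chart = new HashMap<>();

        // Width-1 spans: every tag the lexicon allows for each word.
        for (int i = 0; i < n; i++) {
            Map<String, String> cell = new HashMap<>();
            for (String tag : LEXICON.getOrDefault(words[i], List.of())) {
                cell.put(tag, "(" + tag + " " + words[i] + ")");
            }
            chart.put(i + "," + (i + 1), cell);
        }

        // Wider spans: combine two adjacent sub-spans with a binary rule.
        for (int width = 2; width <= n; width++) {
            for (int i = 0; i + width <= n; i++) {
                int j = i + width;
                Map<String, String> cell = new HashMap<>();
                for (int k = i + 1; k < j; k++) {
                    Map<String, String> left = chart.get(i + "," + k);
                    Map<String, String> right = chart.get(k + "," + j);
                    for (Map.Entry<String, String> l : left.entrySet()) {
                        for (Map.Entry<String, String> r : right.entrySet()) {
                            String parent = RULES.get(l.getKey() + " " + r.getKey());
                            if (parent != null) {
                                cell.putIfAbsent(parent,
                                    "(" + parent + " " + l.getValue() + " " + r.getValue() + ")");
                            }
                        }
                    }
                }
                chart.put(i + "," + j, cell);
            }
        }
        return chart.get("0," + n).get("S");
    }

    public static void main(String[] args) {
        System.out.println(parse("the", "dog", "saw", "the", "cat"));
    }
}
```

In this toy grammar only the verb reading of "saw" can combine into a full sentence, so the tag is settled by the surrounding structure rather than by a separate tagging pass.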

The big downside to parsing is the time complexity. Finite-state POS taggers often run at thousands of words per second. Even greedy dependency parsers are considerably slower, and constituent parsers generally run at 1-5 sentences per second. So if you don't need hierarchical structure, you probably want to stick with a finite-state POS tagger for efficiency.
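A back-of-the-envelope calculation shows what these rates mean in practice. The corpus size, the average sentence length, and the exact rates picked below are assumptions chosen to fall inside the ranges quoted above; they are not measurements.

```java
// Rough processing-time estimate for a tagger versus a constituent parser.
// All input numbers are assumptions, not benchmarks.
final class ThroughputEstimate {

    // Seconds needed to process `units` items at `unitsPerSecond`.
    static double seconds(long units, double unitsPerSecond) {
        return units / unitsPerSecond;
    }

    public static void main(String[] args) {
        long sentences = 100_000;          // assumed corpus size
        int wordsPerSentence = 20;         // assumed average sentence length
        double taggerWordsPerSec = 2_000;  // assumed, "thousands of words per second"
        double parserSentsPerSec = 3;      // assumed, within the 1-5 range above

        double taggerSecs = seconds(sentences * wordsPerSentence, taggerWordsPerSec);
        double parserSecs = seconds(sentences, parserSentsPerSec);

        System.out.printf("tagger: %.0f s (%.1f minutes)%n", taggerSecs, taggerSecs / 60);
        System.out.printf("parser: %.0f s (%.1f hours)%n", parserSecs, parserSecs / 3600);
    }
}
```

Under these assumed numbers the tagger needs about 17 minutes for the whole corpus, while the constituent parser needs a little over 9 hours.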

If you do decide you need parse structure, a few recommendations:

I think the Stanford parser suggested by @aab includes both a constituent parser and a dependency parser.

The Berkeley Parser ( http://code.google.com/p/berkeleyparser/ ) is a pretty well-known PCFG constituent parser. It achieves state-of-the-art accuracy (equal or superior to the Stanford parser, I believe) and is reasonably efficient (~3-5 sentences per second).

The BUBS Parser ( http://code.google.com/p/bubs-parser/ ) can also run with the high-accuracy Berkeley grammar, and improves efficiency to around 15-20 sentences per second. Full disclosure: I'm one of the primary researchers working on this parser.

Warning: both of these parsers are research code, with all the problems that entails. But I'd love to see people actually using BUBS, so if it's of use to you, give it a try and contact me with problems, comments, suggestions, etc.

And a couple of Wikipedia references for background, if needed:


  • Context-free grammar:

  • Dependency grammar:


08-30 22:47