
Question

This question is probably (about 100%) subjective, but I need advice. What is the best language for natural language processing? I know Java and C++, but is there an easier way to do it? To be more specific, I need to process text from a lot of sites and extract information.

Recommended answer

As I said in the comments, the question is not about a language but about a suitable library, and there are a lot of NLP libraries in both Java and C++. I believe you should inspect some of them (in both languages) and then, once you know the range of available libraries, create some kind of "big plan" for how to implement your task. So here I'll just give you some links with a brief explanation of what is what.

GATE - the name stands for General Architecture for Text Engineering. An application in GATE is a pipeline: you put language processing resources such as tokenizers, POS taggers, and morphological analyzers on it and run the process. The result is represented as a set of annotations, that is, meta-information attached to a piece of text (e.g. a token). In addition to a large number of plugins (including plugins for integration with other NLP resources such as WordNet or the Stanford Parser), it has many predefined dictionaries (cities, names, etc.) and its own regex-like language, JAPE. GATE comes with its own IDE (GATE Developer), where you can try out your pipeline setup, then save it and load it from Java code.
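
To make the pipeline-and-annotations idea concrete, here is a minimal sketch in plain Python. It does not use GATE's actual API; the Annotation class, the two stages, and the tiny city list are all hypothetical names invented for illustration. It only shows the general pattern: each stage reads the text plus earlier annotations and attaches new meta-information to character spans.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Meta-information attached to the text span [start, end)."""
    kind: str
    start: int
    end: int
    features: dict = field(default_factory=dict)


def tokenizer(text, annotations):
    """Add one Token annotation per run of word characters."""
    for match in re.finditer(r"\w+", text):
        annotations.append(Annotation("Token", match.start(), match.end()))


# Hypothetical mini-gazetteer; real toolkits ship far larger lists.
CITIES = {"London", "Paris", "Berlin"}


def gazetteer(text, annotations):
    """Mark tokens found in the city list with a Location annotation."""
    tokens = [a for a in annotations if a.kind == "Token"]
    for token in tokens:
        if text[token.start:token.end] in CITIES:
            annotations.append(
                Annotation("Location", token.start, token.end, {"type": "city"})
            )


def run_pipeline(text, stages):
    """Run each stage in order; every stage sees earlier annotations."""
    annotations = []
    for stage in stages:
        stage(text, annotations)
    return annotations


text = "Alice moved from Paris to London."
result = run_pipeline(text, [tokenizer, gazetteer])
locations = [text[a.start:a.end] for a in result if a.kind == "Location"]
print(locations)  # ['Paris', 'London']
```

Because annotations only point at offsets, the original text is never modified, so stages can be added or reordered without rewriting earlier ones.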

UIMA - Unstructured Information Management Architecture. It is very similar to GATE in terms of architecture: it also represents a pipeline and produces a set of annotations. Like GATE, it has a visual IDE where you can try out your future application. The difference is that UIMA is mostly concerned with information extraction, while GATE performs text processing without explicit consideration of its purpose. UIMA also comes with a simple REST server.

OpenNLP - they call themselves an organizational center for open source NLP projects, and this is the most appropriate definition. The main direction of development is applying machine learning algorithms to the most general NLP tasks, such as part-of-speech tagging, named entity recognition, and coreference resolution. It also integrates well with UIMA, so its tools are available there as well.

Stanford NLP - probably the best choice for engineers and researchers with NLP and ML knowledge. Unlike libraries such as GATE and UIMA, it doesn't aim to provide as many tools as possible, but instead concentrates on idiomatic models. For example, you don't get comprehensive dictionaries, but you can train a probabilistic algorithm to create them. In addition to its CoreNLP component, which provides the most widely used tools such as tokenization, POS tagging, and NER, it has several very interesting subprojects. For example, their dependency framework lets you extract the complete sentence structure: you can easily extract the subject and object of a given verb, which is much harder with other NLP tools.
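
As a rough illustration of why dependency structure makes subject and object extraction easy, the sketch below works on a hand-written parse rather than real parser output. The triple layout (relation, head, dependent) and the labels nsubj and obj are assumptions made for this example; an actual parser's output format and label set may differ.

```python
from collections import defaultdict

# Hand-written parse of "The company bought a startup."
# Each triple is (relation, head word, dependent word). A real parser
# would produce these; the labels here are an assumption for illustration.
DEPENDENCIES = [
    ("det", "company", "The"),
    ("nsubj", "bought", "company"),
    ("det", "startup", "a"),
    ("obj", "bought", "startup"),
]


def subjects_and_objects(dependencies):
    """Map each head word to its subject and object dependents."""
    roles = defaultdict(dict)
    for relation, head, dependent in dependencies:
        if relation == "nsubj":
            roles[head]["subject"] = dependent
        elif relation == "obj":
            roles[head]["object"] = dependent
    return dict(roles)


roles = subjects_and_objects(DEPENDENCIES)
print(roles)  # {'bought': {'subject': 'company', 'object': 'startup'}}
```

Once the relations are available, the extraction is a simple lookup; the hard part, producing the parse, is what the parser does for you.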

UIMA - yes, there are complete implementations for both Java and C++.

Stanford Parser - some of Stanford's projects are available only in Java, others only in C++, and some in both languages. You can find many of them here.

A number of web service APIs perform specific language processing, including:

Alchemy API - language identification, named entity recognition, sentiment analysis, and much more! Take a look at their main page; it is quite self-descriptive.

OpenCalais - this service tries to build a giant graph of everything. You pass it a web page URL, and it enriches the page text with the entities it finds, together with the relations between them. For example, you pass it a page mentioning "Steve Jobs" and it returns "Apple Inc." (roughly speaking), together with the probability that this is the same Steve Jobs.

And yes, you should definitely take a look at Python's NLTK. It is not only a powerful and easy-to-use NLP library, but also part of an excellent scientific stack created by an extremely friendly community.
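
For a feel of how little code a first experiment needs, here is a small sketch that counts word frequencies with NLTK's wordpunct_tokenize and FreqDist, neither of which requires downloading extra corpora. The sample sentence is made up, and the fallback branch is an assumption added so the sketch still runs where NLTK is not installed; it mimics the same tokenization with a regular expression.

```python
try:
    from nltk import FreqDist
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback (not part of NLTK): equivalent behavior using the stdlib.
    import re
    from collections import Counter as FreqDist

    def wordpunct_tokenize(text):
        """Split into runs of word characters or runs of punctuation."""
        return re.findall(r"\w+|[^\w\s]+", text)


text = "The cat sat on the mat. The dog sat too."

# Lowercase first so "The" and "the" are counted together.
tokens = wordpunct_tokenize(text.lower())

# Keep alphabetic tokens only, dropping punctuation such as ".".
words = [token for token in tokens if token.isalpha()]

freq = FreqDist(words)
print(freq.most_common(2))  # [('the', 3), ('sat', 2)]
```

From here, the same token list can be passed on to NLTK's taggers and chunkers, which do need their trained models downloaded first.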

Update (2017-11-15): 7 years later, there are even more impressive tools, cool algorithms, and interesting tasks. One comprehensive description can be found here:

https://tomassetti.me/guide-natural-language-processing/
