Problem description

I realize this is a broad topic, but I'm looking for a good primer on parsing meaning from text, ideally in Python. As an example of what I'm looking to do, if a user makes a blog post like:

"Manny Ramirez makes his return for the Dodgers today against the Houston Astros",

what's a lightweight, easy way of getting the nouns out of a sentence? To start, I think I'd limit it to proper nouns, but I wouldn't want to be limited to just that (and I don't want to rely on a simple regex that assumes anything in Title Case is a proper noun).

To make this question even worse, what are the things I'm not asking that I should be? Do I need a corpus of existing words to get started? What lexical analysis stuff do I need to know to make this work? I did come across one other question on the topic and I'm digging through those resources now.

Solution

Use NLTK, in particular chapter 7 of the NLTK book, on Information Extraction.

You say you want to extract meaning, and there are modules for semantic analysis, but I think information extraction (IE) is all you need, and honestly it is one of the few areas of NLP that computers can handle right now.

See sections 7.5 and 7.6 on the subtopics of Named Entity Recognition (to chunk and categorize Manny Ramirez as a person, Dodgers as a sports organization, and Houston Astros as another sports organization, or whatever suits your domain) and Relation Extraction. There is an NER chunker that you can plug in once you have NLTK installed. From their examples, extracting a geo-political entity (GPE) and a person:

>>> import nltk
>>> sent = nltk.corpus.treebank.tagged_sents()[22]
>>> print(nltk.ne_chunk(sent))
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ...)
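
A tree like the one printed here can also be walked in code instead of read off the screen. This is only a minimal sketch, assuming the tree comes from `nltk.ne_chunk` on a tagged sentence, so that named-entity chunks are flat subtrees with a `label()` method and ordinary tokens are plain `(word, tag)` tuples. The helper name `extract_entities` is made up for illustration and is not part of NLTK.

```python
def extract_entities(tree):
    """Collect (entity_type, text) pairs from a chunk tree.

    Assumption: `tree` looks like the result of nltk.ne_chunk(tagged_sentence):
    iterating it yields either (word, tag) tuples or flat subtrees whose
    label() is the entity type (PERSON, GPE, ORGANIZATION, ...).
    """
    entities = []
    for node in tree:
        # Plain tokens are tuples and have no label() method; chunks do.
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node)
            entities.append((node.label(), text))
    return entities


# Hypothetical usage, continuing the interactive session:
#   tree = nltk.ne_chunk(sent)
#   for entity_type, text in extract_entities(tree):
#       print(entity_type, text)
```

For the sentence in the session, the result would include the GPE and PERSON chunks visible in the printed tree, plus whatever entities sit in the elided parts.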

Note you'll still need to know tokenization and tagging, as discussed in earlier chapters, to get your text in the right format for these IE tasks.
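
To make the tokenization and tagging step concrete, here is a rough sketch of how raw text could be tagged and then filtered down to nouns, which is what the question originally asked for. It assumes NLTK's default English tokenizer and tagger data have already been fetched with `nltk.download()` (the exact resource names depend on the NLTK version), and that the tagger emits Penn Treebank tags. The helper names `tag_text` and `extract_nouns` are invented for this example.

```python
# Penn Treebank noun tags (assumed tagset of nltk.pos_tag's default English tagger).
PROPER_NOUN_TAGS = {"NNP", "NNPS"}
COMMON_NOUN_TAGS = {"NN", "NNS"}


def tag_text(text):
    """Tokenize raw text and POS-tag it; needs NLTK and its data installed."""
    import nltk  # imported lazily so extract_nouns works without NLTK

    return nltk.pos_tag(nltk.word_tokenize(text))


def extract_nouns(tagged_tokens, proper_only=False):
    """Return the words whose tag marks them as nouns, in sentence order."""
    wanted = PROPER_NOUN_TAGS if proper_only else PROPER_NOUN_TAGS | COMMON_NOUN_TAGS
    return [word for word, tag in tagged_tokens if tag in wanted]


# Hypothetical usage with the blog-post sentence from the question:
#   tagged = tag_text("Manny Ramirez makes his return for the Dodgers "
#                     "today against the Houston Astros")
#   extract_nouns(tagged, proper_only=True)
```

Filtering on tags gives you individual words only. Grouping "Manny" and "Ramirez" into one name, and deciding that it is a person, is the job of the NER chunker described above.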
