Question

Without getting a degree in information retrieval, I'd like to know whether there are any algorithms for counting how frequently words occur in a given body of text. The goal is to get a "general feel" for what people are saying across a set of textual comments, along the lines of Wordle.

What I'd like:

  • ignore articles, pronouns, etc. ("a", "an", "the", "him", "them", etc.)
  • preserve proper nouns
  • ignore hyphenation, except for the soft kind

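Reaching for the stars, these would be peachy:

  • handling stemming and plurals (e.g. "like", "likes", "liked", "liking" match the same result)
  • grouping adjectives (adverbs, etc.) with their subjects ("great service" as opposed to "great", "service")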

I've attempted some basic stuff using WordNet, but I'm just tweaking things blindly and hoping it works for my specific data. Something more generic would be great.

Recommended answer

You'll need not one, but several nice algorithms, along the lines of the following.

  • Ignoring pronouns is done via a stoplist.
  • Preserving proper nouns? You mean detecting named entities, like Hoover Dam, and saying "it's one word", or compound nouns, like programming language? I'll give you a hint: that's a tough one, but libraries exist for both. Look for NER (named entity recognition) and lexical chunking. OpenNLP is a Java toolkit that does both.
  • Ignoring hyphenation? You mean, like at line breaks? Use regular expressions and verify the resulting word via dictionary lookup.
  • Handling plurals/stemming: you can look into the Snowball stemmer. It does the trick nicely.
  • "Grouping" adjectives with their nouns is generally a task for shallow parsing. But if you are looking specifically for qualitative adjectives (good, bad, shitty, amazing...), you may be interested in sentiment analysis. LingPipe does this, and a lot more.
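As a rough illustration of the stoplist and line-break hyphenation points above, here is a minimal Python sketch using only the standard library. The stoplist, the dictionary, and the function names are all made up for the example; a real setup would load much larger word lists. Note that lowercasing everything, as done here, throws away the capitalization you would need for proper nouns.

```python
import re
from collections import Counter

# Tiny illustrative stoplist (an assumption; real stoplists are much longer).
STOPLIST = {"a", "an", "the", "him", "them", "it", "is", "was", "and", "of", "to"}

# Toy dictionary used to validate re-joined hyphenated words (an assumption).
DICTIONARY = {"wonderful", "service"}


def join_line_break_hyphens(text, dictionary=DICTIONARY):
    """Re-join words hyphenated across a line break if the joined form is a known word."""
    def repl(match):
        joined = match.group(1) + match.group(2)
        # Keep the original text when the joined word is not in the dictionary.
        return joined if joined.lower() in dictionary else match.group(0)
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", repl, text)


def word_frequencies(text, stoplist=STOPLIST):
    """Count lowercase word tokens, skipping anything in the stoplist."""
    cleaned = join_line_break_hyphens(text).lower()
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", cleaned)
    return Counter(t for t in tokens if t not in stoplist)


if __name__ == "__main__":
    comments = "The service was won-\nderful and the service was fast."
    print(word_frequencies(comments).most_common(3))
```

The dictionary check is what keeps genuinely hyphenated words such as "well-known" from being glued together when they happen to fall at a line break.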

I'm sorry, I know you said you wanted to KISS (keep it simple), but unfortunately your demands aren't that easy to meet. Nevertheless, tools exist for all of this, and you should be able to just tie them together without having to perform any task yourself, if you don't want to. If you do want to perform a task yourself, I suggest you look at stemming; it's the easiest of all.
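To give a feel for why stemming is the easiest piece to try yourself, here is a deliberately crude suffix stripper in Python. It is not the Snowball algorithm, and `toy_stem` is a made-up name; it only shows the basic idea of mapping inflected forms to a shared key. It will over-stem plenty of real words, which is exactly why a proper stemmer is worth using.

```python
def toy_stem(word):
    """Strip a few common English suffixes. A crude illustration, not Snowball."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s", "e"):
        # Only strip when a stem of at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("like", "likes", "liked", "liking"):
        print(w, "->", toy_stem(w))
```

The stems it produces are not real words; they only serve as grouping keys when counting.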

If you go with Java, combine Lucene with the OpenNLP toolkit. You will get very good results, as Lucene already has a stemmer built in and there are a lot of tutorials for it. The OpenNLP toolkit, on the other hand, is poorly documented, but you won't need too much out of it. You might also be interested in NLTK, written in Python.

I would say you should drop your last requirement, as it involves shallow parsing and will definitely not improve your results.

Ah, by the way: the exact term for that document-term-frequency thing you were looking for is tf-idf. It's pretty much the standard way to weight terms by how frequent they are within a document and how rare they are across documents. To do it properly, you won't get around using multidimensional vector matrices.
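For reference, one common textbook form of tf-idf can be sketched in a few lines of Python. This assumes tf is the term count divided by document length and idf is log(N / df); other variants with smoothing or different normalization exist, and the function name `tf_idf` is just for illustration. Each returned dict is effectively one sparse row of the term-document matrix mentioned above.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    tf is the term count divided by document length; idf is
    log(N / df), where df is the number of documents containing the term.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document it appears in.
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = [["great", "service", "great"], ["slow", "service"]]
    for row in tf_idf(docs):
        print(row)
```

With this variant, a term that appears in every document gets a weight of zero, which is the intended effect: words everyone uses say little about any one comment.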

... Yes, I know. After taking a seminar on IR, my respect for Google was even greater. After doing some stuff in IR, my respect for them fell just as quickly, though.
