
Question

Is there a Ruby gem or other tool for text analysis, such as word frequency or pattern detection (preferably one that understands French)?

Recommended answer

Word frequencies generalize to language models: uni-grams (single-word frequencies), bi-grams (frequencies of word pairs), tri-grams (frequencies of word triples), and in general n-grams.
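To make the n-gram idea concrete, here is a minimal Ruby sketch that counts n-grams in a short string. The method name `ngram_counts` and the tokenizer (lowercase, then runs of Unicode letters and apostrophes) are assumptions made for illustration. Real French text needs more careful tokenization, for example for elisions such as "l'été", and a real corpus is better served by one of the toolkits mentioned below.

```ruby
# Count n-grams in a text.
# The tokenization is a deliberately naive assumption: lowercase the text,
# then treat every run of Unicode letters, marks and apostrophes as a token.
def ngram_counts(text, n)
  tokens = text.downcase.scan(/[\p{L}\p{M}']+/)
  counts = Hash.new(0)
  # each_cons yields every window of n consecutive tokens.
  tokens.each_cons(n) { |gram| counts[gram] += 1 }
  counts
end

text = "le chat dort et le chat mange"
unigrams = ngram_counts(text, 1) # single-word frequencies
bigrams  = ngram_counts(text, 2) # word-pair frequencies

# Print the three most frequent bigrams, most frequent first.
bigrams.sort_by { |_, count| -count }.first(3).each do |gram, count|
  puts "#{gram.join(' ')}\t#{count}"
end
```

Keys are arrays of words, so `unigrams[["chat"]]` and `bigrams[["le", "chat"]]` look up individual counts.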

You should look for an existing language-model toolkit; reinventing the wheel here is not a good idea.

A few standard toolkits are available, e.g. from the CMU Sphinx team, and also HTK.

These toolkits are typically written in C (for speed, because you have to process huge corpora) and write their output as ARPA n-gram files, a standard format that is usually plain text.

Check the following thread, which contains more details and links:

Building openears compatible language model

Once you have generated your language model with one of these toolkits, you will need either a Ruby gem that makes the language model accessible in Ruby, or you will need to convert the ARPA format into your own format.
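If you go the conversion route, a small reader is often enough. The sketch below assumes the common ARPA text layout: a `\data\` header, one `\N-grams:` section per order, lines of the form "log10 probability, N words, optional backoff weight", and a closing `\end\`. The method name `parse_arpa`, the returned hash layout, and the sample numbers are all made up for illustration; they are not output from any particular toolkit.

```ruby
require "stringio"

# Parse ARPA n-gram text into a nested hash:
#   { order => { [word, ...] => [log10_prob, backoff_or_nil] } }
def parse_arpa(io)
  model = Hash.new { |hash, key| hash[key] = {} }
  order = nil
  io.each_line do |line|
    line = line.strip
    next if line.empty?
    if (match = line.match(/\A\\(\d+)-grams:\z/))
      order = match[1].to_i          # entering an \N-grams: section
    elsif line.start_with?("\\")
      order = nil                    # \data\ or \end\ marker
    elsif order
      fields  = line.split
      logprob = Float(fields[0])
      words   = fields[1, order]
      # The backoff weight is optional and absent for the highest order.
      backoff = fields[1 + order] ? Float(fields[1 + order]) : nil
      model[order][words] = [logprob, backoff]
    end
    # Lines outside any section (e.g. "ngram 1=3" counts) are ignored.
  end
  model
end

# Tiny hand-written sample; the numbers are invented for the example.
sample = <<~'ARPA'
  \data\
  ngram 1=3
  ngram 2=1

  \1-grams:
  -0.4771 le -0.3010
  -0.4771 chat -0.3010
  -0.4771 </s>

  \2-grams:
  -0.1761 le chat

  \end\
ARPA

model = parse_arpa(StringIO.new(sample))
p model[2][["le", "chat"]]
```

For a real file, pass `File.open(path)` instead of the `StringIO`. Very large models may not fit comfortably in a Ruby hash, in which case writing the parsed entries to a database or key-value store is a reasonable alternative.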

adi92's post lists some more Ruby NLP resources.

You can also search Google for "ARPA Language Model" for more information.

Last but not least, check Google's online N-gram tool. Google built n-grams from the books it digitized, and the data is also available in French and other languages.


08-13 19:15