Problem description
I want to know whether there is an API for text analysis in Java: something that can extract all the words in a text, separate words, expressions, etc., and that can tell whether a word it finds is a number, date, year, name, currency, etc.
I'm just starting with text analysis, so I only need an API to get going. I wrote a web crawler, and now I need something to analyze the downloaded data. I need methods to count the number of words in a page, find similar words, detect data types, and access other resources related to the text.
Are there APIs for text analysis in Java?
Edit: Text mining. I want to mine the text, and I'm looking for a Java API that provides this.
Recommended answer
For example, you might use some classes from the standard library's java.text package, or use StreamTokenizer (which you can customize according to your requirements). But as you know, text data from internet sources usually contains many orthographical mistakes, and for better results you need something like a fuzzy tokenizer; java.text and the other standard utilities have too limited capabilities in that context.
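As a rough illustration of the standard-library route mentioned above, the sketch below counts words with java.text.BreakIterator and labels tokens as numbers or words with java.io.StreamTokenizer. The class name StandardLibraryTokenizing and the helper methods words and classify are made up for this example; only the JDK classes themselves come from the answer.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class StandardLibraryTokenizing {

    // Splits text at java.text.BreakIterator word boundaries.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Boundaries also delimit spaces and punctuation,
            // so keep only pieces that start with a letter or digit.
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                result.add(piece);
            }
        }
        return result;
    }

    // Labels each token as NUMBER or WORD using StreamTokenizer's defaults.
    static List<String> classify(String text) {
        List<String> result = new ArrayList<>();
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    result.add("NUMBER:" + st.nval); // nval is always a double
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    result.add("WORD:" + st.sval);
                }
                // Any other ttype is a single ordinary character; skipped here.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String page = "The crawler found 42 pages.";
        System.out.println("word count: " + words(page).size());
        System.out.println(classify("found 42 pages"));
    }
}
```

StreamTokenizer's defaults were designed for source-code-like input (for instance, '/' starts a comment and quote characters start string tokens), so crawled web text will likely need the customization the answer mentions, such as ordinaryChar or resetSyntax calls.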
So I'd advise you to use regular expressions (java.util.regex) and create your own kind of tokenizer according to your needs.
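A minimal sketch of such a home-grown tokenizer, assuming a handful of token types taken from the question (date, currency, year, number, word). The patterns are deliberately simplistic assumptions, for example dates only in yyyy-MM-dd form and years only in 1900-2099, and the class name RegexTokenizer is invented for this example. Recognizing personal names is not attempted, since that generally needs dictionaries or an NLP library rather than a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTokenizer {

    // Group names double as token types; alternatives are ordered from
    // most specific to least specific so that, e.g., a year is not
    // reported as a plain number.
    private static final String[] TYPES = {"DATE", "CURRENCY", "YEAR", "NUMBER", "WORD"};

    private static final Pattern TOKEN = Pattern.compile(
          "(?<DATE>\\b\\d{4}-\\d{2}-\\d{2}\\b)"            // assumed format: 2011-05-03
        + "|(?<CURRENCY>[$\u20AC\u00A3]\\d+(?:\\.\\d{2})?)" // $, euro or pound sign + amount
        + "|(?<YEAR>\\b(?:19|20)\\d{2}\\b)"                // assumed range: 1900-2099
        + "|(?<NUMBER>\\b\\d+(?:\\.\\d+)?\\b)"
        + "|(?<WORD>\\b\\p{L}+\\b)");

    // Returns tokens as "TYPE:text" strings, in order of appearance.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            for (String type : TYPES) {
                if (m.group(type) != null) { // this alternative produced the match
                    tokens.add(type + ":" + m.group());
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Paid $19.99 on 2011-05-03 for 3 books in 2010"));
        // prints [WORD:Paid, CURRENCY:$19.99, WORD:on, DATE:2011-05-03,
        //         WORD:for, NUMBER:3, WORD:books, WORD:in, YEAR:2010]
    }
}
```

Counting words then reduces to counting the WORD tokens, and new token types can be added as further named alternatives.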
P.S. Depending on your needs, you might create a state-machine parser for recognizing templated parts in raw text. A simple state-machine recognizer is shown in the picture below (you can construct a more advanced parser that recognizes much more complex templates in text).
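To make the state-machine idea concrete, here is a small sketch of a recognizer for one template, a plain decimal number such as 42 or 3.14. It is not the machine from the answer's picture; the states, the template, and the class name DecimalRecognizer are all assumptions chosen for illustration.

```java
class DecimalRecognizer {

    enum State { START, INTEGER, DOT, FRACTION, REJECT }

    // One transition of the machine: current state + input character -> next state.
    static State step(State state, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case START:
                return digit ? State.INTEGER : State.REJECT;
            case INTEGER:
                if (digit) return State.INTEGER;
                return c == '.' ? State.DOT : State.REJECT;
            case DOT:
                // A dot must be followed by at least one digit.
                return digit ? State.FRACTION : State.REJECT;
            case FRACTION:
                return digit ? State.FRACTION : State.REJECT;
            default:
                return State.REJECT;
        }
    }

    // Runs the machine over the whole token; INTEGER and FRACTION are accepting states.
    static boolean matches(String token) {
        State state = State.START;
        for (int i = 0; i < token.length() && state != State.REJECT; i++) {
            state = step(state, token.charAt(i));
        }
        return state == State.INTEGER || state == State.FRACTION;
    }

    public static void main(String[] args) {
        for (String token : new String[] {"42", "3.14", "3.", ".5", "abc"}) {
            System.out.println(token + " -> " + matches(token));
        }
    }
}
```

Here "42" and "3.14" are accepted while "3.", ".5" and "abc" are rejected. A more advanced parser in the same style would add states and transitions for each additional template part.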