This article compares the different analyzers in Lucene and may serve as a reference for anyone hitting the same problem.

Problem Description

Can someone please explain the difference between the different analyzers within Lucene? I am getting a maxClauseCount exception and I understand that I can avoid this by using a KeywordAnalyzer, but I don't want to change from the StandardAnalyzer without understanding the issues surrounding analyzers. Thanks very much.

Recommended Answer

In general, any analyzer in Lucene is a tokenizer + stemmer + stop-words filter.
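To make these three parts concrete, below is a minimal sketch of a custom analyzer assembled from a tokenizer, a lower-casing filter, a stop-word filter and a stemmer. It is only an illustration, not part of the original answer: the class name is made up, and the exact package names and the createComponents signature vary between Lucene versions (this assumes a reasonably recent release).

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Hypothetical English analyzer built from the three pieces described above.
public class SimpleEnglishAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();                          // split on spaces/punctuation
        TokenStream stream = new LowerCaseFilter(tokenizer);                    // normalize case
        stream = new StopFilter(stream, EnglishAnalyzer.getDefaultStopSet());   // drop "a", "the", ...
        stream = new PorterStemFilter(stream);                                  // reduce words to their stems
        return new TokenStreamComponents(tokenizer, stream);
    }
}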

The tokenizer splits your text into chunks, and since different analyzers may use different tokenizers, you can get different output token streams, i.e. different sequences of chunks of text. For example, the KeywordAnalyzer you mentioned doesn't split the text at all and takes the whole field as a single token. At the same time, StandardAnalyzer (and most other analyzers) uses spaces and punctuation as split points. For example, for the phrase "I am very happy" it will produce the list ["i", "am", "very", "happy"] (or something like that). For more information on a specific analyzer/tokenizer, see its Java docs.
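If you want to see the difference yourself, you can dump the tokens each analyzer produces. Here is a small self-contained sketch (the field name "body" is arbitrary, and analyzer constructors differ slightly between Lucene versions):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PrintTokens {
    // Print every token the given analyzer emits for the given text.
    static void printTokens(Analyzer analyzer, String text) throws Exception {
        try (TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());
            }
            stream.end();
        }
    }

    public static void main(String[] args) throws Exception {
        // StandardAnalyzer splits and lower-cases: i / am / very / happy
        printTokens(new StandardAnalyzer(), "I am very happy");
        // KeywordAnalyzer keeps the whole field as a single token: "I am very happy"
        printTokens(new KeywordAnalyzer(), "I am very happy");
    }
}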

Stemmers are used to get the base form of a word. This heavily depends on the language used. For example, for the previous phrase in English something like ["i", "be", "veri", "happi"] will be produced, while for the French "Je suis très heureux" a French analyzer (like SnowballAnalyzer initialized with "French") will produce ["je", "être", "tre", "heur"]. Of course, if you use an analyzer for one language to stem text in another, the rules of that other language will be applied and the stemmer may produce incorrect results. The whole system doesn't fail, but the search results may then be less accurate.
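One version note (my addition, not part of the original answer): SnowballAnalyzer was deprecated and later removed from Lucene, and the per-language analyzers such as EnglishAnalyzer and FrenchAnalyzer are the usual replacement. Reusing the printTokens helper sketched above, a usage example might look like this:

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;

public class StemmingDemo {
    public static void main(String[] args) throws Exception {
        // English stemming of the English phrase ...
        PrintTokens.printTokens(new EnglishAnalyzer(), "I am very happy");
        // ... and French stemming of the French phrase; the exact stems
        // depend on the Lucene version and the stemmer it ships with.
        PrintTokens.printTokens(new FrenchAnalyzer(), "Je suis très heureux");
    }
}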

KeywordAnalyzer doesn't use any stemmer; it passes the whole field through unmodified. So if you are going to search for individual words in English text, it isn't a good idea to use this analyzer.

Stop words are the most frequent and almost useless words. Again, this heavily depends on the language. For English these words are "a", "the", "I", "be", "have", etc. Stop-word filters remove them from the token stream to lower the noise in the search results, so finally our phrase "I'm very happy" with StandardAnalyzer will be transformed to the list ["veri", "happi"].
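Two caveats worth adding here (my additions, not part of the original answer): StandardAnalyzer itself does not stem, so on its own it would produce something like ["very", "happy"] rather than the stemmed forms; the stemmed list above assumes a stemming analyzer such as EnglishAnalyzer. Also, recent Lucene releases ship StandardAnalyzer with an empty default stop set, so the stop words have to be passed in explicitly, for example:

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class StopWordsExample {
    public static void main(String[] args) {
        // StandardAnalyzer with an explicit English stop-word set; on older
        // versions the no-arg constructor already included these words.
        StandardAnalyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.getDefaultStopSet());
        System.out.println(analyzer.getStopwordSet());
    }
}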

And KeywordAnalyzer again does nothing here. So KeywordAnalyzer is used for things like IDs or phone numbers, but not for ordinary text.
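If you need both behaviours in the same index, for example an exact-match ID field next to analyzed text fields, Lucene's PerFieldAnalyzerWrapper lets you choose an analyzer per field. A minimal sketch (the field names are made up):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class PerFieldExample {
    public static void main(String[] args) {
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("id", new KeywordAnalyzer());      // IDs indexed as a single token
        perField.put("phone", new KeywordAnalyzer());   // phone numbers too

        // StandardAnalyzer handles every other (free-text) field.
        Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

        // Pass this analyzer to IndexWriterConfig and to your query parser so
        // indexing and searching agree on how each field is analyzed.
        System.out.println(analyzer);
    }
}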

As for your maxClauseCount exception, I believe you get it while searching. In that case it is most probably caused by a search query that is too complex. Try splitting it into several queries, or use more low-level functions.
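For completeness (my addition, not part of the original answer): the exception comes from a built-in cap on how many clauses a BooleanQuery may expand to, 1024 by default. If splitting the query is not practical, the limit can be raised, but where the setter lives depends on the Lucene version, so treat the following as a sketch to adapt:

public class RaiseClauseLimit {
    public static void main(String[] args) {
        // Older Lucene versions (up to roughly 7.x) expose the limit on BooleanQuery:
        // org.apache.lucene.search.BooleanQuery.setMaxClauseCount(4096);

        // Newer versions (roughly 8.x and later) moved it to IndexSearcher:
        org.apache.lucene.search.IndexSearcher.setMaxClauseCount(4096);
    }
}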
