Problem description

I am setting up a Solr search engine that will index multiple languages. I created a custom UpdateProcessorFactory to figure out which sections of the input text are in which language, and then I copy those sections of the document into language-specific fields. For example, with this text:

"Hello World, Bonjour le Monde, Hallo Welt."

It copies "Hello World" into the en-text field, "Bonjour le Monde" into the fr-text field, and "Hallo Welt" into the de-text field. Each field has the appropriate language analyzers to tokenize and stem the words.
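The routing step described above can be sketched in a few lines. This is only an illustration, not the asker's actual UpdateProcessorFactory (which would be written in Java inside Solr): it assumes language detection has already produced (language code, text) pairs, and the helper name build_solr_doc and the "id" field are hypothetical.

```python
from collections import defaultdict


def build_solr_doc(doc_id, labeled_segments):
    """Group already-labeled text segments into per-language fields.

    labeled_segments is a list of (language_code, text) pairs; producing
    those labels (the language detection itself) is assumed to happen
    elsewhere and is not shown here.
    """
    parts_by_field = defaultdict(list)
    for lang, text in labeled_segments:
        # Field naming follows the question: en-text, fr-text, de-text.
        parts_by_field[f"{lang}-text"].append(text)

    doc = {"id": doc_id}  # "id" is an assumed unique-key field name
    for field, parts in parts_by_field.items():
        doc[field] = " ".join(parts)
    return doc


doc = build_solr_doc(
    "doc-1",
    [("en", "Hello World"), ("fr", "Bonjour le Monde"), ("de", "Hallo Welt")],
)
print(doc)
```

Each resulting field can then be mapped in schema.xml to a field type with the analyzer for its language.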

In the end I would like to have one box for a user to enter search terms that would search across all languages. The search terms don't need to be translated, but they should be stemmed appropriately. What is the best way to accomplish this? I'm also very concerned about the performance of the searches.

Recommended answer

The best way is to use the DisMaxRequestHandler. It runs the search terms against every field you list, and for each field it applies the analyzer defined for that field in schema.xml, so the terms are stemmed according to each field's language.

So, if your query looks like /solr/select?qt=dismax&qf=en-text%20fr-text%20de-text&q=hello%20world, Solr will do the right thing.
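Building that URL by hand makes it easy to get the escaping wrong (a space must be encoded as %20 or +). A minimal sketch of assembling it from the user's raw input follows; the function name and the base path argument are assumptions for illustration. Note that later Solr releases select the same parser with defType=dismax on the standard handler rather than qt=dismax, so the parameter name may need adjusting for your version.

```python
from urllib.parse import quote, urlencode


def build_dismax_url(base_path, user_query, fields):
    """Build a dismax select URL that searches all given fields."""
    params = {
        "qt": "dismax",          # handler name, as configured in solrconfig.xml
        "qf": " ".join(fields),  # query fields, space-separated
        "q": user_query,         # raw user input, untranslated
    }
    # quote_via=quote encodes spaces as %20 instead of the default "+".
    return f"{base_path}/select?{urlencode(params, quote_via=quote)}"


url = build_dismax_url("/solr", "hello world", ["en-text", "fr-text", "de-text"])
print(url)
```

Adding another language later only means appending its field name to the list passed as fields.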

(assuming you configured dismax as a solr.DisMaxRequestHandler in a requestHandler block in solrconfig.xml)

Most analysis is fast. Your performance is bounded mostly by your index size, total term counts, etc. Be sure to tune everything according to the Solr performance guide on their wiki. I'm currently running a 60GB index and continue to get searches in the sub-100 ms range on hardware that isn't all that fancy.
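One way to check your own setup against a latency target like this is to look at the QTime value (milliseconds spent inside Solr, excluding network transfer) that Solr reports in the responseHeader of each response. The sketch below assumes responses were requested as JSON and already parsed into dictionaries, and that parameters are echoed back under responseHeader.params; the sample values are made up purely to exercise the function.

```python
def slow_queries(responses, threshold_ms=100):
    """Return (QTime, query) pairs for responses slower than threshold_ms."""
    slow = []
    for resp in responses:
        header = resp["responseHeader"]
        qtime = header["QTime"]
        if qtime > threshold_ms:
            # params is only present when Solr echoes request parameters.
            slow.append((qtime, header.get("params", {}).get("q")))
    # Slowest first; sort on QTime only so missing queries cannot break it.
    return sorted(slow, key=lambda pair: pair[0], reverse=True)


# Illustrative, invented responses (not measurements).
sample = [
    {"responseHeader": {"status": 0, "QTime": 12, "params": {"q": "hello world"}}},
    {"responseHeader": {"status": 0, "QTime": 180, "params": {"q": "bonjour monde"}}},
]
print(slow_queries(sample))
```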

08-11 18:30