Hyphen/dash challenge in Solr Lucene

Problem description

I'm trying to get Solr to extract only the second, 7-digit portion of a ticket formatted like n-nnnnnnn.

Originally I hoped to keep the full ticket together. According to the documentation, digits joined with other digits should be kept together, but after hammering away at this problem for some time and looking at the code, I don't think that's the case. Solr always generates two terms. So rather than getting large numbers of matches on the first digit of n-, I'm thinking I can get better query results from just the second portion. Substituting an A for the dash:

    <charFilter class="solr.PatternReplaceCharFilterFactory"
      pattern="\d[A](\d\d\d\d\d\d\d)" replacement="$1" replace="all"
      maxBlockChars="20000"/>

This parses 1A1234567 fine, but the same filter with a hyphen in place of the A (and the same replacement="$1" replace="all" maxBlockChars="20000" attributes) will not parse 1-1234567.

So it looks like the problem is just the hyphen. I've tried \- (escaped), [-], \u002D, \x{45}, and \x045 without success.
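As a sanity check outside Solr, the regex itself can be tried in plain Java. This is only a sketch: it assumes the intended pattern is \d[-](\d\d\d\d\d\d\d) (with the backslashes that appear to have been lost in the listings), and the class and method names are made up for illustration. It uses java.util.regex directly, not the Solr char filter, so it says nothing about what Solr does before or after the filter runs.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr or Lucene.
class HyphenPatternCheck {
    // The assumed charFilter pattern, written as a Java string literal.
    static final Pattern TICKET = Pattern.compile("\\d[-](\\d\\d\\d\\d\\d\\d\\d)");

    // Mirrors replacement="$1" with replace="all": keep only the captured 7 digits.
    static String stripPrefix(String input) {
        return TICKET.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(stripPrefix("1-1234567"));
        System.out.println(stripPrefix("ticket 1-1234567 reopened"));
    }
}
```

In plain Java the hyphen inside the character class matches without any special escaping, so if the configured filter still has no effect, the cause is more likely somewhere else in the analysis chain than in the regex syntax.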

I've tried putting char filters around it:

    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
      pattern="\d[-](\d\d\d\d\d\d\d)" replacement="$1" replace="all" maxBlockChars="20000"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping2.txt"/>

with the mapping:

    "-" => "z"

and then:

    "z" => "-"

It looks like the hyphen is eaten up in the Flex tokenization and isn't even available to the char filter.
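The three-filter chain can be simulated with plain string operations to see what each step receives. This is a rough sketch that assumes the char filters run in the order they are listed and that the middle pattern is \d[-](\d\d\d\d\d\d\d); the class, the method, and the alternative placeholder pattern are hypothetical, and none of the Lucene classes are used.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the mapping / pattern / mapping chain.
class CharFilterChainSketch {
    // Middle filter as configured: still looks for a literal hyphen.
    static final Pattern HYPHEN_PATTERN = Pattern.compile("\\d[-](\\d{7})");
    // Alternative middle filter: looks for the placeholder written by the first mapping.
    static final Pattern PLACEHOLDER_PATTERN = Pattern.compile("\\d[z](\\d{7})");

    static String runChain(String input, Pattern middle) {
        String mapped = input.replace("-", "z");                   // "-" => "z"
        String replaced = middle.matcher(mapped).replaceAll("$1"); // replacement="$1"
        return replaced.replace("z", "-");                         // "z" => "-"
    }

    public static void main(String[] args) {
        System.out.println(runChain("1-1234567", HYPHEN_PATTERN));
        System.out.println(runChain("1-1234567", PLACEHOLDER_PATTERN));
    }
}
```

Under these assumptions the first call returns the input unchanged, because by the time the middle pattern runs every hyphen has already become z and there is nothing left for [-] to match. The second call yields 1234567. Note that the reverse mapping would also turn any genuine z in the text into a hyphen.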

Has anyone had more success with hyphens/dashes in Solr/Lucene? Thanks.


Recommended answer