In LUCENE-5472, Lucene was changed to throw an error when a term is too long, instead of just logging a message. The error shows that SOLR does not accept tokens larger than 32766 bytes:
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="text" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[10, 10, 70, 111, 117, 110, 100, 32, 116, 104, 105, 115, 32, 111, 110, 32, 116, 104, 101, 32, 119, 101, 98, 32, 104, 111, 112, 101, 32, 116]...', original message: bytes can be at most 32766 in length; got 43225
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:671)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
... 54 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 43225
at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
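For context, here is a minimal sketch that reproduces this exception with a bare Lucene IndexWriter (Lucene 8+ API assumed; the class name and field name are illustrative, not from the original post). A StringField is indexed as a single untokenized term, so any value longer than 32766 UTF-8 bytes trips the hard limit introduced by LUCENE-5472:

import java.util.Arrays;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class ImmenseTermRepro {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(), cfg)) {
            // 40,000 ASCII chars = 40,000 UTF-8 bytes, well over the 32766-byte cap.
            char[] huge = new char[40_000];
            Arrays.fill(huge, 'a');
            Document doc = new Document();
            // StringField is not tokenized, so the whole value becomes one term.
            doc.add(new StringField("text", new String(huge), Field.Store.NO));
            writer.addDocument(doc); // throws "Document contains at least one immense term..."
        }
    }
}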
To get past this, I added the two filters shown in bold to the schema:
<field name="text" type="text_en_splitting" termPositions="true" termOffsets="true" termVectors="true" indexed="true" required="false" stored="true"/>
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
**<filter class="solr.TruncateTokenFilterFactory" prefixLength="32700"/>
<filter class="solr.LengthFilterFactory" min="2" max="32700" />**
</analyzer>
</fieldType>
At first the error stayed the same (which led me to think the filters were not set up correctly, maybe?). In the end, thanks to Mr. Bashetti, restarting the server turned out to be the key.
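Incidentally, analyzer changes in the schema only take effect after a core reload or server restart, and documents that are already indexed only change after reindexing. One way to confirm which analysis chain is actually live is Solr's field-analysis handler; the following is a hedged SolrJ sketch (SolrJ 6+ client API assumed; the host, port, and core name mycore are placeholders, not from the original post):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.FieldAnalysisRequest;
import org.apache.solr.client.solrj.response.AnalysisResponseBase.AnalysisPhase;
import org.apache.solr.client.solrj.response.FieldAnalysisResponse;

public class AnalysisCheck {
    public static void main(String[] args) throws Exception {
        // Host, port, and core name are placeholders for your own setup.
        try (HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            FieldAnalysisRequest req = new FieldAnalysisRequest();
            req.addFieldName("text");
            req.setFieldValue("Some sample text");
            FieldAnalysisResponse resp = req.process(client);
            // Each phase is one stage of the index-time chain; after the restart,
            // the Truncate/Length filters should appear in this list.
            for (AnalysisPhase phase : resp.getFieldNameAnalysis("text").getIndexPhases()) {
                System.out.println(phase.getClassName());
            }
        }
    }
}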
The question remains: which is better, LengthFilterFactory or TruncateTokenFilterFactory? And is it correct to assume that one byte is one character (since the filters are supposed to remove "unusual" characters)? Thanks!
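For reference on the byte-versus-character part of the question: the 32766 limit counts UTF-8 bytes, not characters. ASCII characters encode to one byte each, but other characters take two to four bytes, so a character-based prefixLength of 32700 can still exceed the byte limit on multi-byte text. A quick illustration:

import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // One character is not always one byte in UTF-8:
        System.out.println("a".getBytes(StandardCharsets.UTF_8).length);  // 1
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length);  // 2
        System.out.println("好".getBytes(StandardCharsets.UTF_8).length); // 3
        System.out.println("😀".getBytes(StandardCharsets.UTF_8).length); // 4
    }
}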
Best Answer
The error shows that SOLR doesn't accept tokens larger than 32766 bytes.
The problem arose because you previously used a String fieldType for the text field, and you kept hitting the same error after changing the field type because you had not restarted the Solr server after the change.
I don't think there is any need to add TruncateTokenFilterFactory or LengthFilterFactory.
But that is left to you and your requirements.
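On the "which is better" question, note that the two filters behave differently: LengthFilterFactory silently drops every token shorter than min or longer than max, while TruncateTokenFilterFactory keeps every token but cuts it down to prefixLength characters. A minimal sketch of the difference using the underlying Lucene classes (Lucene 5+ analyzers-common API assumed; the class name and sample text are illustrative):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.miscellaneous.TruncateTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class FilterComparison {
    public static void main(String[] args) throws Exception {
        // TruncateTokenFilter keeps "verylongtoken" but cuts it to 4 chars;
        // LengthFilter drops it entirely because it is longer than max=4.
        print(new TruncateTokenFilter(tokenize("ok verylongtoken"), 4)); // ok very
        print(new LengthFilter(tokenize("ok verylongtoken"), 2, 4));     // ok
    }

    static Tokenizer tokenize(String text) {
        Tokenizer tok = new WhitespaceTokenizer();
        tok.setReader(new StringReader(text));
        return tok;
    }

    static void print(TokenStream ts) throws Exception {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) System.out.print(term + " ");
        ts.end();
        ts.close();
        System.out.println();
    }
}

Truncation preserves a searchable prefix of the oversized token, whereas length filtering discards it entirely; since 32k+ byte "terms" are usually junk rather than real words, dropping them is often the simpler choice.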
On solr - How to handle "Document contains at least one immense term" in SOLR?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37070593/