Tika Parser: Excluding PDF attachments

Question

I have a PDF document with attachments (here: joboptions) that Tika should not extract; their contents should not be sent to Solr. Is there any way to exclude certain (or all) PDF attachments in the Tika config?

Recommended answer

Implement a custom org.apache.tika.extractor.DocumentSelector and set it on the ParseContext. Tika calls the DocumentSelector with the metadata of each embedded document to decide whether that document should be parsed.

Example document selector:

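import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

public class CustomDocumentSelector implements DocumentSelector {

  // Called once per embedded document; return false to skip parsing it.
  @Override
  public boolean select(Metadata metadata) {
    // Tika 1.x constant; in Tika 2.x it lives in TikaCoreProperties instead.
    String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
    // Keep attachments without a name; skip anything ending in ".joboptions".
    return resourceName == null || !resourceName.endsWith(".joboptions");
  }
}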

Register it in the ParseContext:

parseContext.set(DocumentSelector.class, new CustomDocumentSelector());
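
The filename check inside select() can be tried out without Tika on the classpath. The following is a minimal sketch, not part of the original answer: AttachmentFilter and its extension blocklist are hypothetical names introduced here for illustration. It also compares case-insensitively, which is a deliberate variation on the case-sensitive endsWith check above.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not a Tika class): decides from the resource name alone
// whether an embedded document should be parsed.
final class AttachmentFilter {

    // Assumed blocklist of lower-case extensions, e.g. ".joboptions".
    private final Set<String> excludedExtensions;

    AttachmentFilter(Set<String> excludedExtensions) {
        this.excludedExtensions = excludedExtensions;
    }

    boolean shouldParse(String resourceName) {
        if (resourceName == null) {
            // No name available: keep the attachment, as the selector above does.
            return true;
        }
        // Lower-case once so that "X.JOBOPTIONS" is matched as well.
        String lower = resourceName.toLowerCase(Locale.ROOT);
        for (String extension : excludedExtensions) {
            if (lower.endsWith(extension)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        AttachmentFilter filter = new AttachmentFilter(Set.of(".joboptions"));
        System.out.println(filter.shouldParse("Print.JOBOPTIONS")); // false
        System.out.println(filter.shouldParse("report.pdf"));       // true
        System.out.println(filter.shouldParse(null));               // true
    }
}
```

Inside CustomDocumentSelector.select, such a helper would be called with the value read from the metadata, for example filter.shouldParse(metadata.get(Metadata.RESOURCE_NAME_KEY)). Returning false for every name would skip all attachments, assuming that is the behaviour wanted.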


09-27 15:35