Problem description
I am trying to extract all the text from various documents, and I am using Apache Tika 1.4 to do it.
RecursiveTikaParser parser = new RecursiveTikaParser(new AutoDetectParser());
ParseContext parseContext = new ParseContext();
parseContext.set(Parser.class, parser);
Here, RecursiveTikaParser is just a wrapper around AutoDetectParser.
Its parse method looks something like this:
ContentHandler content = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
super.parse(stream, content, metadata, context);
System.out.println("Parsed text is " + content.toString());
This code has to handle many different file types, which is why I am using AutoDetectParser.
During testing I noticed that, given an XML file, I can only extract the text between the tags, not the comments or the tags themselves.
Is it possible to extract everything from the file with my current approach?
Recommended answer
Try this:
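// DETECTOR is assumed to be a shared org.apache.tika.detect.Detector, for example:
// private static final Detector DETECTOR = new DefaultDetector();
Metadata metadata = new Metadata();
// Wrap the stream so detection can read ahead and reset without consuming it
stream = TikaInputStream.get(stream);
String mimeType = DETECTOR.detect(stream, metadata).toString();
Parser parser;
if (mimeType.equalsIgnoreCase("application/xml")) {
    // Treat XML as plain text so tags and comments are kept in the output
    parser = new TXTParser();
} else {
    parser = new AutoDetectParser();
}
// -1 disables the default 100,000-character write limit, as in the question
ContentHandler content = new BodyContentHandler(-1);
parser.parse(stream, content, metadata, new ParseContext());
System.out.println(content.toString());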
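To see why sending XML through a plain-text parser changes the result, here is a minimal sketch that uses only the JDK and no Tika classes. It assumes that an XML-aware extractor behaves roughly like a standard SAX content handler, which is told about character data but not about comments or the markup itself. The class name XmlTextDemo and the methods saxText and rawText are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class XmlTextDemo {

    // Collects only character data, roughly what an XML-aware extractor sees.
    static String saxText(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return text.toString();
    }

    // Decodes the same bytes as plain text, so tags and comments survive.
    static String rawText(String xml) {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<note><!-- draft --><to>Ann</to></note>";
        System.out.println("SAX view: " + saxText(xml));
        System.out.println("Raw view: " + rawText(xml));
    }
}
```

For the sample input, the SAX view contains only Ann, while the raw view is the whole document including the comment and tags. That is the same trade-off the answer makes by choosing TXTParser for application/xml.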