Creating an index for XML with StAX for quick access

Problem description

Is there a way to use StAX and JAX-B to create an index and then get quick access to an XML file?

I have a large XML file and I need to find information in it. It is used in a desktop application, so it should work on systems with little RAM.

So my idea is this: create an index, then use it to access data in the large file quickly.

I can't just split the file, because it is an official federal database that I want to use unaltered.

Using an XMLStreamReader I can quickly find an element and then use JAXB to unmarshal it.

    final XMLInputFactory xf = XMLInputFactory.newInstance();
    final XMLStreamReader r = xf.createXMLStreamReader(filename, new FileInputStream(filename));
    final JAXBContext ucontext = JAXBContext.newInstance(Foo.class);
    final Unmarshaller unmarshaller = ucontext.createUnmarshaller();
    r.nextTag();

    while (r.hasNext()) {
        final int eventType = r.next();
        if (eventType == XMLStreamConstants.START_ELEMENT
                && r.getLocalName().equals("foo")
                && Long.parseLong(r.getAttributeValue(null, "bla")) == bla) {
            // JAX-B works just fine:
            final JAXBElement<Foo> foo = unmarshaller.unmarshal(r, Foo.class);
            System.out.println(foo.getValue().getName());

            // But how do I get the offset?
            // cache.put(r.getAttributeValue(null, "id"), r.getCursor()); // ???
            break;
        }
    }

But I can't get the offset. I'd like to use it to prepare an index:

    (id of element) -> (offset in file)

Then I should be able to use the offset to unmarshal from there: open a file stream, skip that many bytes, unmarshal.

I can't find a library that does this. And I can't do it on my own without knowing the position of the file cursor. The javadoc clearly states that there is a cursor, but I can't find a way of accessing it.

Edit:
I'm just trying to offer a solution that works on old hardware so people can actually use it. Not everyone can afford a new and powerful computer. Using StAX I can get the data in about 2 seconds, which is a bit long. But it doesn't need much RAM. Using JAX-B alone requires 300 MB of RAM. Using some embedded DB system would just be a lot of overhead for such a simple task. I'll use JAX-B anyway.
Anything else would be useless to me, since the wsimport-generated classes are already perfect. I just don't want to load 300 MB of objects when I only need a few.

I can't find a DB that needs only an XSD to create an in-memory DB and doesn't use that much RAM. Everything is either made for servers or requires defining a schema and mapping the XML. So I assume it just doesn't exist.

Solution

You could work with an XML parser generated by ANTLR4.

The following works very well on a ~17GB Wikipedia dump (/20170501/dewiki-20170501-pages-articles-multistream.xml.bz2), but I had to increase the heap size using -Xmx6g.

1. Get the XML grammar

    cd /tmp
    git clone https://github.com/antlr/grammars-v4

2. Generate the parser

    cd /tmp/grammars-v4/xml/
    mvn clean install

3. Copy the generated Java files to your project

    cp -r target/generated-sources/antlr4 /path/to/your/project/gen

4. Hook in with a listener to collect character offsets

    package stack43366566;

    import java.util.ArrayList;
    import java.util.List;

    import org.antlr.v4.runtime.ANTLRFileStream;
    import org.antlr.v4.runtime.CommonTokenStream;
    import org.antlr.v4.runtime.tree.ParseTreeWalker;

    import stack43366566.gen.XMLLexer;
    import stack43366566.gen.XMLParser;
    import stack43366566.gen.XMLParser.DocumentContext;
    import stack43366566.gen.XMLParserBaseListener;

    public class FindXmlOffset {

        List<Integer> offsets = null;
        String searchForElement = null;

        public class MyXMLListener extends XMLParserBaseListener {
            @Override
            public void enterElement(XMLParser.ElementContext ctx) {
                // The first Name token of an element is its tag name.
                String name = ctx.Name().get(0).getText();
                if (searchForElement.equals(name)) {
                    // Character index of the element's first token ('<').
                    offsets.add(ctx.start.getStartIndex());
                }
            }
        }

        public List<Integer> createOffsets(String file, String elementName) {
            searchForElement = elementName;
            offsets = new ArrayList<>();
            try {
                XMLLexer lexer = new XMLLexer(new ANTLRFileStream(file));
                CommonTokenStream tokens = new CommonTokenStream(lexer);
                XMLParser parser = new XMLParser(tokens);
                DocumentContext ctx = parser.document();
                ParseTreeWalker walker = new ParseTreeWalker();
                MyXMLListener listener = new MyXMLListener();
                walker.walk(listener, ctx);
                return offsets;
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public static void main(String[] arg) {
            System.out.println("Search for offsets.");
            List<Integer> offsets = new FindXmlOffset().createOffsets(
                    "/tmp/dewiki-20170501-pages-articles-multistream.xml", "page");
            System.out.println("Offsets: " + offsets);
        }
    }

5. Result

Prints:

    Offsets: [2441, 10854, 30257, 51419 ....

6. Read from the offset position

To test the code, I wrote a class that reads each Wikipedia page into a Java object

    @JacksonXmlRootElement
    class Page {
        public Page() {}
        public String title;
    }

using basically this code

    private Page readPage(Integer offset, String filename) {
        try (Reader in = new FileReader(filename)) {
            // The offsets are character offsets, so skip on a Reader, not on a byte stream.
            in.skip(offset);
            ObjectMapper mapper = new XmlMapper();
            mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
            Page object = mapper.readValue(in, Page.class);
            return object;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

The complete example is on GitHub.
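
For readers who want to try the read-from-offset step without ANTLR or Jackson on the classpath, here is a minimal sketch that uses only the StAX classes shipped with the JDK. It is not the code from the answer: the class name OffsetReadSketch and the methods skipFully and readTitleAt are made up for illustration, and the sample document is a tiny stand-in for the real dump. It assumes the index holds character offsets (as the ANTLR listener above produces), so the Reader must decode the file with the same charset that was used when the index was built.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class OffsetReadSketch {

    /** Skips exactly n characters; Reader.skip is allowed to skip fewer per call. */
    static void skipFully(Reader in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                throw new IOException("Reached end of input before offset " + n);
            }
            remaining -= skipped;
        }
    }

    /**
     * Parses the single element that starts at charOffset and returns the text
     * of its first <title> descendant, or null if it has none.
     */
    static String readTitleAt(Reader in, long charOffset)
            throws IOException, XMLStreamException {
        skipFully(in, charOffset);
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            int depth = 0;
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (r.getLocalName().equals("title")) {
                        return r.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                    if (depth == 0) {
                        // The element at the offset has closed; stop before the
                        // parser reaches the rest of the enclosing document.
                        return null;
                    }
                }
            }
            return null;
        } finally {
            r.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the large file; the offset would normally come from the index.
        String xml = "<pages><page><title>A</title></page><page><title>B</title></page></pages>";
        int offset = xml.lastIndexOf("<page>");
        System.out.println(readTitleAt(new StringReader(xml), offset)); // prints B
    }
}
```

The depth counter matters because the parser is started in the middle of a document: once the element at the offset closes, the next thing in the stream is unrelated markup, so the sketch returns before asking for another event. If the index stored byte offsets instead, the skip would have to happen on the InputStream before any decoding.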