Problem description
I'm relatively new to Node.js. I'm trying to convert 83 XML files, each around 400MB in size, into JSON.
Each file contains data like this (except that each element has a large number of additional statements):
<case-file>
  <serial-number>75563140</serial-number>
  <registration-number>0000000</registration-number>
  <transaction-date>20130101</transaction-date>
  <case-file-header>
    <filing-date>19981002</filing-date>
    <status-code>686</status-code>
    <status-date>20130101</status-date>
  </case-file-header>
  <case-file-statements>
    <case-file-statement>
      <type-code>D10000</type-code>
      <text>"MUSIC"</text>
    </case-file-statement>
    <case-file-statement>
      <type-code>GS0351</type-code>
      <text>compact discs</text>
    </case-file-statement>
  </case-file-statements>
  <case-file-event-statements>
    <case-file-event-statement>
      <code>PUBO</code>
      <type>A</type>
      <description-text>PUBLISHED FOR OPPOSITION</description-text>
      <date>20130101</date>
      <number>28</number>
    </case-file-event-statement>
    <case-file-event-statement>
      <code>NPUB</code>
      <type>O</type>
      <description-text>NOTICE OF PUBLICATION</description-text>
      <date>20121212</date>
      <number>27</number>
    </case-file-event-statement>
  </case-file-event-statements>
</case-file>
I have tried a lot of different Node modules, including sax, node-xml, node-expat and xml2json. Obviously, I need to stream the data from the file, pipe it through an XML parser and then convert it to JSON.
I have also read a number of blogs and similar posts that attempt to explain, albeit superficially, how to parse XML.
In the Node universe, I tried sax first, but I can't figure out how to extract the data in a format that I can convert to JSON. xml2json won't work on streams. node-xml looks encouraging, but I can't figure out how it parses chunks in any manner that makes sense. node-expat points to the libexpat documentation, which appears to require a Ph.D. node-elementtree does the same, pointing to the Python implementation without explaining what has been implemented or how to use it.
Can someone point me to an example that I could use to get started?
Recommended answer
Although this question is quite old, I am sharing my problem and solution, which might help anyone who is trying to convert XML to JSON.
The actual problem here is not the conversion but processing huge XML files without having to hold them in memory all at once.
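To make the memory point concrete, here is a minimal sketch, using only Node's built-in `fs` module, of reading a file chunk by chunk so that only one chunk is held in memory at a time. The function name `countChunks` and the 64 KB chunk size are assumptions chosen for illustration; they do not come from any of the packages discussed here.

```javascript
const fs = require("fs");

// Hypothetical helper: walks through a file one chunk at a time and reports
// how many bytes and chunks were seen. Memory use stays around `chunkSize`
// no matter how large the file is, because no chunk is retained.
async function countChunks(filePath, chunkSize = 64 * 1024) {
  const stream = fs.createReadStream(filePath, { highWaterMark: chunkSize });
  let bytes = 0;
  let chunks = 0;
  // Readable streams are async iterable, so each chunk arrives here in turn.
  for await (const chunk of stream) {
    bytes += chunk.length;
    chunks += 1;
  }
  return { bytes, chunks };
}
```

Any real conversion would replace the counting with parsing work, but the shape stays the same: handle a chunk, let it go, move on to the next one.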
Working with almost all of the widely used packages, I came across the following problems -
- A lot of packages support XML to JSON conversion covering all scenarios, but they don't work well with large files.
- Very few packages (like xml-flow, xml-stream) support large XML file conversion, but the conversion process misses a few corner cases where the conversion either fails or gives an unpredictable JSON structure (explained in this SO question).
The ideal solution would combine the advantages of both approaches, which is exactly what I did when I came up with the xtreamer node package.
In simple words, xtreamer accepts a repeating node name just like xml-flow / xml-stream, but emits the repeating XML nodes instead of converted JSON. This provides the following advantages -
- We can pipe xtreamer with any readable stream, as it extends a transform stream.
- The emitted XML nodes can be passed to any XML to JSON parser to get the desired JSON.
- We can go one step further and hook up the JSON parser with xtreamer, which will then invoke the parser and emit JSON accordingly.
- xtreamer has stream as its only dependency and, being a transform stream extension, can be piped flexibly with other streams.
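The idea in the list above can be sketched with Node's built-in `stream.Transform`. This is not xtreamer's code or API; `createNodeSplitter` and its `transformer` parameter are hypothetical names used only to illustrate the technique. The sketch assumes the repeating node is never nested inside itself, is not self-closing, and has a name free of regular-expression metacharacters.

```javascript
const { Transform } = require("stream");
const { StringDecoder } = require("string_decoder");

// Hypothetical helper (not xtreamer's API): emits every complete
// <nodeName>...</nodeName> block found in the incoming chunks. Each block is
// passed through `transformer`, which could be any XML-to-JSON function.
function createNodeSplitter(nodeName, transformer = (xml) => xml) {
  const openTag = new RegExp(`<${nodeName}[\\s>]`);
  const closeTag = `</${nodeName}>`;
  const decoder = new StringDecoder("utf8"); // keeps multi-byte characters intact
  let buffer = "";

  return new Transform({
    readableObjectMode: true, // the transformer may return objects
    transform(chunk, _encoding, callback) {
      buffer += decoder.write(chunk);
      for (;;) {
        const start = buffer.search(openTag);
        if (start === -1) {
          // Keep a short tail in case an opening tag is split across chunks.
          buffer = buffer.slice(-(nodeName.length + 1));
          break;
        }
        const end = buffer.indexOf(closeTag, start);
        if (end === -1) {
          // Node is incomplete: wait for more data, drop what precedes it.
          buffer = buffer.slice(start);
          break;
        }
        const stop = end + closeTag.length;
        this.push(transformer(buffer.slice(start, stop)));
        buffer = buffer.slice(stop);
      }
      callback();
    },
  });
}
```

With the sample data from the question, this could be used as `fs.createReadStream(file).pipe(createNodeSplitter("case-file", toJson))`, where `toJson` stands for whichever XML to JSON parser is preferred. Only one `case-file` node is buffered at a time, so memory use depends on the size of a node rather than the size of the file.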
What if the XML structure is not fixed?
I managed to come up with another sax-based node package, xtagger, which reads the XML file and provides the structure of the file in the following format -
structure: { [name: string]: { [hierarchy: number]: number } };
This package helps to figure out the repeating node name, which can then be passed to xtreamer for parsing.
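To show what a structure of that shape holds, here is a small sketch that builds a `{ name: { hierarchy: count } }` map from an XML string. It is not xtagger's implementation: `tagStructure` is a hypothetical name, it works on an in-memory string rather than a stream, and its regular expression is a simplification that ignores comments and processing instructions but would miscount tags inside CDATA sections or attribute values containing `>`.

```javascript
// Hypothetical sketch (not xtagger's code): counts how often each tag name
// occurs at each depth, with the root element at depth 0.
function tagStructure(xml) {
  const structure = {};
  let depth = 0;
  // Captures: optional leading "/", the tag name, optional trailing "/".
  const tagPattern = /<(\/?)([A-Za-z_][\w.:-]*)[^<>]*?(\/?)>/g;
  for (const [, closing, name, selfClosing] of xml.matchAll(tagPattern)) {
    if (closing) {
      depth -= 1; // closing tag: step back up one level
      continue;
    }
    structure[name] = structure[name] || {};
    structure[name][depth] = (structure[name][depth] || 0) + 1;
    if (!selfClosing) {
      depth += 1; // opening tag: its children sit one level deeper
    }
  }
  return structure;
}
```

A name with a high count at a shallow depth, such as `case-file` directly under the root, is the natural candidate for the repeating node.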
I hope this helps. :)