Problem description
I currently have to fix an existing application to use something other than the DOM interface of libxml2 because it turns out it gets passed XML files so large that they can't be loaded into memory.
I have now rewritten most of the data loading from iterating over the DOM tree to using xmlTextReader without too many problems. (I use xmlNewTextReaderFilename to open a local file.)
It turns out, however, that the subtree containing the large data cannot be processed strictly in document order: I first have to collect a small amount of data that appears later in the subtree. (And the problem is exactly that this subtree contains the bulk of the data, so loading just this subtree into memory doesn't make much sense either.)
The easiest thing would be to just "clone" / "copy" my current reader, read ahead and then return to the original instance to continue reading there. (Seems I'm not the first one ... There's even something implemented on the C# side: XML Reader with Bookmarks.)
There doesn't appear to be any way, however, to "copy" the state of an xmlTextReader.
If I can't re-read part of a file, I could also re-read the whole file, which, although wasteful, would be OK here, but I would still need to remember where I was beforehand.
Is there a simple way to remember where an xmlTextReader is in the current document, so that I can find that position again when reading the document/file a second time?
Here's a problem example:
<root>
  <cat1>
    <data attrib="x1">
      ... here goes up to one GB in stuff ...
    </data>
    <data attrib="y2"> <!-- <<< Want to remember this position without having to re-read the stuff before -->
      ... even more stuff ...
    </data>
    <data attrib="z3">
      <!-- I need (part of) the data here to meaningfully interpret the data in [y2] that
           came before. The best approach would seem to first skip all that data
           and then start back there at <data attrib="y2"> ... not having to re-read
           the whole [x1] data would be a big plus! -->
    </data>
  </cat1>
  ...
</root>
I would like to give a workaround answer based on what I learned on the XML mailing list:
There is no easy way to "clone" the state of an xmlTextReader. What should be possible, and fairly easy, is to count the reads performed on a document.
That is, to read a document with xmlTextReader, you probably call something like the following in a loop:
// looping ...
status = ::xmlTextReaderRead(pReader);
Provided you do that in a structured way (for example, I ended up writing a little wrapper class that encapsulates my usage pattern for xmlReader), it is then relatively easy to add a counter:
// looping ...
// xmlTextReaderRead returns 1 if a node was read, 0 at the end of the document, -1 on error
status = ::xmlTextReaderRead(pReader);
if (1 == status) { // success
    ++m_ReadCounter;
}
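A wrapper along these lines could look roughly like the following. This is only a minimal sketch, not the class from the original answer: the name `CountingReader` and its `position()` method are made up for illustration, and the underlying reader is abstracted as a callable so the sketch compiles without libxml2. With libxml2, the callable would presumably just call `::xmlTextReaderRead(pReader)` on a reader opened with `xmlNewTextReaderFilename`.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical wrapper; the names here are illustrative and not part of libxml2.
// `step` is any callable that advances the underlying reader by one node and
// returns 1 on success, 0 at the end of the input and a negative value on
// error, which is the convention xmlTextReaderRead uses.
class CountingReader {
public:
    explicit CountingReader(std::function<int()> step)
        : m_Step(std::move(step)) {}

    // Advance by one node and count only the reads that succeeded.
    int read() {
        const int status = m_Step();
        if (1 == status) {
            ++m_ReadCounter;
        }
        return status;
    }

    // Number of successful reads so far. Saving this value while the reader
    // sits on an interesting node gives a simple "bookmark" for that node.
    std::size_t position() const { return m_ReadCounter; }

private:
    std::function<int()> m_Step;
    std::size_t m_ReadCounter = 0;
};
```

In the example document, one would store `position()` when the reader reaches `<data attrib="y2">` and keep reading until the needed part of `[z3]` has been collected.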
To re-read a document up to a certain position, you then just call xmlTextReaderRead m_ReadCounter times on a fresh reader, discarding the results, until you reach the position where you want to start again.
Yes, you have to re-parse the document from the beginning, but that may be fast enough. (And it may actually be better/faster than caching a very large part of the document.)
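The replay step can be sketched as a small helper. Again, this is an illustration under assumptions rather than the original code: `skipToPosition` is a hypothetical name, and the reader is passed in as a callable with the same return convention as `xmlTextReaderRead`, so the sketch stays independent of libxml2. It also assumes the file has not changed and is opened the same way as in the first pass, since otherwise the node sequence, and with it the count, may differ.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical helper; not part of libxml2.
// Advances a freshly opened reader by `target` successful reads and discards
// every node on the way. `step` must return 1 on success, 0 at the end of the
// input and a negative value on error (as xmlTextReaderRead does).
// Returns true if the saved position was reached, false if the input ended
// or an error occurred before that.
bool skipToPosition(const std::function<int()>& step, std::size_t target) {
    for (std::size_t done = 0; done < target; ++done) {
        if (step() != 1) {
            return false;
        }
    }
    return true;
}
```

After a successful call the new reader sits on the same node the first reader was on when the counter value was saved, so normal processing can continue from there.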