XSLT: parsing a text node as XML?

Question

In the middle of an XML document I'm transforming, there is a CDATA node which I know is itself composed of XML. I would like to have that "recursively parsed" as XML so that I can transform it too. After searching, I think my question is very similar to "Handling node containing inner escaped XML".

That was a year ago: may I just clarify the following:

  1. It says this cannot be done by some XSLT in one go: rather you need a two-phase approach. I have just bought a shiny new book on XSLT 2.0. Is it still the case that there is no XSLT instruction to "re-parse" a string node as XML?
  2. In my case the XML-string node is just one node in the whole. Therefore in Phase #1 I would only be transforming a fragment of the input XML document; the rest needs passing through unchanged to Phase #2. I see several solutions for passing input to output unchanged, but often it seems they "mostly work" yet skip or fail to handle some kinds of input node. Is there a reliable construct for passing the rest of the input to the output without any changes?
  3. That approach relies on me being able to apply 2 transforms separately. I am limited (existing application) to only being allowed one transform: the XML output is fixed and it is transformed by one XSLT file. The only thing I can do is put whatever I like into that XSLT file, and/or add further XSLT files, but I cannot influence the top-level call that passes the XML through one XSLT file. Is there anything I could put into an XSLT file which could cause the second XSLT transform to be invoked?

Recommended answer

See the updates at the end.

  1. This is the most important question. It's possible to do; the question is whether you'd have to write an XML parser manually in XSLT, or use an extension function, or whether there's a convenient, portable solution. Update: if you can use Saxon's parse() extension function, that's by far your best bet. Do you have access to that?

  2. This one is easy to answer: yes, use the identity transform. This will not preserve all lexical details of the input XML, such as the order of attributes, or whether <foo/> is written as <foo></foo>. However, it will preserve all details that are supposed to matter to XML processors.

But this won't help you if you can't run 2 stylesheets in a pipeline, right?

  3. Hmm... not robustly. If your output is going to be displayed by a browser, or handled by something else that understands an XML stylesheet processing instruction, you could output one of those, and hope (against the spec's recommendation!) that serialization and parsing would occur in between this stylesheet and the one you associated on output. But this would be very fragile. I say "against the spec's recommendation" because here it says

which would imply, without serialization and parsing in between. Not recommended.
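The identity transform mentioned in point 2 is the standard XSLT idiom below; it is shown here as a minimal XSLT 1.0 sketch with no assumptions about the input vocabulary. It copies every node kind (elements, attributes, text, comments, processing instructions) and is meant to be extended with more specific templates.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy the current node, then recurse into
       its attributes and children so that more specific templates
       added later can override the handling of individual nodes. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because any template with a more specific match pattern takes priority over this one, adding a single template for the node that needs special treatment leaves the rest of the document passing through unchanged.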

Update: a new comment says that you don't know in advance which elements will contain CDATA sections. I jumped to the conclusion that this meant you didn't know which elements would contain unparsed data (since XML processors officially don't know or care what elements are in CDATA sections, per se). In that case, all bets are off. As you may know, XML processors are not supposed to know which parts of an XML input doc are in CDATA sections. CDATA is just a different way of escaping markup, an alternative to &lt; etc. Once the data is parsed (which is not properly under the XSLT processor's jurisdiction), you can't tell how it was initially expressed in markup. A left pointy bracket remains a left pointy bracket whether it's expressed as <![CDATA[ < ]]> or &lt;. Just as in C, it doesn't matter whether you specify a character as 'A' or 65 or 0x41; once the program is compiled, your code won't be able to tell the difference.
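To make the point about CDATA concrete, here is a small illustrative document (the element names `examples` and `a` are invented for this sketch). After parsing, both `a` elements contain one text node with exactly the same string value, `<b/>`, so a stylesheet has no way of telling them apart.

```xml
<examples>
  <!-- Escaped with a CDATA section -->
  <a><![CDATA[<b/>]]></a>
  <!-- Escaped with character entities -->
  <a>&lt;b/&gt;</a>
</examples>
```

In XPath terms, `/examples/a[1] = /examples/a[2]` is true for this document.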

Therefore, if you don't have another way of determining which data in your input document needs to be parsed, then none of the above methods will help you: you can't know where to apply saxon:parse(), nor manual parsing, nor disable-output-escaping with a following XSLT transformation.

Workarounds:

  • You could guess, e.g. with test="contains(., '&lt;')", which nodes contain unparsed data. (Note this tests for the left pointy bracket, regardless of whether it's expressed as a character entity, or part of a CDATA section, or any other way.) You'd sometimes get false positives, e.g. if a text node contained the string "year < 2001". Or you could attempt to parse every text node (very inefficient), and for those that parse successfully as well-formed XML documents, output the tree instead of the text.

  • Or you could preprocess the XML with a non-XML tool (like LexEv), which therefore can "see" the CDATA markup. But you've said that you can't control anything outside the single XSLT.

  • Or, ideally, you could send the message back up the chain that the XML you're being given is unworkable: they need to flag somehow, other than by using CDATA markup, which sections contain unparsed data. Usually this would be done either by specifying certain element names, or by using attribute flags. Obviously this would depend on who's supplying the XML.
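The first workaround (guessing with `contains(., '&lt;')`) could be tried out with a diagnostic stylesheet along these lines. This is only a sketch: it copies the input unchanged and uses `xsl:message` to report every text node that contains a left angle bracket, so you can see how many false positives (such as "year < 2001") your data produces before relying on the guess.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: pass everything through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text nodes containing "<" are only candidates for embedded XML.
       The test cannot tell whether the "<" came from a CDATA section
       or from an entity reference in the source. -->
  <xsl:template match="text()[contains(., '&lt;')]">
    <xsl:message>
      <xsl:text>Possible unparsed XML inside element: </xsl:text>
      <xsl:value-of select="name(..)"/>
    </xsl:message>
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>
```

Where the messages end up (standard error, a log, or nowhere) depends on the processor and on how the host application invokes it.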

Another update: OK, now I understand: so you know which element contains unparsed data (and you know it's marked up with CDATA), but you don't know which other data might be marked up with CDATA.

For this purpose, "leaving the whole of the rest of the document as original input" does not need to mean preserving any CDATA markup. (The general transformation downstream will not know or care what data is CDATA-escaped.) All that is required is that the one unparsed node gets parsed and the rest does not. The identity transform will do the latter just fine; you can ignore what that page says about CDATA sections on the output, because the downstream XSLT will not know or care. (Unless you have additional (non-XML) requirements for the output that you haven't told us about.)

So if you could do a two-stylesheet transform with serialization and parsing in between (that is, not in a traditional SAX pipeline), then the identity transform would work: you'd just need an additional template for the known unparsed node, with disable-output-escaping, as in Tomalak's answer here.
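A phase-one stylesheet of that kind might look like the sketch below. The element name `payload` is a placeholder for whichever element is known to hold the escaped XML; it is not taken from the question, and this is not a copy of the answer referred to above.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: everything else passes through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- "payload" is a hypothetical element name. Its text is written
       without escaping, so the serialized output contains real markup
       that the second stylesheet sees once the output is re-parsed. -->
  <xsl:template match="payload/text()">
    <xsl:value-of select="." disable-output-escaping="yes"/>
  </xsl:template>

</xsl:stylesheet>
```

Note that `disable-output-escaping` only has an effect when the result tree is actually serialized, and processors are allowed to ignore it. That is why the two phases need a real serialize-and-parse step between them. If the embedded string carries its own XML declaration, the phase-one output will not be well-formed, so that case needs extra handling.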

But if you can't do a two-step transform... what XSLT processor are you using? There may be other avenues specific to it.
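As one example of a processor-specific avenue: if the processor turned out to be a Saxon edition that provides the `saxon:parse()` extension function mentioned earlier, the whole job could be done in a single stylesheet. The sketch below assumes that function is available and that it takes a string and returns a document node; `payload` is again a hypothetical element name.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <!-- Identity template: everything else passes through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- "payload" is a hypothetical element name. Its string value is
       parsed into a temporary document, and templates are applied to
       the parsed nodes, so the embedded XML can be transformed by
       further templates in this same stylesheet. -->
  <xsl:template match="payload">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="saxon:parse(string(.))/node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The string has to be a well-formed XML document in its own right (a single root element), otherwise the parse fails. On processors that implement XSLT 3.0, the standard `parse-xml()` function plays the same role without an extension namespace.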

09-02 11:00