html - XPath expression: Select elements between A HREF="expr" tags

I haven't found a clear way to select all the nodes that lie between two anchors (<a></a> tag pairs) in an HTML file.

The first anchor has the following format:

<a href="file://START..."></a>

The second anchor:

<a href="file://END..."></a>

I have verified that I can select both of them using starts-with (note that I am using the HTML Agility Pack):

HtmlNode n0 = html.DocumentNode.SelectSingleNode("//a[starts-with(@href,'file://START')]");
HtmlNode n1 = html.DocumentNode.SelectSingleNode("//a[starts-with(@href,'file://END')]");

With this in mind, and with my amateur XPath skills, I wrote the following expression to get all the tags between the two anchors:

html.DocumentNode.SelectNodes("//*[not(following-sibling::a[starts-with(@href,'file://START0')]) and not (preceding-sibling::a[starts-with(@href,'file://END0')])]");

This seems to work, but it selects the entire HTML document!

For example, given the following HTML fragment:

<html>
...

<a href="file://START0"></a>
<p>First nodes</p>
<p>First nodes
    <span>X</span>
</p>
<p>First nodes</p>
<a href="file://END0"></a>

...
</html>

I need to remove the two anchors and the three P elements (including, of course, the inner SPAN).

Is there any way to do this?

I don't know whether XPath 2.0 offers a better way to achieve this.

* EDIT (special case!) *

I should also handle the following case:

"Select the tags between X and X', where X is <a href="file://..."></a>"

So, instead of:

<a href="file://START..."></a>
<!-- xhtml to be extracted -->
<a href="file://END..."></a>

I should also handle:

<p>
  <a href="file://START..."></a>
</p>
<!-- xhtml to be extracted -->

<p>
  <a href="file://END..."></a>
</p>

Thank you very much again.

Best answer

Use this XPath 1.0 expression:

//a[starts-with(@href,'file://START')]/following-sibling::node()
     [count(.| //a[starts-with(@href,'file://END')]/preceding-sibling::node())
     =
      count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())
     ]

Or, use this XPath 2.0 expression:

    //a[starts-with(@href,'file://START')]/following-sibling::node()
  intersect
    //a[starts-with(@href,'file://END')]/preceding-sibling::node()

The XPath 2.0 expression uses the XPath 2.0 intersect operator.

The XPath 1.0 expression uses the Kayessian formula (named after @Michael Kay) for the intersection of two node-sets:

$ns1[count(.|$ns2) = count($ns2)]
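To see why this formula yields an intersection: a node of $ns1 passes the predicate only when adding it to $ns2 does not make $ns2 any larger, which means the node already belongs to $ns2. The following is a minimal Python sketch of that counting argument using plain sets. The function name and the node labels are hypothetical stand-ins for illustration, not part of any XPath library.

```python
def kayessian_intersection(ns1, ns2):
    """Mimic $ns1[count(.|$ns2) = count($ns2)] with Python sets.

    A node n from ns1 is kept only if the union {n} | ns2 has the same
    size as ns2, i.e. n is already a member of ns2.
    """
    ns2_set = set(ns2)
    return [n for n in ns1 if len({n} | ns2_set) == len(ns2_set)]


# Hypothetical labels standing in for the sibling nodes of the example document.
following_start = ["p1", "p2", "p3", "END", "tail"]  # following siblings of START
preceding_end = ["head", "START", "p1", "p2", "p3"]  # preceding siblings of END

between = kayessian_intersection(following_start, preceding_end)
print(between)  # ['p1', 'p2', 'p3']
```

Only the labels that appear in both lists survive, which is exactly the set of nodes located after the START anchor and before the END anchor.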

Verification with XSLT:

This XSLT 1.0 transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "    //a[starts-with(@href,'file://START')]/following-sibling::node()
         [count(.| //a[starts-with(@href,'file://END')]/preceding-sibling::node())
         =
          count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())
         ]
  "/>
 </xsl:template>
</xsl:stylesheet>

when applied to the provided XML document:

<html>
...

<a href="file://START0"></a>
<p>First nodes</p>
<p>First nodes
    <span>X</span>
</p>
<p>First nodes</p>
<a href="file://END0"></a>

...
</html>

produces the wanted, correct result:

<p>First nodes</p>
<p>First nodes
    <span>X</span>
</p>
<p>First nodes</p>

This XSLT 2.0 transformation:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "    //a[starts-with(@href,'file://START')]/following-sibling::node()
     intersect
       //a[starts-with(@href,'file://END')]/preceding-sibling::node()
  "/>
 </xsl:template>
</xsl:stylesheet>

when applied to the same XML document (above), again produces exactly the wanted result.
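As a rough cross-check outside XSLT, the same selection can be sketched in Python with the standard library. This is not the XPath approach above: xml.etree.ElementTree has no following-sibling or preceding-sibling axes, so the sketch walks each parent's child list instead. It also returns element nodes only, whereas node() in the XPath expressions matches text nodes and comments as well. The helper names is_anchor and siblings_between are hypothetical, and the "..." placeholders of the sample document are omitted.

```python
import xml.etree.ElementTree as ET

DOC = """<html>
<a href="file://START0"></a>
<p>First nodes</p>
<p>First nodes
    <span>X</span>
</p>
<p>First nodes</p>
<a href="file://END0"></a>
</html>"""


def is_anchor(elem, prefix):
    """True for an <a> element whose href starts with the given prefix."""
    return elem.tag == "a" and elem.get("href", "").startswith(prefix)


def siblings_between(root, start_prefix, end_prefix):
    """Return the element siblings located after the first START anchor
    and before the next END anchor under the same parent."""
    for parent in root.iter():
        children = list(parent)
        start = next(
            (i for i, c in enumerate(children) if is_anchor(c, start_prefix)),
            None,
        )
        if start is None:
            continue
        end = next(
            (i for i, c in enumerate(children)
             if i > start and is_anchor(c, end_prefix)),
            None,
        )
        if end is not None:
            return children[start + 1:end]
    return []


root = ET.fromstring(DOC)
selected = siblings_between(root, "file://START", "file://END")
print([e.tag for e in selected])  # ['p', 'p', 'p']
```

Like the XPath expressions, this sketch assumes both anchors are siblings under the same parent, so it does not cover the special case from the question where each anchor is wrapped in its own <p>.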
Regarding html - XPath expression: Select elements between A HREF="expr" tags, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/6554261/