    <DOC NUMBER=1>
<DOCFULL> -->
<br><div class="c0">
<p class="c1"><span class="c2">Dokument 1 von 3</span></p>
</div>
<br><div class="c0">
<br><p class="c1"><span class="c2">Associated Press Financial Wire</span></p>
</div>
<br><div class="c3">
<p class="c1"><span class="c2">April 25, 2012 Wednesday 9:18 PM GMT </span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c6">Apple CEO Tim Cook emerges from Steve Jobs' shadow</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">BYLINE: </span><span class="c2">By PETER SVENSSON, AP Technology Writer</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">SECTION: </span><span class="c2">BUSINESS NEWS</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">LENGTH: </span><span class="c2">794 words</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">DATELINE: </span><span class="c2">NEW YORK </span></p>
</div>
<br><div class="c4">
<p class="c8"><span class="c2"> MAIN TEXT 1</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">LOAD-DATE: </span><span class="c2">April 26, 2012</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">LANGUAGE: </span><span class="c2">ENGLISH</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">PUBLICATION-TYPE: </span><span class="c2">Newswire</span></p>
</div>
<br><div class="c0">
<br><p class="c1"><span class="c2">Copyright 2012 Associated Press<br>All Rights Reserved</span></p>
</div>
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->

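I am new to XPath and want to use it together with R (Duncan Temple Lang's XML package) to query HTML documents that I receive from LexisNexis. Each document contains several news articles, and each article is delimited by <DOC NUMBER=1> <DOCFULL> tags. I want to extract some information for every article. For example, to extract the section information I have done the following: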
library(XML)
doc <- htmlParse("hmtldoc.HTML")
xpathSApply(doc,"//span[text()='SECTION: ']/..", xmlValue)

This gives me:
[1] "SECTION: BUSINESS NEWS" "SECTION: BUSINESS NEWS" "SECTION: BUSINESS NEWS"

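This is output I can work with. The main problem is that not every article has section information. I need to know which articles provide it and which do not, preferably by getting NA or an empty list element back for the ones that lack it, so that I can match each result to its article.
Related to this problem: I tried to come up with a solution that first selects the DOC or DOCFULL nodes and continues from there, for example: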
xpathSApply(doc,"//DOCFULL/*/span[text()='SECTION: ']/..", xmlValue)

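I thought this would return the same text as above, but it does not. In any case, I am still very new to this language, and any help is appreciated.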

Best answer

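Because there are multiple levels of child elements between DOCFULL and span, a single /*/ step is not enough: it matches exactly one intermediate element, while the span sits two levels down (inside a div, then a p). You will need to either be vague about the depth:

//DOCFULL//*/span[text()='SECTION: ']/..

or be specific about the levels (div and p):

//DOCFULL/*/*/span[text()='SECTION: ']/..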
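
To get one result per article, with NA where the SECTION line is missing, one option is to select the per-article nodes first and then run a relative XPath (one that starts with `.`) inside each of them. The following is only a minimal sketch, assuming every article sits under its own wrapper element in the parsed tree. The `div class="article"` wrapper and the inline HTML are hypothetical stand-ins for the real file, so substitute the XPath for whatever per-article node your parsed document actually contains (for example the DOCFULL element from the question, if the parser exposes it).

```r
library(XML)

# Hypothetical stand-in for the real file: two articles,
# only the first one carries a SECTION line.
html <- paste0(
  '<html><body>',
  '<div class="article"><div><p>',
  '<span>SECTION: </span><span>BUSINESS NEWS</span>',
  '</p></div></div>',
  '<div class="article"><div><p>',
  '<span>LENGTH: </span><span>794 words</span>',
  '</p></div></div>',
  '</body></html>'
)
doc <- htmlParse(html, asText = TRUE)

# One node per article; replace this XPath with the one that
# matches the per-article wrapper in your own document.
articles <- getNodeSet(doc, "//div[@class='article']")

# Query relative to each article (note the leading "."),
# and fall back to NA when the article has no SECTION line.
sections <- sapply(articles, function(article) {
  hit <- xpathSApply(article, ".//span[text()='SECTION: ']/..", xmlValue)
  if (length(hit) == 0) NA_character_ else hit[[1]]
})

print(sections)
```

Because the result has exactly one element per article, its positions line up with the article order, and `which(is.na(sections))` lists the articles that have no section information.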
