How to check whether an XPath exists in R

Problem description

Hoping someone more knowledgeable than me can shed some light here.

As part of a larger web scraper I want to pull metadata out of a set of pages. When I ran it, it fell over; investigation showed this was because one of the requested XPaths did not match anything in the page.

One potential solution I can see is to grab ALL the meta tags for a page into a vector, then check whether each required one exists before building a new vector of just those I want.

However, it would be even better to grab only the bits I want, and only if they exist in the page.

require(XML)
require(RCurl)
parsed <- htmlParse("http://www.coindesk.com/information")

meta <- list()
meta[1] <- xpathSApply(parsed, "//meta[starts-with(@property, \"og:title\")]", xmlGetAttr,"content")
meta[2] <- xpathApply(parsed, "//meta[starts-with(@property, \"og:description\")]", xmlGetAttr,"content")
meta[3] <- xpathApply(parsed, "//meta[starts-with(@property, \"og:url\")]",  xmlGetAttr,"content")
meta[4] <- xpathApply(parsed, "//meta[starts-with(@property, \"article:published_time\")]",  xmlGetAttr,"content")
meta[5] <- xpathApply(parsed, "//meta[starts-with(@property, \"article:modified_time\")]",  xmlGetAttr,"content")

This throws an error because og:description isn't in this page.

Error in meta[2] <- xpathApply(parsed, "//meta[starts-with(@property, \"og:description\")]",  :
  replacement has length zero

Can anyone suggest a simple test that checks for its existence before trying to extract it, failing gracefully with perhaps a NULL response?

Recommended answer

Assuming the error occurs when you try to process the empty list...

> parsed <- htmlParse("http://www.coindesk.com/information")
> meta <- xpathApply(parsed, "//meta[starts-with(@property, \"og:description\")]", xmlGetAttr,"content")
> meta
list()
> length(meta)==0
[1] TRUE

Then test for length(meta)==0, which is TRUE if the element is missing. Otherwise it is FALSE, as in this example extracting the title property:

> meta <- xpathApply(parsed, "//meta[starts-with(@property, \"og:title\")]", xmlGetAttr,"content")
> meta
[[1]]
[1] "Beginner's guide to bitcoin - CoinDesk's Information Center"

> length(meta)==0
[1] FALSE
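
The length test above can be wrapped in a small helper so that each lookup returns a placeholder instead of failing. The following is only a sketch: get_meta is a hypothetical name, not part of the XML package, and the inline HTML string is made up to stand in for a downloaded page so that the example runs offline. It returns NA rather than NULL so that every requested property keeps a slot in the result.

```r
library(XML)

# Hypothetical helper (not part of the XML package): return the content
# attribute of the first <meta> tag whose property starts with `property`,
# or NA if no such tag (or no content attribute) exists.
get_meta <- function(doc, property) {
  xpath <- sprintf("//meta[starts-with(@property, '%s')]", property)
  hits <- xpathApply(doc, xpath, xmlGetAttr, "content")
  if (length(hits) == 0 || is.null(hits[[1]])) NA_character_ else hits[[1]]
}

# Made-up inline HTML standing in for a real page; og:description is absent.
html <- '<html><head>
  <meta property="og:title" content="Example title"/>
  <meta property="og:url" content="http://example.com/"/>
</head><body></body></html>'
doc <- htmlParse(html, asText = TRUE)

# Look up each wanted property; missing ones come back as NA.
wanted <- c("og:title", "og:description", "og:url")
meta <- lapply(setNames(wanted, wanted), function(p) get_meta(doc, p))
print(meta)
```

With this approach the missing og:description entry shows up as NA in the named list instead of triggering the "replacement has length zero" error, and is.na() can be used afterwards to filter out the properties that were not found.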
