I have a bunch of web pages and I want to extract their publication dates. For some pages, the date is in an <abbr> tag, like this:

    <abbr class="published" title="2012-03-14T07:13:39+00:00">2012-03-14, 7:13</abbr>

and I was able to get the dates using:

    doc <- htmlParse(theURL, asText = TRUE)
    xpathSApply(doc, "//abbr", xmlValue)
But for other pages, the dates are in <meta> tags, for example:

    <meta name="created" content="2011-12-29T11:49:23+00:00">
    <meta name="OriginalPublicationDate" content="2012/11/14 10:56:58">
I tried xpathSApply(doc, "//meta",xmlValue), but it didn't work.
So, what pattern should I use instead of "//meta"?
Thank you!
Using this page as an example:
    library(XML)
    url <- "http://stackoverflow.com/questions/22342501/"
    doc <- htmlParse(url, useInternalNodes = TRUE)
    # Indexing the parsed document with an XPath string returns the matching nodes
    names <- doc["//meta/@name"]
    content <- doc["//meta/@content"]
    cbind(names, content)
    #      names            content
    # [1,] "twitter:card"   "summary"
    # [2,] "twitter:domain" "stackoverflow.com"
    # [3,] "og:type"        "website"
    # [4,] "og:image"       "http://cdn.sstatic.net/stackoverflow/img/[email protected]?v=fde65a5a78c6"
    # [5,] "og:title"       "how to get information within <meta name...> tag in html using htmlParse and xpathSApply"
    # [6,] "og:description" "I have a bunch of webpages and I want to extract their publishing dates. \nFor some webpages, the da" [truncated]
    # [7,] "og:url"         "http://stackoverflow.com/questions/22342501/how-to-get-information-within-meta-name-tag-in-html-usi" [truncated]
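The problem with

    xpathSApply(doc, "//meta", xmlValue)

is that xmlValue(...) returns the element's content (i.e., the text part of an element). <meta> tags have no text; their values are stored in the name and content attributes, so those attributes have to be queried instead.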
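To pull just the date out of a particular <meta> tag, one option is to filter on the name attribute in the XPath expression and read the content attribute with xmlGetAttr. The following is a minimal sketch, assuming the XML package is installed; the HTML string is a made-up stand-in for a downloaded page, built from the tags quoted in the question, and the variable names are only for illustration.

```r
library(XML)

# Hypothetical sample page (not a real site), using the tags from the question
html <- '<html><head>
<meta name="created" content="2011-12-29T11:49:23+00:00">
<meta name="OriginalPublicationDate" content="2012/11/14 10:56:58">
</head><body>
<abbr class="published" title="2012-03-14T07:13:39+00:00">2012-03-14, 7:13</abbr>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Select <meta> nodes by their name attribute, then read the content attribute
created <- xpathSApply(doc, "//meta[@name='created']",
                       xmlGetAttr, "content")
orig_pub <- xpathSApply(doc, "//meta[@name='OriginalPublicationDate']",
                        xmlGetAttr, "content")

# The same approach reads the machine-readable date from the <abbr> title
abbr_title <- xpathSApply(doc, "//abbr[@class='published']",
                          xmlGetAttr, "title")

print(created)
print(orig_pub)
print(abbr_title)
```

Extra arguments to xpathSApply are passed on to the function, so "content" and "title" here become the attribute name given to xmlGetAttr. Pages that use a different name for the date tag would need a different XPath predicate.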