How to get information within <meta name...> tag in HTML using htmlParse and xpathSApply

Problem description

I have a bunch of web pages and I want to extract their publishing dates. For some pages, the date is in an abbr tag, for example:

    <abbr class="published" title="2012-03-14T07:13:39+00:00">2012-03-14, 7:13</abbr>

and I was able to get the dates using:

    doc <- htmlParse(theURL, asText = TRUE)
    xpathSApply(doc, "//abbr", xmlValue)

But for other pages, the dates are in meta tags, for example:

    <meta name="created" content="2011-12-29T11:49:23+00:00">
    <meta name="OriginalPublicationDate" content="2012/11/14 10:56:58">

I tried xpathSApply(doc, "//meta", xmlValue), but it didn't work.
So, what pattern should I use instead of "//meta"?
Thank you!

Solution

Using this page as an example:

    library(XML)
    url <- "http://stackoverflow.com/questions/22342501/"
    doc <- htmlParse(url, useInternalNodes = TRUE)
    # Indexing the parsed document with an XPath string returns the matching attribute values
    names <- doc["//meta/@name"]
    content <- doc["//meta/@content"]
    cbind(names, content)
    #      names            content
    # [1,] "twitter:card"   "summary"
    # [2,] "twitter:domain" "stackoverflow.com"
    # [3,] "og:type"        "website"
    # [4,] "og:image"       "http://cdn.sstatic.net/stackoverflow/img/[email protected]?v=fde65a5a78c6"
    # [5,] "og:title"       "how to get information within <meta name...> tag in html using htmlParse and xpathSApply"
    # [6,] "og:description" "I have a bunch of webpages and I want to extract their publishing dates. \nFor some webpages, the da" [truncated]
    # [7,] "og:url"         "http://stackoverflow.com/questions/22342501/how-to-get-information-within-meta-name-tag-in-html-usi" [truncated]

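The problem with

    xpathSApply(doc, "//meta", xmlValue)

is that xmlValue(...) returns the element content (i.e., the text part of an element). <meta> tags have no text; their information is stored in the name and content attributes, so the XPath has to select those attributes instead.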
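
For the two meta tags in the question, a minimal sketch might look like the following. It assumes the page source is already available as a string (the `html` variable here is a made-up sample, not one of the asker's pages) and that only the `created` and `OriginalPublicationDate` names are of interest. It shows two equivalent approaches: selecting the attribute directly in the XPath, and selecting the node and reading the attribute with xmlGetAttr.

```r
library(XML)

# Hypothetical sample page, built inline so the sketch needs no network access.
html <- '<html><head>
<meta name="created" content="2011-12-29T11:49:23+00:00">
<meta name="OriginalPublicationDate" content="2012/11/14 10:56:58">
</head><body><p>Some text</p></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Option 1: let XPath pick the content attribute of the meta tag named "created".
created <- unname(xpathSApply(doc, "//meta[@name='created']/@content"))

# Option 2: select the meta node, then read its attribute with xmlGetAttr.
published <- xpathSApply(doc, "//meta[@name='OriginalPublicationDate']",
                         xmlGetAttr, "content")

# For comparison: xmlValue on a meta node yields an empty string,
# because the element has no text content.
empty <- xpathSApply(doc, "//meta[@name='created']", xmlValue)

print(created)
print(published)
```

The predicate `[@name='...']` keeps the query specific to one tag, which may be preferable when a page carries many unrelated meta tags. The resulting strings are still plain text; converting them to dates is a separate step that depends on each site's format.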
