Scraping JavaScript-generated content in R

Question

I find that web scraping tasks in R can often be handled with the easy-to-use rvest package by fetching the HTML code that generates a webpage. This "usual" approach (as I may call it), however, seems to fall short when the website uses JavaScript to display the relevant data. As a working example, I would like to scrape news headlines from this website (the URL in the code below). The two main obstacles for the usual approach are the "load more" button at the bottom and the extraction of the headlines using XPath. In particular:

library(rvest)
library(magrittr)

url = "http://www.nestle.com/media/news-archive#agregator-search-results"
webs = read_html(url)

# Headline of the first news based on its xpath
webs %>% html_nodes(xpath="//*[@id='agregator-search-results']/span[2]/ul/li[1]/a/span[2]/span[1]") %>% html_text
#[1] ""

# Same for the description of the first news
webs %>% html_nodes(xpath="//*[@id='agregator-search-results']/span[2]/ul/li[1]/a/span[2]/span[2]") %>% html_text
#[1] ""

Maybe someone can shed light on one (or both) of the following questions:

  1. Am I missing something obvious here? That is, is it possible to scrape the headlines using the usual rvest-based approach in this case? As far as I currently understand, it is not.
  2. Are RSelenium and PhantomJS the only way to go here? To put it differently, can the task be achieved without RSelenium and PhantomJS in particular? This could include either extracting the headlines or loading more headlines (or both).

Any input is appreciated.

Recommended answer

IMO, it's sometimes better to look for the raw data in the background, i.e. the JSON that the page's JavaScript requests:

library(jsonlite)
library(RCurl)
n <- 8 # number of news items to pull
useragent <- "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0"
url <- sprintf("http://www.nestle.com/_handlers/advancedsearch.ashx?q=Nestle%%2Bdaterange%%3A..2016-01-05&index=0&num=%d&client=Nestle_Corp&site=Nestle_Corp_Media&requiredfields=MediaType:/media/pressreleases/allpressreleases|MediaType:/Media/NewsAndFeatures|MediaType:/Media/News&sort=date:D:R:d1&filter=p&access=p&entsp=a&oe=UTF-8&ie=UTF-8&ud=1&ProxyReload=1&exclude_apps=1&entqr=3&getfields=*", n)
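json <- getURL(url, useragent=useragent) # fetch the raw JSON response
res <- fromJSON(json) # parse it into nested lists / data frames
df <- res$GSP$RES$R # one row per news item
head(cbind(df[, c("U", "T")], df$FS$'@VALUE')) # URL, title, and date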
#                                                                                                U                                                                                 T df$FS$"@VALUE"
# 1                                   http://www.nestle.com/media/newsandfeatures/nestle-150-years &#39;Good Food, Good Life&#39;: Celebrating 150 years of <b>Nestlé</b> <b>...</b>     2016-01-01
# 2                                   http://www.nestle.com/media/newsandfeatures/2015-in-pictures                                           2015 in pictures | <b>Nestlé</b> Global     2015-12-23
# 3                         http://www.nestle.com/media/news/nescafe-dolce-gusto-expands-in-brazil                Coffee superstar: Nescafé Dolce Gusto expands in Brazil <b>...</b>     2015-12-17
# 4                        http://www.nestle.com/media/news/nestle-waters-new-bottling-plant-italy  <b>Nestlé</b> Waters needs youth, for its new bottling plant in Italy <b>...</b>     2015-12-10
# 5 http://www.nestle.com/media/news/nestle-launch-wellness-club-personalised-health-service-japan     Matcha made in nutritional heaven: <b>Nestlé</b> launches Wellness <b>...</b>     2015-12-08
# 6        http://www.nestle.com/media/news/nestle-completes-chf-8-billion-share-buyback-programme          <b>Nestlé</b> completes CHF 8 billion share buyback programme <b>...</b>     2015-12-07
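
The titles in column T still carry HTML markup (<b> tags) and entities (&#39;). Below is a minimal sketch of one way to turn them into plain text, assuming rvest is installed. `clean_title` is a hypothetical helper introduced here for illustration and is not part of the original answer.

```r
library(rvest)

# Hypothetical helper: wrap each title in a <p> element, parse it as HTML,
# and keep only the text, which drops tags and decodes entities.
clean_title <- function(x) {
  vapply(
    x,
    function(s) html_text(read_html(paste0("<p>", s, "</p>"))),
    character(1),
    USE.NAMES = FALSE
  )
}

# Two titles in the same style as the output above
titles <- c(
  "2015 in pictures | <b>Nestlé</b> Global",
  "&#39;Good Food, Good Life&#39;: Celebrating 150 years of <b>Nestlé</b> <b>...</b>"
)
clean_title(titles)
```

With the data frame from the answer, this could be applied as `df$T <- clean_title(df$T)`.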

df contains more information, some of which probably has to be unnested if you want to use it.
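
One possible way to do the unnesting is `jsonlite::flatten()`, which expands data-frame columns into ordinary columns. The sketch below runs on a small mock response rather than the live endpoint; the mock's shape (an `FS` object with `@NAME` and `@VALUE` per item) and the example.com URLs are assumptions modelled on the `df$FS$'@VALUE'` access above, so the real response may differ.

```r
library(jsonlite)

# Mock JSON (assumed shape, not real data from the site)
json <- '{"GSP":{"RES":{"R":[
  {"U":"http://example.com/a","T":"First item",
   "FS":{"@NAME":"date","@VALUE":"2016-01-01"}},
  {"U":"http://example.com/b","T":"Second item",
   "FS":{"@NAME":"date","@VALUE":"2015-12-23"}}
]}}}'

df <- fromJSON(json)$GSP$RES$R

# FS is a data frame stored inside a single column of df
str(df, max.level = 1)

# flatten() turns nested data-frame columns into top-level columns,
# prefixing their names with the parent column name
flat <- jsonlite::flatten(df)
names(flat)
```

The call is namespaced as `jsonlite::flatten()` because other packages (purrr, for example) export a function of the same name.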
