Scraping data from a Flash page

Problem description

I am trying to scrape data from this page:

http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?

If I try to scrape the names of the players using the CSS selector and the usual rvest syntax:

library(rvest)
names <- read_html("http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?") %>%
  html_nodes(".scoring-player-name") %>% html_text()

everything goes well.

Unfortunately, if I try to scrape the statistics below the names (first serve points won, etc.) using the selector .stat-breakdown span, I am not able to retrieve any data.

I know rvest is generally not recommended for scraping dynamically generated pages, but I don't understand why some data are scraped and some are not.

Solution

I don't use rvest. If you follow the code below, you should end up with the format shown in the picture: essentially a single string, which you could transform into a data frame by splitting on the separators : and ,.

The tag it reads (the script element with id matchStatsData) also contains more information than is displayed in the web page's UI. I can also try RSelenium, but I need to get to my other PC first, so I will let you know whether RSelenium works for me.
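One plausible reason the CSS selector returns nothing useful is that the statistics are not in the static HTML and are instead filled in by JavaScript from the data embedded in that script tag. This is an assumption, not something verified against the live page. The sketch below reproduces the situation offline with a made-up HTML fragment: the markup, the player name and the JSON keys are hypothetical, and only the id matchStatsData and the two selectors come from this thread. It shows that rvest can still read the tag's raw text.

```r
library(rvest)

# Hypothetical page fragment (not the real ATP markup): the player name is in
# the static HTML, the statistic's <span> is empty until JavaScript fills it,
# and the raw numbers sit inside a <script> tag.
html <- paste0(
  '<html><body>',
  '<div class="scoring-player-name">Player A</div>',
  '<div class="stat-breakdown"><span></span></div>',
  '<script id="matchStatsData" type="application/json">',
  '[{"playerName": "Player A", "aces": 5}]',
  '</script>',
  '</body></html>'
)
page <- read_html(html)

# Static content: the selector finds the text.
player_names <- page %>% html_nodes(".scoring-player-name") %>% html_text()

# Content that would be rendered client-side: the node exists but holds no text.
stat_text <- page %>% html_nodes(".stat-breakdown span") %>% html_text()

# The embedded data is still reachable as the raw text of the script tag.
raw_data <- page %>% html_node("#matchStatsData") %>% html_text()
```

If the tag's content turns out to be valid JSON, passing it to jsonlite::fromJSON() may be simpler than stripping characters by hand.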

library(XML)
library(RCurl)
library(stringr)

url <- "http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?"
# download the raw HTML and parse it as text
url2 <- getURL(url)
parsed <- htmlParse(url2, asText = TRUE)
# get the messy data from the script tag
step1 <- xpathSApply(parsed, "//script[@id='matchStatsData']", xmlValue)
# remove some unwanted characters
step2 <- str_replace_all(step1, "\r\n", "")
step3 <- str_replace_all(step2, "\t", "")
step4 <- str_replace_all(step3, "[[{}]\"]", "")

The output is then a string like the one in the picture.
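As a minimal sketch of the transformation step mentioned above, the snippet below splits a cleaned string on , and : and builds a two-column data frame. The sample string and its keys are made up for illustration. The real payload has more fields and may be nested, in which case this flat split would need adapting.

```r
# Hypothetical cleaned string in key:value,key:value form
# (not the real payload from the page).
cleaned <- "playerName:Player A,aces:5,doubleFaults:2"

# Split into key:value pairs on commas.
pairs <- trimws(strsplit(cleaned, ",", fixed = TRUE)[[1]])

# Split each pair at the first colon only, so that values which
# themselves contain ":" stay intact.
keys   <- sub(":.*$", "", pairs)
values <- sub("^[^:]*:", "", pairs)

stats <- data.frame(key = keys, value = values, stringsAsFactors = FALSE)
print(stats)
```

All values come out as character strings here. Numeric columns would still need an explicit conversion such as as.numeric().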
