本文介绍了使用 R 进行网页抓取,Javascript 被禁用的消息的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
您好,我正在尝试使用 R 进行网络抓取,而这个特定的网站给我带来了很多麻烦.我想从这里提取表格:https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017
Hello I am attempting to webscrape in R and this one particular website is giving me a lot of trouble. I wish to extract the table from here:https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017
我尝试过的
代码:
url = 'https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017'
webpage = read_html(url)
data = webpage %>% html_nodes('p') %>% html_text()
data
输出:
[1] "\r\n The page could not be loaded. This web site
currently does not fully support browsers with \"JavaScript\" disabled.
Please note that if you choose to continue without enabling
\"JavaScript\" certain functionalities on this website may not be
available.\r\n
推荐答案
在这种情况下,您可能需要使用 RSelenium
使用 docker
抓取 Javascript 网站
In this cases, you may want to use RSelenium
with docker
to scrape a Javascript website
require("RSelenium")
require("rvest")
system('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- RSelenium::remoteDriver(
remoteServerAddr = "localhost",
port = 4445L,
browserName = "firefox"
)
#Start the remote driver
remDr$open()
url = 'https://www.nationsreportcard.gov/profiles/stateprofile?
chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017'
remDr$navigate(url)
doc <- read_html(remDr$getPageSource()[[1]])
table <- doc %>%
html_nodes(xpath = '//*[@id="gridAvergeScore"]/table') %>%
html_table(fill=TRUE)
head(table[[1]])
## JURISDICTION AVERAGE SCORE (0 - 500) AVERAGE SCORE (0 - 500) ACHIEVEMENT LEVEL PERCENTAGES ACHIEVEMENT LEVEL PERCENTAGES
## 1 JURISDICTION Score Difference from National public (NP) At or above Basic At or above Proficient
## 2 Massachusetts 249 10 87 53
## 3 Minnesota 249 10 86 53
## 4 DoDEA 249 9 91 51
## 5 Virginia 248 9 87 50
## 6 New Jersey 248 9 87 50
这篇关于使用 R 进行网页抓取,Javascript 被禁用的消息的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!