The PGA Tour's website has a leaderboard page, and I am trying to scrape the main table on it for a project.
library(dplyr)
library(rvest)  # provides html_nodes() and html_table()
leaderboard_table <- xml2::read_html('https://www.pgatour.com/leaderboard.html') %>%
  html_nodes('table') %>%
  html_table()
However, instead of pulling the tables, it returns this odd output...
Other pages such as the schedule page scrape fine without any issues, see below. It is only the leaderboard page I am having trouble with.
schedule_url <- 'https://www.pgatour.com/tournaments/schedule.html'
schedule_table <- xml2::read_html(schedule_url) %>% html_nodes('table.table-styled') %>% html_table()
schedule_df <- schedule_table[[1]]
# this works fine
Edit Before Bounty: the answer below is a helpful start, but there is a problem. The JSON file's name changes based on the round (/r/003 for the 3rd round) and probably based on other aspects of the golf tournament as well. Currently, this is what I see in the Elements tab:
...however, using the leaderboard URL that links to the .json file, https://lbdata.pgatour.com/2021/r/005/leaderboard.json, is not helping. Instead, I receive this error when using jsonlite::fromJSON
Two questions then:
Is it possible to read this .json file into R (perhaps it is protected in some way)? Maybe it is just an issue on my end, or am I missing something else in R here?
Given that the URL changes, how can I dynamically grab the URL value in R? It would be great if I could grab the whole global.leaderboardConfig object somehow, because that would give me access to the leaderboardUrl.
Thanks!!
As already mentioned, this page is dynamically generated by JavaScript.
Even the JSON file address seems to be dynamic, and the address you're trying to open isn't valid anymore:
https://lbdata.pgatour.com/2021/r/003/leaderboard.json?userTrackingId=exp=1612495792~acl=*~hmac=722f704283f795e8121198427386ee075ce41e93d90f8979fd772b223ea11ab9
An error occurred while processing your request.
Reference #199.cf05d517.1613439313.4ed8cf21
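On the second question, here is a minimal sketch of how the leaderboardUrl could be pulled out of a script body, assuming the page source contains an assignment of the form global.leaderboardConfig = {...}; and that the object is valid JSON (a plain JavaScript object literal with unquoted keys would need a different parser). The script_text string below is a made-up stand-in, not the real page content, and as the expired link above suggests, the extracted URL may still need a valid userTrackingId token before it can be read.

```r
library(jsonlite)

# Hypothetical stand-in for the text of an inline <script> element;
# the real page may format this assignment differently.
script_text <- paste0(
  'var global = global || {}; ',
  'global.leaderboardConfig = {"leaderboardUrl": ',
  '"https://lbdata.pgatour.com/2021/r/005/leaderboard.json"};'
)

# Capture the object literal assigned to global.leaderboardConfig.
# The lazy quantifier stops at the first "};" after the opening brace.
pattern <- "global\\.leaderboardConfig\\s*=\\s*(\\{.*?\\});"
m <- regmatches(script_text, regexec(pattern, script_text, perl = TRUE))[[1]]

# m[1] is the full match, m[2] is the captured JSON text.
config <- fromJSON(m[2])
leaderboard_url <- config$leaderboardUrl
print(leaderboard_url)
```

With RSelenium, the same regular expression could be applied to remDr$getPageSource()[[1]] instead of script_text.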
To get the data, you could use RSelenium after installing a Docker Selenium server (see https://docs.ropensci.org/RSelenium/articles/docker.html).
The installation is straightforward, and Docker is designed to make images work out of the box.
After installing Docker, running the Selenium server is as simple as:
docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.0
Note that, as a whole, this requires over 2 GB of disk space.
Selenium drives a Web browser and, among other things, lets you retrieve the final HTML content of the page after the JavaScript has rendered:
library(RSelenium)
library(rvest)
remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)
# Open the connection to the Selenium server
remDr$open()
remDr$getStatus()
remDr$navigate("https://www.pgatour.com/leaderboard.html")
players <- xml2::read_html(remDr$getPageSource()[[1]]) %>%
  html_nodes(".player-name-col") %>%
  html_text()
total <- xml2::read_html(remDr$getPageSource()[[1]]) %>%
  html_nodes(".total") %>%
  html_text()
# total[-1] drops the first ".total" match (presumably the header cell)
# so that both columns have the same length
data.frame(players = players, total = total[-1])
players total
1 Daniel Berger (PB) -18
2 Maverick McNealy (PB) -16
3 Patrick Cantlay (PB) -15
4 Jordan Spieth (PB) -15
5 Paul Casey (PB) -14
6 Nate Lashley (PB) -14
7 Charley Hoffman (PB) -13
8 Cameron Tringale (PB) -13
...
As the table doesn't use the table tag, html_table doesn't work, and the columns need to be extracted individually.
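Extracting each column separately relies on the vectors lining up. A slightly more defensive variant, sketched here on a small made-up HTML snippet rather than the live page, parses the source once and reads both cells row by row. The ".row" class and the markup are assumptions for illustration; only ".player-name-col" and ".total" come from the code above.

```r
library(rvest)  # html_nodes(), html_node(), html_text()
library(xml2)   # read_html()

# Hypothetical stand-in for remDr$getPageSource()[[1]]: a div-based "table"
# whose header row has a ".total" cell but no ".player-name-col" cell.
page_source <- '
<div class="leaderboard">
  <div class="row header"><span class="name-header">Player</span><span class="total">Total</span></div>
  <div class="row"><span class="player-name-col">Daniel Berger</span><span class="total">-18</span></div>
  <div class="row"><span class="player-name-col">Maverick McNealy</span><span class="total">-16</span></div>
</div>'

page <- read_html(page_source)   # parse once and reuse the result
rows <- html_nodes(page, ".row") # one node per row, header included

# html_node() returns exactly one result per row (NA text when missing),
# so the two columns always have the same length.
leaderboard <- data.frame(
  players = html_text(html_node(rows, ".player-name-col"), trim = TRUE),
  total   = html_text(html_node(rows, ".total"), trim = TRUE),
  stringsAsFactors = FALSE
)

# Drop rows without a player name, such as the header row.
leaderboard <- leaderboard[!is.na(leaderboard$players), ]
print(leaderboard)
```

Filtering on missing player names replaces the hard-coded total[-1], which would silently misalign the columns if the page gained or lost a header cell.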