Problem description
Is there a way to download a file from a website in R when the site does not expose a direct file URL that can be passed to download.file()?
I have this URL:
https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2016&month=0&season1=2016&ind=0
There is a link to export a CSV file, which I want to save to my working directory, but when I right-click the Export Data hyperlink on the webpage and copy the link address, it turns out to be the following script:
javascript:__doPostBack('LeaderBoard1$cmdCSV','')
instead of a URL that gives me access to the CSV file.
Is there any way to tackle this problem?
Solution
You can use RSelenium for jobs like this. The script below works for me exactly as is, and it should work for you as well with the minor edits noted in the text. The solution uses two packages: RSelenium to automate Chrome, and here to locate your current project directory.
library(RSelenium)
library(here)
Here's the URL you provided:
url <- paste0(
  "https://www.fangraphs.com/leaders.aspx",
  "?pos=all",
  "&stats=bat",
  "&lg=all",
  "&qual=y",
  "&type=8",
  "&season=2016",
  "&month=0",
  "&season1=2016",
  "&ind=0"
)
Here's the ID of the download button. You can find it by right-clicking the button in Chrome and hitting "Inspect."
button_id <- "LeaderBoard1_cmdCSV"
We're going to automate Chrome to download the file, and it's going to go to your default download location. At the end of the script we'll want to move it to your current directory. So first let's set the name of the file (per fangraphs.com) and your download location (which you should edit as needed):
filename <- "FanGraphs Leaderboard.csv"
# USERPROFILE is set on Windows; on macOS or Linux, use path.expand("~") instead.
download_location <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
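The path above relies on USERPROFILE, which is a Windows environment variable. If you're on macOS or Linux, a sketch like the following might help. Note that default_download_dir is a hypothetical helper name, and it assumes your browser saves to a Downloads folder inside your home directory, which may not match your own settings.

```r
# Hypothetical helper: guess the browser's default download folder.
# Assumption: downloads go to "<home>/Downloads" on every platform.
default_download_dir <- function() {
  if (.Platform$OS.type == "windows") {
    home <- Sys.getenv("USERPROFILE")
  } else {
    home <- path.expand("~")
  }
  file.path(home, "Downloads")
}

download_location <- default_download_dir()
```

If your browser is configured to save somewhere else, set download_location by hand instead.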
Now you'll want to start a browser session. I use Chrome, and specifying this particular Chrome version (using the chromever argument) works for me. YMMV; check the best way to start a browser session for you.
An rsDriver object has two parts: a server and a browser client. Most of the magic happens in the browser client.
driver <- rsDriver(
  browser = "chrome",
  chromever = "74.0.3729.6"
)
server <- driver$server
browser <- driver$client
Using the browser client, navigate to the page and click that button.
Quick note before you do: RSelenium may start looking for the button and trying to click it before there's anything to click. So I added a few lines to watch for the button to show up, and then click it once it's there.
buttons <- list()
browser$navigate(url)
# Poll until the button exists, pausing briefly between attempts.
while (length(buttons) == 0) {
  buttons <- browser$findElements(button_id, using = "id")
  Sys.sleep(0.1)
}
buttons[[1]]$clickElement()
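One caveat with the loop above is that it never gives up if the button never appears, for example if the page layout changes. Here is a minimal sketch of the same polling idea with a timeout. The name wait_for_elements and its timeout and interval arguments are my own assumptions, not part of RSelenium; the only RSelenium call used is findElements.

```r
# Hypothetical helper: poll findElements() until something matches
# or the timeout (in seconds) runs out.
wait_for_elements <- function(browser, value, using = "id",
                              timeout = 30, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    found <- browser$findElements(using = using, value = value)
    if (length(found) > 0) {
      return(found)
    }
    if (Sys.time() >= deadline) {
      stop("No element matched '", value, "' within ", timeout, " seconds")
    }
    Sys.sleep(interval)
  }
}

# Usage with a live session (not run here):
# buttons <- wait_for_elements(browser, button_id, using = "id")
# buttons[[1]]$clickElement()
```

The pause between attempts keeps the loop from hammering the Selenium server with requests while the page is still loading.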
Then wait for the file to show up in your downloads folder, and move it to the current project directory:
while (!file.exists(file.path(download_location, filename))) {
  Sys.sleep(0.1)
}
file.rename(file.path(download_location, filename), here(filename))
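The wait-and-move step can also be wrapped up so that it stops waiting after a while and copes with a failed rename (file.rename can fail, for instance, when the two folders are on different drives). This is only a sketch: wait_for_file and move_file are hypothetical helper names, and the default timeout is an arbitrary choice.

```r
# Hypothetical helper: block until `path` exists, or fail after `timeout` seconds.
wait_for_file <- function(path, timeout = 60, interval = 0.1) {
  deadline <- Sys.time() + timeout
  while (!file.exists(path)) {
    if (Sys.time() >= deadline) {
      stop("File did not appear within ", timeout, " seconds: ", path)
    }
    Sys.sleep(interval)
  }
  invisible(path)
}

# Hypothetical helper: try a rename first, then fall back to copy and delete.
move_file <- function(from, to) {
  moved <- suppressWarnings(file.rename(from, to))
  if (!moved) {
    moved <- file.copy(from, to, overwrite = TRUE) && file.remove(from)
  }
  invisible(moved)
}

# Usage with the variables defined earlier (not run here):
# downloaded <- wait_for_file(file.path(download_location, filename))
# move_file(downloaded, here::here(filename))
```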
Lastly, always clean up your server and browser client, or RSelenium gets quirky with you.
browser$close()
server$stop()
And you're on your merry way!
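If a step in the middle of the script throws an error, the cleanup lines never run. One way to guard against that, sketched here under my own assumptions, is to do the work inside a function and register the cleanup with on.exit. The name with_browser is hypothetical; it only assumes the driver has the client and server parts shown earlier.

```r
# Hypothetical wrapper: run `task(browser)` and always close the client
# and stop the server afterwards, even if `task` fails.
with_browser <- function(driver, task) {
  on.exit({
    try(driver$client$close(), silent = TRUE)
    try(driver$server$stop(), silent = TRUE)
  }, add = TRUE)
  task(driver$client)
}

# Usage with a live session (not run here):
# with_browser(driver, function(browser) {
#   browser$navigate(url)
#   # ... find the button, click it, wait for the file ...
# })
```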
Note that you won't always have an element ID to use, and that's OK. IDs are great because they uniquely identify an element and using them requires almost no knowledge of web languages. But if you don't have an ID to use, you have a lot of other options to put in place of using = "id" above:
using = "xpath"
using = "css selector"
using = "name"
using = "tag name"
using = "class name"
using = "link text"
using = "partial link text"
Those give you a ton of alternatives and really allow you to identify anything on the page. findElements will always return a list. If there's nothing to find, that list will be of length zero. If it finds multiple elements, you'll get all of them.
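Because the result is always a list, indexing it with [[1]] fails with a "subscript out of bounds" error when nothing matched. A small guard like the sketch below gives a clearer message. first_element is a hypothetical helper of mine, not an RSelenium function, and it works on whatever list findElements returned.

```r
# Hypothetical helper: return the first match, with a readable error
# for zero matches and a warning when there are several.
first_element <- function(elements, description = "element") {
  if (length(elements) == 0) {
    stop("No ", description, " found on the page")
  }
  if (length(elements) > 1) {
    warning(length(elements), " matches for ", description, "; using the first")
  }
  elements[[1]]
}

# Usage with a live session (not run here):
# button <- first_element(
#   browser$findElements(button_id, using = "id"),
#   description = "export button"
# )
# button$clickElement()
```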
XPath and CSS selectors in particular are super versatile. And you can find them without really knowing what you're doing. Let's walk through an example with the "Sign In" button on that page, which in fact does not have an ID.
Start in Chrome by pressing Control+Shift+J to open the Developer Console. In the upper left corner of the panel that shows up is a little icon for selecting elements. Click that, and then click on the element you want.
That'll pull it up (highlight it) over in the "Elements" panel. Right-click the highlighted line and click "Copy selector." You can also click "Copy XPath," if you want to use XPath.
And that gives you your code!
buttons <- browser$findElements(
  "#linkAccount > div > div.label-account",
  using = "css selector"
)
buttons[[1]]$clickElement()
Boom.