Problem description
I can't get my head around this problem in R, and I would really appreciate any advice.
I am trying to scrape historical bond yield data from https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data for personal use only (of course).
The solution provided in "webscraping data tables and data from a web page" works really well, but it only scrapes the first 24 time stamps of daily data.
What I am trying to achieve is to change the date range in order to scrape more historical data. According to the SelectorGadget tool, the input form for the date range is selected by //*[(@id = "widgetFieldDateRange")]
I have also tried using the following lines of code to change the date values, but without success:
library(rvest)
url1 <- "https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data" #Spain 5yr yield
session <- html_session(url1)
pgform <- html_form(session)[[1]]
pgform$fields[[3]]$value <- "01/01/2010 - 09/10/2020"
result <- submit_form(session, pgform)
Question: any idea how to correctly submit the new date range and retrieve the extended time series?
Thanks a lot for your help!
PS: Unfortunately, the URL does not change based on the date range.
Recommended answer
You can perform the POST request directly:
POST https://www.investing.com/instruments/HistoricalDataAjax
You need to scrape a few pieces of information from the page that are required in the request:
- the pair_ids attribute from a div tag
- the header value from the h2 tag inside the .instrumentHeader class
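As a rough illustration of those two lookups, the sketch below runs the same selectors against a tiny inline HTML snippet instead of the live site. The markup, the id 12345 and the header text are made-up placeholders, and the real page may be structured differently.

```r
library(rvest)

# Hypothetical, simplified stand-in for the real page markup (assumption).
html <- '<html><body>
  <div class="instrumentHeader"><h2>Example Header</h2></div>
  <div pair_ids="12345"></div>
</body></html>'

page <- read_html(html)

# Value of the pair_ids attribute on any div that carries it
pair_ids <- page %>%
  html_nodes("div[pair_ids]") %>%
  html_attr("pair_ids")

# Text of the h2 nested inside the .instrumentHeader element
header <- page %>%
  html_nodes(".instrumentHeader h2") %>%
  html_text()

print(pair_ids)
print(header)
```

Both results are character vectors, which is the form the request body expects.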
The full code:
library(rvest)
library(httr)
startDate <- as.Date("2020-06-01")
endDate <- Sys.Date() #today
userAgent <- "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
mainUrl <- "https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data"
# Open a session on the page (html_session() was renamed session() in rvest 1.0.0)
s <- html_session(mainUrl)

# Instrument id stored in the pair_ids attribute of a div
pair_ids <- s %>%
  html_nodes("div[pair_ids]") %>%
  html_attr("pair_ids")

# Header text expected by the endpoint
header <- s %>% html_nodes(".instrumentHeader h2") %>% html_text()

# rvest:::request_POST is an internal, unexported function,
# so it may not be available in every rvest version
resp <- s %>% rvest:::request_POST(
    "https://www.investing.com/instruments/HistoricalDataAjax",
    add_headers('X-Requested-With' = 'XMLHttpRequest'),
    user_agent(userAgent),
    body = list(
      curr_id = pair_ids,
      header = header[[1]],
      st_date = format(startDate, format = "%m/%d/%Y"),
      end_date = format(endDate, format = "%m/%d/%Y"),
      interval_sec = "Daily",
      sort_col = "date",
      sort_ord = "DESC",
      action = "historical_data"
    ),
    encode = "form") %>%
  html_table

print(resp[[1]])
Output:
Date Price Open High Low Change %
1 Oct 09, 2020 -0.339 -0.338 -0.333 -0.361 2.42%
2 Oct 08, 2020 -0.331 -0.306 -0.306 -0.338 7.47%
3 Oct 07, 2020 -0.308 -0.323 -0.300 -0.324 -0.65%
4 Oct 06, 2020 -0.310 -0.288 -0.278 -0.319 7.27%
5 Oct 05, 2020 -0.289 -0.323 -0.278 -0.331 -10.39%
6 Oct 03, 2020 -0.322 -0.322 -0.322 -0.322 1.42%
7 Oct 02, 2020 -0.318 -0.311 -0.302 -0.320 5.65%
.....................................................
.....................................................
96 Jun 08, 2020 -0.162 -0.152 -0.133 -0.173 13.29%
97 Jun 05, 2020 -0.143 -0.129 -0.127 -0.154 13.49%
98 Jun 04, 2020 -0.126 -0.089 -0.063 -0.148 38.46%
99 Jun 03, 2020 -0.091 -0.120 -0.087 -0.128 -35.00%
100 Jun 02, 2020 -0.140 -0.148 -0.137 -0.166 14.75%
101 Jun 01, 2020 -0.122 -0.140 -0.101 -0.150 -17.57%
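The table comes back with dates and percentages as text. If you want a proper time series, a minimal sketch of the clean-up could look like the following. It uses a two-row sample copied from the output above, and assumes the columns arrive as character strings and that the session locale uses English month abbreviations (`%b` is locale dependent).

```r
# Two-row sample shaped like the output above (values taken from rows 1 and 2)
df <- data.frame(
  Date = c("Oct 09, 2020", "Oct 08, 2020"),
  Price = c(-0.339, -0.331),
  `Change %` = c("2.42%", "7.47%"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Parse "Oct 09, 2020" style strings into Date objects
df$Date <- as.Date(df$Date, format = "%b %d, %Y")

# Strip the percent sign and convert to numeric
df$`Change %` <- as.numeric(sub("%", "", df$`Change %`, fixed = TRUE))

# Sort oldest first, which is usually more convenient for time series work
df <- df[order(df$Date), ]

print(df)
```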
This also works for any other historical-data page if you replace the value of the mainUrl variable.
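If you plan to reuse the request for several pages or date ranges, one option is to build the form body in a small helper. The function name `build_history_body` and its arguments are hypothetical and not part of rvest or httr; the field names are the ones used in the code above, and the id and header in the usage line are placeholders.

```r
# Hypothetical helper (not part of any package): builds the form body
# for the HistoricalDataAjax request from the scraped id and header.
build_history_body <- function(pair_id, header, start_date, end_date) {
  list(
    curr_id = pair_id,
    header = header,
    # The endpoint is sent dates as MM/DD/YYYY in the code above
    st_date = format(as.Date(start_date), "%m/%d/%Y"),
    end_date = format(as.Date(end_date), "%m/%d/%Y"),
    interval_sec = "Daily",
    sort_col = "date",
    sort_ord = "DESC",
    action = "historical_data"
  )
}

# Placeholder id and header, for illustration only
body <- build_history_body("12345", "Example Header", "2010-01-01", "2020-10-09")
str(body)
```

The resulting list can be passed as the body argument of the POST call with encode = "form".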