Problem description
I have been happily scraping Yahoo Finance pages for a long time, using code largely borrowed from other Stack Overflow answers, and it has worked great. In the last few weeks, however, Yahoo has changed its tables to be collapsible/expandable. This has broken the code, and despite several days of effort I can't fix it.
Here is an example of the code that others have used for years (the result is then parsed and processed in different ways by different people).
library(rvest)
library(tidyverse)
# Create a URL string
myURL <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL"
# Create a data frame called df to hold the income statement
df <- myURL %>%
  read_html() %>%
  html_table(header = TRUE) %>%
  map_df(bind_cols) %>%
  as_tibble()
Can anyone help?
Edit for clarity:
If you run the code above and then view df, you get:
# A tibble: 0 x 0
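One plausible explanation for the empty result, assuming the redesigned page builds its grid from div elements rather than table elements: html_table() only parses table nodes, so it finds nothing to convert. The sketch below reproduces that behaviour offline. The HTML fragment and its class names are made up for illustration and are not copied from Yahoo.

```r
library(rvest)

# Hypothetical fragment that mimics a div-based grid (an assumption, not Yahoo's markup)
html <- '<div class="D(tbr)">
  <div class="Ta(start)"><span>Total Revenue</span></div>
  <div class="Ta(end)"><span>100</span></div>
</div>'
page <- read_html(html)

# html_table() only looks for <table> elements, so the list comes back empty
tables <- html_table(page)
length(tables)

# The row is still in the document and can be reached through its div node
row_nodes <- page %>% html_nodes("div.D\\(tbr\\)")
length(row_nodes)
```

Note the doubled backslashes: the parentheses in the class name have to be escaped for the CSS selector, and the backslash itself has to be escaped inside an R string.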
For an example of the expected outcome, we can try another page that Yahoo hasn't changed, such as the following:
# Create a URL string
myURL2 <- "https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL"
df2 <- myURL2 %>%
  read_html() %>%
  html_table(header = FALSE) %>%
  map_df(bind_cols) %>%
  as_tibble()
If you view df2, you get a tibble of 59 observations of two variables, which is the main table on that page, beginning with:
Market Cap (intraday) 5    [value here]
Enterprise Value 3    [value here]
And so on...
Recommended answer
As mentioned in the comment above, here is an alternative that tries to deal with the different table sizes published. I worked on this with help from a friend.
library(rvest)
library(tidyverse)
url <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL"
# Download the page and select the row nodes
# (parentheses in the class names must be escaped with double backslashes)
raw_table <- read_html(url) %>% html_nodes("div.D\\(tbr\\)")
# The number of spans in the first row gives the number of columns
number_of_columns <- raw_table[1] %>% html_nodes("span") %>% length()
if (number_of_columns > 1) {
  # Create an empty data frame with the required dimensions
  df <- data.frame(matrix(ncol = number_of_columns, nrow = length(raw_table)),
                   stringsAsFactors = FALSE)
  # Fill the table by looping through the rows
  for (i in seq_along(raw_table)) {
    # Find the row name and set it
    df[i, 1] <- raw_table[i] %>% html_nodes("div.Ta\\(start\\)") %>% html_text()
    # Now grab the values
    row_values <- raw_table[i] %>% html_nodes("div.Ta\\(end\\)")
    for (j in 1:(number_of_columns - 1)) {
      df[i, j + 1] <- row_values[j] %>% html_text()
    }
  }
  view(df)
}
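The same row-by-row idea can also be written without pre-allocating the data frame. This is only a sketch under stated assumptions: the HTML fragment and its values are invented placeholders so that the example runs offline, row_to_vector is a hypothetical helper name, and there is no guarantee that these class names still match the live page.

```r
library(rvest)

# Invented two-row fragment that reuses the class names from the selectors above
html <- '
<div class="D(tbr)">
  <div class="Ta(start)"><span>Breakdown</span></div>
  <div class="Ta(end)"><span>period 1</span></div>
  <div class="Ta(end)"><span>period 2</span></div>
</div>
<div class="D(tbr)">
  <div class="Ta(start)"><span>Total Revenue</span></div>
  <div class="Ta(end)"><span>100</span></div>
  <div class="Ta(end)"><span>90</span></div>
</div>'

rows <- read_html(html) %>% html_nodes("div.D\\(tbr\\)")

# Hypothetical helper: turn one row node into c(label, value1, value2, ...)
row_to_vector <- function(row) {
  label  <- row %>% html_nodes("div.Ta\\(start\\)") %>% html_text(trim = TRUE)
  values <- row %>% html_nodes("div.Ta\\(end\\)") %>% html_text(trim = TRUE)
  c(label, values)
}

cells <- lapply(seq_along(rows), function(i) row_to_vector(rows[[i]]))

# Pad shorter rows with NA so that every row has the same width
width  <- max(lengths(cells))
padded <- lapply(cells, function(x) c(x, rep(NA_character_, width - length(x))))

# Bind the rows into a character matrix, then convert to a data frame
df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
```

Padding with NA means a row with fewer value cells than the header does not stop the conversion, which is one way of coping with the differing table sizes mentioned above.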