R: web scraping yahoo.finance after the 2019 change


I have been happily scraping yahoo.finance pages for a long time, using code largely borrowed from other Stack Overflow answers, and it has worked well. In the last few weeks, however, Yahoo has changed its tables to be collapsible/expandable. This has broken the code, and despite several days of effort I can't fix it.


Here is an example of the code that others have used for years (which is then parsed and processed in different ways by different people).


library(rvest)   # read_html(), html_table() and the %>% pipe
library(purrr)   # map_df()
library(dplyr)   # bind_cols()

# Create a URL string
myURL <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL"

# Create a data frame called df to hold this income statement
df <- myURL %>%
  read_html() %>%
  html_table(header = TRUE) %>%
  map_df(bind_cols)




If you run the above then view df you get

# A tibble: 0 x 0
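
One likely reason for the empty result is that html_table() only parses <table> elements, while the redesigned page appears to build its rows from nested <div> elements. The sketch below illustrates this on a small inline fragment rather than on the live page. The markup is a hypothetical simplification, and the class names D(tbr), Ta(start) and Ta(end) are assumed to match what the live page uses.

```r
library(rvest)

# Hypothetical, simplified markup: a "row" built from <div> elements, no <table>.
# The value 100 is a made-up placeholder.
page <- read_html('
  <div class="D(tbr)">
    <div class="Ta(start)"><span>Total Revenue</span></div>
    <div class="Ta(end)"><span>100</span></div>
  </div>')

# html_table() looks for <table> elements only, so it finds nothing here.
tables <- page %>% html_table()
length(tables)

# The div-based row is still reachable with a CSS selector. Parentheses in the
# class name must be escaped, and the backslash doubled inside an R string.
rows <- page %>% html_nodes("div.D\\(tbr\\)")
length(rows)
```

If the live page behaves like this fragment, the fix is to select the <div> rows directly instead of relying on html_table().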

For an example of the expected outcome, we can try another page that Yahoo hasn't changed, such as the following:

# Create a URL string
myURL2 <- "https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL"

df2 <- myURL2 %>%
  read_html() %>%
  html_table(header = FALSE) %>%
  map_df(bind_cols)

If you view df2, you get a tibble of 59 observations of two variables, which is the main table on that page, beginning with

Market Cap (intraday) 5    [value here]
Enterprise value 3         [value here]
And so on...



As mentioned in the comment above, here is an alternative that tries to handle the different table sizes that are published. I have worked on this with help from a friend.


library(rvest)

url <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL"

# Download the data; each table row is a div with class "D(tbr)".
# Parentheses must be escaped in the CSS selector, and the backslash
# doubled inside an R string.
raw_table <- read_html(url) %>% html_nodes("div.D\\(tbr\\)")

# The number of spans in the first row gives the number of columns
number_of_columns <- raw_table[1] %>% html_nodes("span") %>% length()

if (number_of_columns > 1) {
  # Create an empty data frame with the required dimensions
  df <- data.frame(matrix(ncol = number_of_columns, nrow = length(raw_table)),
                   stringsAsFactors = FALSE)

  # Fill the table, looping through rows
  for (i in seq_along(raw_table)) {
    # Find the row name and set it
    df[i, 1] <- raw_table[i] %>% html_nodes("div.Ta\\(start\\)") %>% html_text()
    # Now grab the values
    row_values <- raw_table[i] %>% html_nodes("div.Ta\\(end\\)")
    for (j in seq_len(number_of_columns - 1)) {
      df[i, j + 1] <- row_values[j] %>% html_text()
    }
  }
}
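
The same row-by-row idea can also be sketched without pre-allocating the data frame, which may be easier to adapt when rows differ in width. This is only an illustration on a hypothetical inline fragment, not on the live page. The markup, the labels and the numbers are made-up placeholders, and the class names are assumed to match the selectors used above.

```r
library(rvest)

# Hypothetical fragment imitating the row/cell layout targeted above.
# All labels and numbers are placeholders, not real figures.
page <- read_html('
  <div class="D(tbr)">
    <div class="Ta(start)"><span>Breakdown</span></div>
    <div class="Ta(end)"><span>2019</span></div>
    <div class="Ta(end)"><span>2018</span></div>
  </div>
  <div class="D(tbr)">
    <div class="Ta(start)"><span>Total Revenue</span></div>
    <div class="Ta(end)"><span>2</span></div>
    <div class="Ta(end)"><span>1</span></div>
  </div>')

rows <- page %>% html_nodes("div.D\\(tbr\\)")

# One character vector per row: the label first, then the values.
cells <- lapply(rows, function(row) {
  label  <- row %>% html_nodes("div.Ta\\(start\\)") %>% html_text(trim = TRUE)
  values <- row %>% html_nodes("div.Ta\\(end\\)") %>% html_text(trim = TRUE)
  c(label[1], values)
})

# Pad shorter rows with NA so rows of different widths still bind together.
width  <- max(lengths(cells))
padded <- lapply(cells, function(x) c(x, rep(NA_character_, width - length(x))))
df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
```

Here df has one row per D(tbr) div, with the label in the first column. Whether the live page still uses these class names has to be checked against its current HTML.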
