I'm using rvest to scrape data from an internal website's HTML tables. The color of the rows is meaningful, so I want to extract the BGCOLOR attribute as a column in my final table, but of course html_table() only extracts the content.
Here's what I have so far. A snippet of the html table is below. How can I include a column for color?
tbl_content <- samplepage %>%
  html_nodes("table") %>%
  html_table(fill = TRUE, trim = TRUE)
<tr BGCOLOR = "#F8C0E0">
<td> BASOPHILS <td> microl <td> 0.477 <td> 0.425 <td align="center"> 0.052 <td align="center"> 1.920 <td align="center"> 51.5 <td align="center"> 32
<tr BGCOLOR = "#F8F0B0">
<td> CALCIUM <td > mg/dl <td> 12.2 <td> 1.7 <td align="center"> 7.6 <td align="center"> 14.9 <td align="center"> 71 <td align="center"> 33
You can build your own parser to replace html_table. purrr::map_df is handy for iterating over nodes (trs in this case) and combining the results into a data.frame:
library(rvest)
library(tidyverse)

html <- '<tr BGCOLOR = "#F8C0E0">
<td> BASOPHILS <td> microl <td> 0.477 <td> 0.425 <td align="center"> 0.052 <td align="center"> 1.920 <td align="center"> 51.5 <td align="center"> 32
<tr BGCOLOR = "#F8F0B0">
<td> CALCIUM <td > mg/dl <td> 12.2 <td> 1.7 <td align="center"> 7.6 <td align="center"> 14.9 <td align="center"> 71 <td align="center"> 33'

parsed_df <- html %>%
  read_html() %>%
  html_nodes('tr') %>%
  map_df(function(row) {
    # extract each row's cell values as a named character vector
    values <- row %>% html_nodes('td') %>% html_text(trim = TRUE)
    names(values) <- paste0('x', seq_along(values))
    # grab the attribute and bind it to the values as a 1-row data frame
    bind_cols(tibble(bgcolor = html_attr(row, 'bgcolor')),
              as_tibble(as.list(values)))
  }) %>%
  type_convert() # clean up types
#> # A tibble: 2 x 9
#> bgcolor x1 x2 x3 x4 x5 x6 x7 x8
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 #F8C0E0 BASOPHILS microl 0.477 0.425 0.052 1.92 51.5 32
#> 2 #F8F0B0 CALCIUM mg/dl 12.200 1.700 7.600 14.90 71.0 33
More simply but less flexibly, you can just pull out the attribute and then merge it to the results of html_table:
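paste('<table>', html, '</table>') %>% # `html_table` needs a <table> tag
  read_html() %>%
  # braces keep `.` from also being passed as the first argument of data.frame()
  {data.frame(bgcolor = html_nodes(., 'tr') %>% html_attr('bgcolor'),
              html_table(.)[[1]])} # first (and only) table in the document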
#> bgcolor X1 X2 X3 X4 X5 X6 X7 X8
#> 1 #F8C0E0 BASOPHILS microl 0.477 0.425 0.052 1.92 51.5 32
#> 2 #F8F0B0 CALCIUM mg/dl 12.200 1.700 7.600 14.90 71.0 33
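To apply the simpler approach to a real page with several tables, something along these lines might work. This is only a sketch: `table_with_bgcolor` is a hypothetical helper name, not part of rvest, and it assumes every `tr` in a table becomes exactly one row of `html_table()`'s output (so no nested tables, and `header = FALSE` so no row is consumed as column names).

```r
library(rvest)

# Hypothetical helper (not an rvest function): takes a single <table> node
# and returns its contents with the row colors prepended as a column.
table_with_bgcolor <- function(table_node) {
  rows   <- html_nodes(table_node, "tr")
  colors <- html_attr(rows, "bgcolor")  # NA for rows without the attribute

  # header = FALSE keeps every <tr> as a data row
  tbl <- html_table(table_node, header = FALSE)

  # Assumption: one <tr> per output row; stop if that does not hold
  stopifnot(length(colors) == nrow(tbl))

  data.frame(bgcolor = colors, tbl, stringsAsFactors = FALSE)
}
```

With the `samplepage` object from the question, `lapply(html_nodes(samplepage, "table"), table_with_bgcolor)` would then give one data frame per table. If a table has a header row, it shows up as the first data row (with its own `bgcolor`, or `NA`) and needs to be promoted to column names by hand.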