Question
I'm using rvest to scrape data from HTML tables on an internal website. The row colors are meaningful, so I want to extract each row's BGCOLOR attribute as a column in my final table, but html_table() only extracts the cell contents.
Here's what I have so far, followed by a snippet of the HTML table. How can I include a column for the color?
html_nodes(samplepage,"table")
tbl_content <- samplepage %>%
    html_nodes("table") %>%
    html_table(fill = TRUE, trim = TRUE)
tbl_content
<tr BGCOLOR = "#F8C0E0">
<td> BASOPHILS <td> microl <td> 0.477 <td> 0.425 <td align="center"> 0.052 <td align="center"> 1.920 <td align="center"> 51.5 <td align="center"> 32
</tr>
<tr BGCOLOR = "#F8F0B0">
<td> CALCIUM <td > mg/dl <td> 12.2 <td> 1.7 <td align="center"> 7.6 <td align="center"> 14.9 <td align="center"> 71 <td align="center"> 33
</tr>
Recommended answer
You can build your own parser to replace html_table(). purrr::map_df() is handy for iterating over nodes (the tr elements in this case) and combining the results into a data.frame:
library(rvest)
library(tidyverse)
html <- '<tr BGCOLOR = "#F8C0E0">
<td> BASOPHILS <td> microl <td> 0.477 <td> 0.425 <td align="center"> 0.052 <td align="center"> 1.920 <td align="center"> 51.5 <td align="center"> 32
</tr>
<tr BGCOLOR = "#F8F0B0">
<td> CALCIUM <td > mg/dl <td> 12.2 <td> 1.7 <td align="center"> 7.6 <td align="center"> 14.9 <td align="center"> 71 <td align="center"> 33
</tr>'
parsed_df <- html %>%
    read_html() %>%
    html_nodes('tr') %>%
    map_df(~bind_cols(tibble(bgcolor = html_attr(.x, 'bgcolor')),    # grab attribute
                      # extract each row's values to a 1-row tibble
                      html_nodes(.x, 'td') %>%
                          html_text(trim = TRUE) %>%
                          set_names(paste0('x', seq_along(.))) %>%
                          as.list() %>%    # or `t() %>% as_tibble()`
                          as_tibble())) %>%
    type_convert()    # clean up types
parsed_df
#> # A tibble: 2 x 9
#> bgcolor x1 x2 x3 x4 x5 x6 x7 x8
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 #F8C0E0 BASOPHILS microl 0.477 0.425 0.052 1.92 51.5 32
#> 2 #F8F0B0 CALCIUM mg/dl 12.200 1.700 7.600 14.90 71.0 33
More simply but less flexibly, you can just pull out the attribute and then bind it to the results of html_table():
paste('<table>', html, '</table>') %>%    # `html_table` needs a <table> tag
    read_html() %>%
    {
        data.frame(bgcolor = html_nodes(., 'tr') %>% html_attr('bgcolor'),
                   html_table(.))
    }
#> bgcolor X1 X2 X3 X4 X5 X6 X7 X8
#> 1 #F8C0E0 BASOPHILS microl 0.477 0.425 0.052 1.92 51.5 32
#> 2 #F8F0B0 CALCIUM mg/dl 12.200 1.700 7.600 14.90 71.0 33
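To apply the second approach to a whole page such as the question's samplepage, which may contain several tables, something like the following might work. It is only a sketch under assumptions: tables_with_bgcolor is a hypothetical helper (not part of rvest), and it assumes that every tr in a table becomes exactly one row of html_table() output, which would not hold if the table has a header row built from th cells or contains nested tables.

```r
library(rvest)

# Hypothetical helper (not an rvest function): returns one data.frame per
# <table> on the page, with each row's bgcolor attribute as the first column.
tables_with_bgcolor <- function(page) {
  tables <- html_nodes(page, "table")
  lapply(tables, function(tbl) {
    rows    <- html_nodes(tbl, "tr")
    colors  <- html_attr(rows, "bgcolor")   # NA for rows without the attribute
    content <- html_table(tbl)              # cell contents only
    # Assumption: one <tr> per output row; stop early if that does not hold
    # (for example, when a <th> header row was turned into column names).
    stopifnot(length(colors) == nrow(content))
    data.frame(bgcolor = colors, content, stringsAsFactors = FALSE)
  })
}

# Small self-contained demo with two colored rows
demo_page <- read_html(
  '<table>
     <tr bgcolor="#F8C0E0"><td>BASOPHILS</td><td>0.477</td></tr>
     <tr bgcolor="#F8F0B0"><td>CALCIUM</td><td>12.2</td></tr>
   </table>'
)
demo_tables <- tables_with_bgcolor(demo_page)
demo_tables[[1]]
```

With the real page, the call would be tables_with_bgcolor(samplepage). The explicit row-count check is a deliberate choice: it fails loudly instead of silently pairing colors with the wrong rows.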