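I'm trying to get the data from the "Team Statistics" table on this page: https://www.hockey-reference.com/teams/CGY/2010.html
I don't have much experience with web scraping. I tried a few things with the XML package and am now trying the rvest package: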
library(rvest)
url <- read_html("https://www.hockey-reference.com/teams/CGY/2010.html")
url %>%
html_node(xpath = "//*[@id='team_stats']")
I end up with what looks like a single node:
{xml_node}
<table class="sortable stats_table" id="team_stats" data-cols-to-freeze="1">
[1] <caption>Team Statistics Table</caption>
[2] <colgroup>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\ ...
[3] <thead><tr>\n<th aria-label="Team" data-stat="team_name" sco ...
[4] <tbody>\n<tr>\n<th scope="row" class="left " data-stat="team ...
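How do I parse this to get the header and the data from the two-row table?
Best answer
html_node returns the <table> node itself. You just need to add html_table() to the end of the chain to convert it into a data frame: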
library(rvest)
url <- read_html("https://www.hockey-reference.com/teams/CGY/2010.html")
url %>%
html_node(xpath = "//*[@id='team_stats']") %>%
html_table()
Alternatively, parse every table on the page and take the first one:
library(rvest)
url %>%
html_table() %>%
.[[1]]
Both solutions return:
Team AvAge GP W L OL PTS PTS% GF GA SRS SOS TG/G PP PPO PP% PPA PPOA PK% SH SHA S S% SA SV% PDO
1 Calgary Flames 28.8 82 40 32 10 90 0.549 201 203 -0.03 0.04 5.05 43 268 16.04 54 305 82.30 7 1 2350 8.6 2367 0.916 100.1
2 League Average 27.9 82 41 31 10 92 0.561 233 233 0.00 0.00 5.68 56 304 18.23 56 304 81.77 6 6 2486 9.1 2479 0.911 NA
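To try the same chain without fetching the live page, here is a minimal sketch that runs it on an inline HTML string. The three-column table is a made-up stand-in for the real one (only the Team, GP and W values are copied from the output above), and the `page` and `stats` names are just for illustration:

```r
library(rvest)

# Hypothetical, cut-down copy of the team_stats table; not the real page markup
page <- read_html('<table id="team_stats">
  <thead><tr><th>Team</th><th>GP</th><th>W</th></tr></thead>
  <tbody>
    <tr><th>Calgary Flames</th><td>82</td><td>40</td></tr>
    <tr><th>League Average</th><td>82</td><td>41</td></tr>
  </tbody>
</table>')

# Select the <table> node by id, then convert it to a data frame
stats <- page %>%
  html_node(xpath = "//*[@id='team_stats']") %>%
  html_table()

print(stats)
```

The header row becomes the column names and each <tr> in the body becomes one row, which is the same thing that happens with the full table on the real page.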
Source: the original question on Stack Overflow, https://stackoverflow.com/questions/50146342/