Problem description
I'm having trouble reading the data from the URL https://www.basketball-reference.com/leagues/NBA_2020_totals.html#totals_stats::pts. Here's the code:
library(rvest)
url <- "https://www.basketball-reference.com/leagues/NBA_2020_totals.html#totals_stats::pts"
pagina <- read_html(url, as.data.frame=T, stringsAsFactors = TRUE,
                    encoding = "utf-8")
pagina %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(fill=T) -> x
This reads the table, but I don't know why the result also contains a few rows like these:
Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
54 Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
77 Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
102 Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
133 Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
162 Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
189 Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
218 Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
I get the player rows, but I also get those rows. I don't know whether they are players that aren't being read correctly or just stray rows that appear because I'm doing something wrong in the code. I want either to remove those rows (which, as you can see, appear at irregular positions) or to change the reading code so that I don't get them in the first place.
Thanks in advance.
Alberto
Recommended answer
Those rows are not players: they are the table's header row, which the page repeats at intervals in the table body. You should ignore them and keep only the relevant rows.
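library(rvest)
library(dplyr)

url <- "https://www.basketball-reference.com/leagues/NBA_2020_totals.html"
webpage <- url %>% read_html()

webpage %>%
  html_table() %>%                # list with one data frame per <table>
  .[[1]] %>%                      # take the first table
  filter(!grepl('Rk', Rk)) %>%    # drop the repeated header rows
  type.convert(as.is = TRUE)      # re-infer column types, keeping strings as character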
# Rk Player Pos Age Tm G GS MP FG FGA FG% ...
#1 1 Steven Adams C 26 OKC 58 58 1564 262 443 0.591 ...
#2 2 Bam Adebayo PF 22 MIA 65 65 2235 408 719 0.567 ...
#3 3 LaMarcus Aldridge C 34 SAS 53 53 1754 391 793 0.493 ...
#4 4 Nickeil Alexander-Walker SG 21 NOP 41 0 501 77 227 0.339 ...
#5 5 Grayson Allen SG 24 MEM 30 0 498 79 176 0.449 ...
#6 6 Jarrett Allen C 21 BRK 64 58 1647 267 413 0.646 ...
#7 7 Kadeem Allen SG 27 NYK 10 0 117 19 44 0.432 ...
#8 8 Al-Farouq Aminu PF 29 ORL 18 2 380 25 86 0.291 ...
#9 9 Justin Anderson SF 26 BRK 3 0 17 1 6 0.167 ...
#10 10 Kyle Anderson PF 26 MEM 59 20 1140 138 280 0.493 ...
#11 11 Ryan Anderson PF 31 HOU 2 0 14 2 7 0.286 ...
#...
#...
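To see why the filter and the type conversion belong together, here is a minimal offline sketch in base R. It assumes a hand-built data frame (the name `raw` and the four-column subset are illustrative, not part of the original answer) that mimics what `html_table()` returns when a header row is mixed in with the data: every column comes back as character.

```r
# Toy stand-in for the scraped table: the repeated header row forces
# every column to be read as character. The player values are copied
# from the output above; the column subset is chosen for illustration.
raw <- data.frame(
  Rk     = c("1", "2", "Rk", "3"),
  Player = c("Steven Adams", "Bam Adebayo", "Player", "LaMarcus Aldridge"),
  Age    = c("26", "22", "Age", "34"),
  G      = c("58", "65", "G", "53"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose Rk value is not the literal header text.
players <- raw[raw$Rk != "Rk", ]

# With the header text gone, let R re-infer each column's type.
players <- type.convert(players, as.is = TRUE)
rownames(players) <- NULL

str(players)
```

The exact comparison `raw$Rk != "Rk"` plays the same role here as `!grepl('Rk', Rk)` in the answer. Either works on this column, since the only non-numeric value it can hold is the header label itself.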