Problem description
I'm fairly new to R and have been working on a project (just for fun) to help me learn, but I've run into something I can't find answers for online. I'm trying to teach myself to scrape websites for data, and I've started with the code below, which retrieves some data from 247Sports.
library(rvest)
library(stringr)
link <- "https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank"
link.scrap <- read_html(link)
data <-
  html_nodes(x = link.scrap,
             css = '#page-content > div.main-div.clearfix > section.list-page > section > section > ul.content-list.ri-list > li:nth-child(3)') %>%
  html_text(trim = TRUE) %>%
  trimws()
When I view the data, it appears to be a character vector of length 1, with multiple list items stored as one value. The problem I'm running into is separating these into their respective columns. For example, when I run the code below, which I think should split the data at ")" and then remove the whitespace from both of the resulting values, I get a weird result.
f<-strsplit(data,")")
str_trim(f)
[1] "c(\"Ray Lima El Camino College (Torrance, CA\", \" DT 6-3 310 0.8681 39 4 9 Enrolled 1/9/2017\")"
I have tried a few other things, but with no success. So I guess my question is: what would be the best way to take data from this HTML list and get it into a format where every data point has its own column (i.e. name, college, position, stats, etc.)?
Recommended answer
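A note on the odd result from str_trim(f) in the question, as a minimal sketch: strsplit() returns a list (one element per input string), and coercing a list to character deparses each element, which would explain the c("...") text. Indexing into the list first avoids that. The sample string below is a whitespace-collapsed stand-in based on the output shown in the question, not real scraped text.

```r
# Stand-in for the scraped value (assumption: whitespace already collapsed)
data <- "Ray Lima El Camino College (Torrance, CA) DT 6-3 310 0.8681 39 4 9 Enrolled 1/9/2017"

# strsplit() returns a list with one element per input string
f <- strsplit(data, ")", fixed = TRUE)

# Coercing the whole list to character deparses it into a single string
coerced <- as.character(f)

# Pull out the character vector first, then trim each piece
pieces <- trimws(f[[1]])
print(pieces)
```

Using fixed = TRUE treats ")" as a literal character rather than a regular expression, which sidesteps any regex escaping questions.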
I've modified a couple of things in your code.
Taken a generic approach to the CSS selectors (matching by class rather than one specific list item), which makes it possible to extract all of the rows.
Collected the individual columns as vectors and then built a data frame from them.
Please check:
library(rvest)
library(stringr)
library(tidyr)
link <- "https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank"
link.scrap <- read_html(link)
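# Pull each column as a character vector using class-based selectors
names <- link.scrap %>% html_nodes('div.name') %>% html_text()
pos <- link.scrap %>% html_nodes('ul.metrics-list') %>% html_text()
status <- link.scrap %>% html_nodes('div.right-content.right') %>% html_text()
# Combine the vectors into a data frame, keeping strings as character
data <- data.frame(names, pos, status, stringsAsFactors = FALSE)
# Drop the first row (it appears to be the list header, not a player)
data <- data[-1, ]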
head(data)
   names                                              pos          status
2  Kamilo Tongamoa Merced College (Merced, CA)        DT 6-5 320   Enrolled 8/24/2017
3  Ray Lima El Camino College (Torrance, CA)          DT 6-3 310   Enrolled 1/9/2017
4  O'Rien Vance George Washington (Cedar Rapids, IA)  OLB 6-3 235  Enrolled 6/12/2017
5  Matt Leo Arizona Western College (Yuma, AZ)        WDE 6-7 265  Enrolled 2/22/2017
6  Keontae Jones Colerain (Cincinnati, OH)            S 6-1 175    Enrolled 6/12/2017
7  Cordarrius Bailey Clarksdale (Clarksdale, MS)      WDE 6-4 210  Enrolled 6/12/2017
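To get closer to one data point per column, the pos and status strings can be split further. The following is a rough sketch in base R, not part of the original answer. It uses sample values copied from the output above, and the column names position, height, weight and enrolled are made up for illustration. It also assumes every pos value has exactly three whitespace-separated pieces; real scraped text may need extra cleanup first.

```r
# Sample values copied from the printed output above
# (assumption: whitespace is already collapsed to single spaces)
data <- data.frame(
  pos    = c("DT 6-5 320", "OLB 6-3 235", "S 6-1 175"),
  status = c("Enrolled 8/24/2017", "Enrolled 6/12/2017", "Enrolled 6/12/2017"),
  stringsAsFactors = FALSE
)

# Split each pos string on runs of whitespace and stack the pieces into a matrix
parts <- do.call(rbind, strsplit(trimws(data$pos), "\\s+"))

# Hypothetical column names for the three pieces
data$position <- parts[, 1]
data$height   <- parts[, 2]
data$weight   <- as.integer(parts[, 3])

# Drop the leading word from status and parse the remaining m/d/Y date
data$enrolled <- as.Date(sub("^\\S+\\s+", "", data$status), format = "%m/%d/%Y")

print(data[, c("position", "height", "weight", "enrolled")])
```

Since tidyr is already loaded in the answer, tidyr::separate() on the pos column would be another way to do the same split.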