Problem description
I'm trying to grab some election results from Politico's website using rvest.
http://www.politico.com/2016-election/results/map/president/wisconsin/
I couldn't pull all the data on the page at once, so I went for a county-level approach. Each county has a unique CSS selector (e.g., Adams County's is '#countyAdams .results-table'). So I grabbed all the county names from elsewhere and set up a quick loop (yes, I know loops are bad practice in R, but I anticipated this method taking me about 3 minutes).
# Grab the URL
library(rvest)
wiscoSixteen <- read_html("http://www.politico.com/2016-election/results/map/president/wisconsin")
# Create an empty data.frame (and no, I didn't pre-define the columns)
stateDf <- NULL
# Get the list of counties (this isn't complete, but we don't need all 72 counties to reach the point where the routine breaks)
wiscoCounties <- c("Adams", "Ashland", "Barron", "Bayfield", "Brown", "Buffalo", "Burnett", "Calumet", "Chippewa", "Clark", "Columbia", "Crawford", "Dane", "Dodge", "Door", "Douglas", "Dunn", "Eau Claire", "Florence", "Fond du Lac", "Forest", "Grant", "Green", "Green Lake", "Iowa", "Iron", "Jackson", "Jefferson", "Juneau")
My 'for' loop:
for (i in seq_along(wiscoCounties)) {
  # Pull out the i'th county name and paste it into the selector string
  wiscoResult <- wiscoSixteen %>%
    html_node(paste0("#county", wiscoCounties[i], " .results-table")) %>%
    html_table()
  # Add a column for the county name so I can ID it later
  wiscoResult[, 4] <- wiscoCounties[i]
  # Then rbind
  stateDf <- rbind(stateDf, wiscoResult)
}
When it gets through the 10th county, it stops and returns 'Error: No matches'.
I can't find anything unique about 'Columbia', the 11th county, and I'm at a loss for what's happening. I'm sure it's something stupid, as that's usually the case. Any help is appreciated.
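One way to narrow this kind of failure down is to check which selectors match nothing before calling html_table(). The sketch below is only an illustration: it runs against a tiny made-up HTML fragment rather than the Politico page, and the helper name find_unmatched is hypothetical. It relies on html_nodes() returning an empty node set, rather than raising an error, when a selector has no match.

```r
library(rvest)

# Toy stand-in for the real page (an assumption for illustration only):
# two counties have a results table, "Columbia" does not.
page <- read_html('
  <div id="countyAdams"><table class="results-table"><tr><td>a</td></tr></table></div>
  <div id="countyClark"><table class="results-table"><tr><td>b</td></tr></table></div>
')

# Hypothetical helper: return the county names whose selector matches no node.
find_unmatched <- function(doc, counties) {
  selectors <- paste0("#county", counties, " .results-table")
  n_matches <- vapply(selectors, function(s) length(html_nodes(doc, s)), integer(1))
  counties[n_matches == 0]
}

unmatched <- find_unmatched(page, c("Adams", "Clark", "Columbia"))
print(unmatched)
```

Run against the real page object, the same check would list every county whose table is missing from the downloaded HTML. Separately, names containing a space (such as "Eau Claire") produce a selector like '#countyEau Claire .results-table', which CSS reads as an element named Claire nested inside #countyEau, so those would need whatever id the page actually uses.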
Recommended answer
So, why not just use the XHR request that ends up populating those tables? I'm kinda surprised you're getting any data at all from them, since they get generated from a separate data request:
library(httr)
library(stringi)
library(purrr)
library(dplyr)

# Fetch the raw data file that the page loads in the background
res <- GET("http://s3.amazonaws.com/origin-east-elections.politico.com/mapdata/2016/WI_20161108.xml")
dat <- readLines(textConnection(content(res, as="text")))

# Line 2 is the candidate list: "|"-separated entries of three ";"-separated fields.
# Judging by the output below, the surname comes first, so the "first" and "last" labels look swapped.
stri_split_fixed(dat[2], "|")[[1]] %>%
  stri_replace_last_fixed(";", "") %>%
  stri_split_fixed(";", 3) %>%
  map_df(~setNames(as.list(.), c("rep_id", "first", "last"))) -> candidates

# Presidential rows start with "WI;P;G": the state total first, then one row per county
dat[stri_detect_regex(dat, "^WI;P;G")] %>%
  stri_replace_first_regex("^WI;P;G;", "") %>%
  map_df(function(x) {
    # Area fields come before "||", per-candidate results after it
    county_results <- stri_split_fixed(x, "||", 2)[[1]]
    stri_replace_last_fixed(county_results[1], ";;", "") %>%
      stri_split_fixed(";") %>%
      map_df(~setNames(as.list(.), c("fips", "name", "x1", "reporting", "x2", "x3", "x4"))) -> county_prefix
    stri_split_fixed(county_results[2], "|")[[1]] %>%
      stri_split_fixed(";") %>%
      map_df(~setNames(as.list(.), c("rep_id", "party", "count", "pct", "x5", "x6", "x7", "x8", "candidate_idx"))) %>%
      left_join(candidates, by="rep_id") -> df
    # Attach the area identifiers to every candidate row
    df$fips <- county_prefix$fips
    df$name <- county_prefix$name
    df$reporting <- county_prefix$reporting
    # Drop the fields whose meaning is unknown (x1 to x8)
    select(df, -starts_with("x"))
  }) -> results
It seems to be the full data:
glimpse(results)
## Observations: 511
## Variables: 10
## $ rep_id <chr> "WI270631108", "WI270621108", "WI270691108", "WI270711108", "WI270701108", "WI270731108", "WI270721108",...
## $ party <chr> "Dem", "GOP", "Lib", "CST", "ADP", "WW", "Grn", "Dem", "GOP", "Lib", "CST", "ADP", "WW", "Grn", "Dem", "...
## $ count <chr> "1382210", "1409467", "106442", "12179", "1561", "1781", "30980", "3780", "5983", "207", "44", "4", "9",...
## $ pct <chr> "46.9", "47.9", "3.6", "0.4", "0.1", "0.1", "1.1", "37.4", "59.2", "2.0", "0.4", "0.0", "0.1", "0.8", "5...
## $ candidate_idx <chr> "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7",...
## $ first <chr> "Clinton", "Trump", "Johnson", "Castle", "De La Fuente", "Moorehead", "Stein", "Clinton", "Trump", "John...
## $ last <chr> "Hillary", "Donald", "Gary", "Darrell", "Rocky", "Monica", "Jill", "Hillary", "Donald", "Gary", "Darrell...
## $ fips <chr> "0", "0", "0", "0", "0", "0", "0", "55001", "55001", "55001", "55001", "55001", "55001", "55001", "55003...
## $ name <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Adams", "Ada...
## $ reporting <chr> "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100....
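Every column in that output is character, so the counts and percentages need converting before any arithmetic. A minimal base R sketch, assuming a hand-copied three-row subset of the statewide rows above (the names results_sample and margin are hypothetical and only used for illustration):

```r
# Hand-copied subset of the output above: statewide rows for three parties.
results_sample <- data.frame(
  party = c("Dem", "GOP", "Lib"),
  count = c("1382210", "1409467", "106442"),
  pct   = c("46.9", "47.9", "3.6"),
  name  = "Wisconsin",
  stringsAsFactors = FALSE
)

# Convert the character columns to numeric types.
results_sample$count <- as.integer(results_sample$count)
results_sample$pct   <- as.numeric(results_sample$pct)

# With numeric columns, arithmetic such as a vote margin becomes possible.
margin <- with(results_sample, count[party == "GOP"] - count[party == "Dem"])

str(results_sample)
print(margin)
```

The same two conversions should apply unchanged to the full results data frame, assuming its count and pct columns contain only digits and decimal points as they do in the rows shown.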
Despite the ".xml" extension on the URL, it's not XML data. I also don't know what some of the fields actually are (the ones I named x1 through x8 and dropped), but you can dig into that. Also, there's a whole other section of data:
WI;S;G;0;Wisconsin;X;100.0;X;;50885;;||WI269201108;Dem;1380496;46.8;;X;;;1|WI267231108;GOP;1479262;50.2;X;X;X;;2|WI270541108;Lib;87291;3.0;;X;;;3
WI;S;G;55001;Adams;X;100.0;X;;50885;;||WI269201108;Dem;4093;41.2;;X;;;1|WI267231108;GOP;5346;53.9;X;X;X;;2|WI270541108;Lib;486;4.9;;X;;;3
WI;S;G;55003;Ashland;X;100.0;X;;50885;;||WI269201108;Dem;4349;55.1;;X;;;1|WI267231108;GOP;3337;42.2;X;X;X;;2|WI270541108;Lib;214;2.7;;X;;;3
WI;S;G;55005;Barron;X;100.0;X;;50885;;||WI269201108;Dem;8691;38.8;;X;;;1|WI267231108;GOP;12863;57.4;X;X;X;;2|WI270541108;Lib;853;3.8;;X;;;3
WI;S;G;55007;Bayfield;X;100.0;X;;50885;;||WI269201108;Dem;5161;54.6;;X;;;1|WI267231108;GOP;4022;42.6;X;X;X;;2|WI270541108;Lib;263;2.8;;X;;;3
WI;S;G;55009;Brown;X;100.0;X;;50885;;||WI269201108;Dem;51004;40.0;;X;;;1|WI267231108;GOP;71750;56.3;X;X;X;;2|WI270541108;Lib;4615;3.6;;X;;;3
WI;S;G;55011;Buffalo;X;100.0;X;;50885;;||WI269201108;Dem;2746;39.9;;X;;;1|WI267231108;GOP;3850;56.0;X;X;X;;2|WI270541108;Lib;285;4.1;;X;;;3
WI;S;G;55013;Burnett;X;100.0;X;;50885;;||WI269201108;Dem;3143;37.4;;X;;;1|WI267231108;GOP;4998;59.5;X;X;X;;2|WI270541108;Lib;258;3.1;;X;;;3
which obviously means something for that page (it's kinda obvious what, but I'm so weary from the election that I'm kinda done with the data), and you can process it in a similar fashion to the code above.
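As a rough sketch of what "similar fashion" could look like for the "WI;S;G" rows, here is a base R version that parses one line copied from the sample above. The function name parse_record is hypothetical, and the field meanings (FIPS code, area name, percent reporting, then id, party, count and percent per candidate) are an assumption carried over from the presidential rows, not something the feed documents.

```r
# One line copied from the "WI;S;G" section shown above.
line <- "WI;S;G;55001;Adams;X;100.0;X;;50885;;||WI269201108;Dem;4093;41.2;;X;;;1|WI267231108;GOP;5346;53.9;X;X;X;;2|WI270541108;Lib;486;4.9;;X;;;3"

# Hypothetical helper: turn one record into a data frame with one row per candidate.
parse_record <- function(x) {
  # Area fields come before "||", candidate entries after it.
  halves <- strsplit(x, "||", fixed = TRUE)[[1]]
  prefix <- strsplit(halves[1], ";", fixed = TRUE)[[1]]
  # Candidate entries are separated by "|", their fields by ";".
  rows <- strsplit(strsplit(halves[2], "|", fixed = TRUE)[[1]], ";", fixed = TRUE)
  field <- function(i) vapply(rows, `[`, character(1), i)
  data.frame(
    fips      = prefix[4],              # assumed: county FIPS code
    name      = prefix[5],              # assumed: area name
    reporting = as.numeric(prefix[7]),  # assumed: percent reporting
    rep_id    = field(1),
    party     = field(2),
    count     = as.integer(field(3)),
    pct       = as.numeric(field(4)),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_record(line)
print(parsed)
```

Applying parse_record to every line that starts with "WI;S;G" and binding the results with do.call(rbind, ...) would give a table shaped like the presidential one, minus the candidate names.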