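I'm trying (rather unsuccessfully) to scrape some data from a website (www.majidata.go.ke) with R. I've managed to fetch the HTML and parse it, but I'm now unsure how to extract the bits I actually need!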
Using the XML library, I scrape the data with the following code:
library(httr)  # provides GET() and content()
library(XML)   # provides htmlTreeParse()
majidata_get <- GET("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
majidata_html <- htmlTreeParse(content(majidata_get, as="text"))
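This leaves me with a (large) XMLDocumentContent object. The web page has a drop-down list whose values I want to pull out (they correspond to the names and ID numbers of different towns). The bits I want to extract are the number in
<option value ="XXX">
and the upper-case name that follows it.
<div class="regiondata">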
<div id="town_data">
<select id="town" name="town" onchange="town_data(this.value);">
<option value="0" selected="selected">[SELECT TOWN]</option>
<option value="611">AHERO</option>
<option value="635">AKALA</option>
<option value="625">AWASI</option>
<option value="628">AWENDO</option>
<option value="749">BAHATI</option>
<option value="327">BANGALE</option>
Ideally, I'd like these in a data.frame with the number in the first column and the name in the second, e.g.
ID Name
611 AHERO
635 AKALA
625 AWASI
etc.
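I really don't know where to go from here. I had thought about using a regex to match patterns in the text, although I've read on many forums that this is a bad idea and that XPath is better/more efficient. I'm not sure where to begin, though, beyond a feeling that I need to use xpathApply somehow.

Best answer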
The very new rvest package makes quick work of this and lets you use ordinary CSS selectors.
Updated to include the second request (see the comments below)
library(rvest)
library(dplyr)
# gets data from the second popup
# returns a data frame of town_id, town_name, area_id, area_name
addArea <- function(town_id, town_name) {
  # make the AJAX URL and grab the data
  # (read_html() replaces the html() function of early rvest releases)
  url <- sprintf("http://www.majidata.go.ke/ajax-list-area.php?reg=towns&type=projects&id=%s",
                 town_id)
  subunits <- read_html(url)
  # reformat into a data frame with the town data;
  # [-1,] drops the first <option>, which is the placeholder entry
  data.frame(town_id=town_id,
             town_name=town_name,
             area_id=subunits %>% html_nodes("option") %>% html_attr("value"),
             area_name=subunits %>% html_nodes("option") %>% html_text(),
             stringsAsFactors=FALSE)[-1,]
}
# get data from the first popup and put it into a data frame
majidata <- read_html("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
maji <- data.frame(town_id=majidata %>% html_nodes("#town option") %>% html_attr("value"),
                   town_name=majidata %>% html_nodes("#town option") %>% html_text(),
                   stringsAsFactors=FALSE)[-1,]
# pass in the name and id to our addArea function and make the result into
# a data frame with all the data (town and area)
combined <- do.call("rbind.data.frame",
                    mapply(addArea, maji$town_id, maji$town_name,
                           SIMPLIFY=FALSE, USE.NAMES=FALSE))
# row names aren't super-important, but let's keep them tidy
rownames(combined) <- NULL
str(combined)
## 'data.frame': 1964 obs. of 4 variables:
## $ town_id : chr "611" "635" "625" "628" ...
## $ town_name: chr "AHERO" "AKALA" "AWASI" "AWENDO" ...
## $ area_id : chr "60603030101" "60107050201" "60603020101" "61103040101" ...
## $ area_name: chr "AHERO" "AKALA" "AWASI" "ANINDO" ...
head(combined)
## town_id town_name area_id area_name
## 1 611 AHERO 60603030101 AHERO
## 2 635 AKALA 60107050201 AKALA
## 3 625 AWASI 60603020101 AWASI
## 4 628 AWENDO 61103040101 ANINDO
## 5 628 AWENDO 61103050401 SARE
## 6 749 BAHATI 73101010101 BAHATI
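
Since the question asked specifically about XPath, here is a minimal sketch of the same first step (the ID/Name table) using the XML package instead of rvest. It runs against an inline copy of the HTML excerpt from the question rather than the live site, so the `page` string and the names `doc`, `xp` and `towns` are illustrative assumptions, not part of the original code. It also uses htmlParse() rather than htmlTreeParse(), because the XPath functions need an internal document, which htmlTreeParse() only returns when called with useInternalNodes = TRUE.

```r
library(XML)

# Inline copy of the dropdown excerpt from the question (assumed structure),
# so the example does not depend on the site being reachable
page <- '<div class="regiondata"><div id="town_data">
<select id="town" name="town" onchange="town_data(this.value);">
<option value="0" selected="selected">[SELECT TOWN]</option>
<option value="611">AHERO</option>
<option value="635">AKALA</option>
<option value="625">AWASI</option>
</select></div></div>'

# htmlParse() returns an internal document that XPath queries can run against
doc <- htmlParse(page, asText = TRUE)

# XPath for every <option> directly inside <select id="town">
xp <- "//select[@id='town']/option"

towns <- data.frame(
  ID   = xpathSApply(doc, xp, xmlGetAttr, "value"),  # the value="..." attribute
  Name = xpathSApply(doc, xp, xmlValue),             # the text inside the tag
  stringsAsFactors = FALSE
)

# drop the "[SELECT TOWN]" placeholder row and tidy the row names
towns <- towns[towns$ID != "0", ]
rownames(towns) <- NULL
towns
```

For the live page, the same XPath expression should work on htmlParse(content(majidata_get, as="text"), asText = TRUE), assuming the page structure still matches the excerpt.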