Question
How can I scrape HTML data from 70 pages? I was looking at this question, but I am stuck at the function in the general-method section.
# attempt
library(purrr)
library(rvest)

url_base <- "https://secure.capitalbikeshare.com/profile/trips/QNURCMF2Q6"

map_df(1:70, function(i) {
  cat(".")
  pg <- read_html(sprintf(url_base, i))
  data.frame(
    startd   = html_text(html_nodes(pg, ".ed-table__col_trip-start-date")),
    endd     = html_text(html_nodes(pg, ".ed-table__col_trip-end-date")),
    duration = html_text(html_nodes(pg, ".ed-table__col_trip-duration"))
  )
}) -> table
# attempt 2 (with just one data column)
url_base <- "https://secure.capitalbikeshare.com/profile/trips/QNURCMF2Q6"

map_df(1:70, function(i) {
  page %>% html_nodes(".ed-table__item_odd") %>% html_text()
}) -> table
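Note that, as written, both attempts request the same page 70 times: `url_base` contains no format placeholder, so `sprintf(url_base, i)` ignores `i` entirely, and in the second attempt `page` is never defined. A minimal sketch of the general method, assuming (hypothetically) that the site paginates via a `?pageNumber=` query parameter — you would need to confirm the real parameter by inspecting the site's paging links:

```r
library(purrr)   # map_df
library(rvest)   # read_html, html_nodes, html_text

# Hypothetical URL template; "%d" is where the page number goes.  Inspect the
# site's paging links to find the real parameter name.
url_base <- "https://secure.capitalbikeshare.com/profile/trips/QNURCMF2Q6?pageNumber=%d"

# One URL per page: sprintf() substitutes each page number for %d.
urls <- sprintf(url_base, 1:70)

# Scrape one page into a data frame (needs network access and, on this site,
# an authenticated session -- see the answer below for the login step).
scrape_page <- function(u) {
  pg <- read_html(u)
  data.frame(
    startd   = html_text(html_nodes(pg, ".ed-table__col_trip-start-date")),
    endd     = html_text(html_nodes(pg, ".ed-table__col_trip-end-date")),
    duration = html_text(html_nodes(pg, ".ed-table__col_trip-duration")),
    stringsAsFactors = FALSE
  )
}

# Row-bind all 70 pages (commented out: requires network + login):
# table <- map_df(urls, scrape_page)
```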
Answer
@jso1226, I am not sure what was going on in the answer you referenced, so I am providing an example of a task very similar to what you want to do.
Which is: go to a web page, collect information, add it to a data frame, and then move to the next page.
I created this code to track the answers I have posted here on Stack Overflow:
library(rvest)

login <- "https://stackoverflow.com/users/login?ssrc=head&returnurl=http%3a%2f%2fstackoverflow.com%2f"

# Log in once and reuse the session for every page request
pgsession   <- html_session(login)
pgform      <- html_form(pgsession)[[2]]
filled_form <- set_values(pgform, email = "*****", password = "*****")
submit_form(pgsession, filled_form)

# Accumulator for the final results data frame
results <- data.frame()

for (i in 1:5) {
  url  <- "http://stackoverflow.com/users/**********?tab=answers&sort=activity&page="
  url  <- paste0(url, i)
  page <- jump_to(pgsession, url)

  # Collect answer votes and question title
  summary  <- html_nodes(page, "div .answer-summary")
  question <- matrix(html_text(html_nodes(summary, "div"), trim = TRUE),
                     ncol = 2, byrow = TRUE)

  # Find date answered, hyperlink, and whether the answer was accepted
  dateans   <- html_node(summary, "span") %>% html_attr("title")
  hyperlink <- html_node(summary, "div a") %>% html_attr("href")
  accepted  <- html_node(summary, "div") %>% html_attr("class")

  # Create temp results, then bind to the final results
  rtemp   <- cbind(question, dateans, accepted, hyperlink)
  results <- rbind(results, rtemp)
}

# Data frame clean-up
names(results) <- c("Votes", "Answer", "Date", "Accepted", "HyperLink")
results$Votes    <- as.integer(as.character(results$Votes))
results$Accepted <- ifelse(results$Accepted == "answer-votes default", 0, 1)
The loop in this case is limited to only 5 pages; this needs to change to fit your application. I replaced the user-specific values with ******. Hopefully this provides some guidance for your problem.
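To adapt the pattern to the bike-share question, only the URL and the CSS selectors change. A rough sketch, assuming (hypothetically) a `?pageNumber=` query parameter and that an authenticated session `pgsession` has already been established as in the answer — both assumptions you would verify against the actual site:

```r
# Assumed pagination scheme -- inspect the site's "next page" links to confirm.
url_base <- "https://secure.capitalbikeshare.com/profile/trips/QNURCMF2Q6?pageNumber="

results <- data.frame()

for (i in 1:70) {
  url <- paste0(url_base, i)

  # Requires an authenticated rvest session (see the login code above):
  # page  <- jump_to(pgsession, url)
  # rtemp <- data.frame(
  #   startd   = html_text(html_nodes(page, ".ed-table__col_trip-start-date")),
  #   endd     = html_text(html_nodes(page, ".ed-table__col_trip-end-date")),
  #   duration = html_text(html_nodes(page, ".ed-table__col_trip-duration"))
  # )
  # results <- rbind(results, rtemp)
}
```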