Problem Description
I have to download multiple xlsx files of a country's census data from the internet using R. The files are located at this link. The problems are:
- I am unable to write a loop that will let me go back and forth to download each file
- The files being downloaded have odd names rather than the district names, so how can I change them to the district names dynamically?
I have used the code below:
url<-"http://www.censusindia.gov.in/2011census/HLO/HL_PCA/HH_PCA1/HLPCA-28532-2011_H14_census.xlsx" download.file(url, "HLPCA-28532-2011_H14_census.xlsx", mode="wb")
But this downloads one file at a time and doesn't change the file name.
Thanks in advance.
Solution

Assuming you want all the data without knowing all of the URLs, your task involves web parsing. The httr package provides useful functions for retrieving the HTML code of a given website, which you can then parse for links.
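To make that retrieve-and-parse step concrete on its own, here is a minimal, hedged sketch: fetch one page with GET(), pull the raw HTML out with content(), and extract the .xlsx hrefs with a regular expression. The page URL is the listing page used in the full answer below and is assumed to still resolve; treat it as a placeholder if the site layout has changed.

library(httr)

# Sketch only: retrieve one listing page and collect the .xlsx links on it.
# The URL is taken from the census site referenced in the question/answer.
page_url <- "http://www.censusindia.gov.in/2011census/HLO/HL_PCA/Houselisting-housing-HLPCA.html"
resp <- GET(page_url)
html <- content(resp, "text")

# Pull out every href="..." attribute, strip the wrapper, keep .xlsx targets.
hrefs <- regmatches(html, gregexpr('href\\s*=\\s*"[^"]+"', html))[[1]]
hrefs <- gsub('^href\\s*=\\s*"|"$', "", hrefs)
xlsx_links <- hrefs[grepl("\\.xlsx$", hrefs)]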
Maybe this bit of code is what you're looking for:
library(httr)

base_url = "http://www.censusindia.gov.in/2011census/HLO/"  # main website
r <- GET(paste0(base_url, "HL_PCA/Houselisting-housing-HLPCA.html"))
rc = content(r, "text")
rcl = unlist(strsplit(rc, "<a href =\\\""))                        # split on anchor tags
rcl = rcl[grepl("Houselisting-housing-.+?\\.html", rcl)]           # keep links to houselisting pages
names = gsub("^.+?>(.+?)</.+$", "\\1", rcl)                        # get region names
names = gsub("^\\s+|\\s+$", "", names)                             # trim names
links = gsub("^(Houselisting-housing-.+?\\.html).+$", "\\1", rcl)  # get links

# iterate over regions
for (i in 1:length(links)) {
  url_hh = paste0(base_url, "HL_PCA/", links[i])
  if (!url_success(url_hh)) next
  r <- GET(url_hh)
  rc = content(r, "text")
  rcl = unlist(strsplit(rc, "<a href =\\\""))              # split on anchor tags
  rcl = rcl[grepl(".xlsx", rcl)]                           # keep links to .xlsx files
  hh_names = gsub("^.+?>(.+?)</.+$", "\\1", rcl)           # get subregion names
  hh_names = gsub("^\\s+|\\s+$", "", hh_names)             # trim names
  hh_links = gsub("^(.+?\\.xlsx).+$", "\\1", rcl)          # get links

  # iterate over subregions
  for (j in 1:length(hh_links)) {
    url_xlsx = paste0(base_url, "HL_PCA/", hh_links[j])
    if (!url_success(url_xlsx)) next
    filename = paste0(names[i], "_", hh_names[j], ".xlsx")  # build district-based file name
    download.file(url_xlsx, filename, mode = "wb")
  }
}
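One caveat worth flagging: url_success() was deprecated in later httr releases and may not be available in a current installation. A minimal sketch of a drop-in shim, assuming http_error() behaves as documented (it accepts a plain URL and reports TRUE for 4xx/5xx responses):

library(httr)

# Sketch only: recreate the deprecated url_success() used in the loops above.
# http_error() issues a HEAD request when given a URL string, so negating it
# gives the same "is this link reachable?" check.
url_success <- function(u) !http_error(u)

Define this helper before running the loops and the rest of the answer should work unchanged on newer httr versions. The dynamic renaming asked about in the question is handled by the filename = paste0(names[i], "_", hh_names[j], ".xlsx") line, which builds each destination file name from the parsed region and subregion names.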