Problem Description
How can I scrape the PDF documents from an HTML page? I am using R, and so far I can only extract the text from the HTML. An example of the website I want to scrape is as follows.
Thank you.
Recommended Answer
When you say you want to scrape the PDF files from HTML pages, I think the first problem you face is to actually identify the location of those PDF files.
library(XML)
library(RCurl)
url <- "https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx"
page <- getURL(url)
parsed <- htmlParse(page)
# collect the href attribute of every <a> tag, then keep only the links ending in .pdf
links <- xpathSApply(parsed, path="//a", xmlGetAttr, "href")
inds <- grep("\\.pdf$", links)
links <- links[inds]
links contains all the URLs to the PDF files you are trying to download.
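Depending on how the page writes its href attributes, some of these links may be relative paths rather than full URLs. A minimal sketch (my own addition, assuming https://www.bot.or.th is the site root) that turns relative paths into absolute URLs before downloading:

# assumption: relative hrefs are rooted at the site's domain
base_url <- "https://www.bot.or.th"
links <- ifelse(grepl("^http", links), links, paste0(base_url, links))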
Beware: many websites don't like it very much when you automatically scrape their documents, and you may get blocked.
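If you do scrape, it is also common courtesy to identify your client. A hedged sketch that sets a user-agent string through RCurl's curl options when fetching the page (the string itself is just an example):

# example only: pass a user-agent string as a curl option
page <- getURL(url, useragent = "R/RCurl - personal research script")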
With the links in place, you can start looping through them, downloading the files one by one and saving each one in your working directory under the name destination. I decided to extract reasonable document names for your PDFs from the links (taking the final piece after the last / in each URL):
# use everything after the last "/" in each URL as the file name
regex_match <- regexpr("[^/]+$", links, perl=TRUE)
destination <- regmatches(links, regex_match)
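For what it is worth, base R's basename() also strips everything up to the last /, so the same file names could be produced in one line (an equivalent alternative, not the approach used above):

# equivalent: basename() keeps only the part after the last "/"
destination <- basename(links)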
To avoid overloading the servers of the website, I have heard it is friendly to pause your scraping every once in a while, so I use Sys.sleep() to pause for a random one to five seconds between downloads:
for(i in seq_along(links)){
  # mode="wb" keeps the PDFs intact on Windows (binary transfer)
  download.file(links[i], destfile=destination[i], mode="wb")
  Sys.sleep(runif(1, 1, 5))  # pause for 1-5 seconds between downloads
}
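If a single download fails (a dead link, a timeout), download.file() throws an error and the loop above stops. A sketch of a more forgiving variant that wraps each download in tryCatch() so the remaining files are still fetched (the skip message is my own wording):

for(i in seq_along(links)){
  tryCatch(
    download.file(links[i], destfile=destination[i], mode="wb"),
    error = function(e) message("Skipping ", links[i], ": ", conditionMessage(e))
  )
  Sys.sleep(runif(1, 1, 5))
}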