Skipping files that raise an error when downloading with download.file in R

Problem description

I have a large number of links to PDF files that I would like to download using download.file in a for loop. My solution works fine, but it stops when it encounters an error (a number of the files do not work). I would like to add a feature to my download.file call that tells R to skip a file if the download yields an error, and to print a message with the names of the pages for which an error was encountered.

I found that tryCatch is likely a good solution in this case, but I am not entirely sure where to place it (I have tried a number of ways, but none of them worked).

Here is my code:

library(rvest)  # supplies read_html(), html_nodes(), html_text(), html_attr()

for (i in seq_along(files)) {

  # Reads the html page and extracts its title
  html <- read_html(files[i])
  reads_name <- html_nodes(html, 'h1')
  name <- trimws(html_text(reads_name))

  # Extracts the pdf link from all links that the webpage contains
  webpagelinks <- html_attr(html_nodes(html, "a"), "href")
  extract_pdf_link <- webpagelinks[grepl("\\.pdf", webpagelinks)]

  # Downloads the pdf file from the pdf link; this is where I get the error
  download.file(extract_pdf_link, destfile = paste0(name, "_paper.pdf"),
                mode = "wb")

  # My attempt at skipping failures (does not work)
  skip_with_message = simpleError('Did not work out')
  tryCatch(print(name), error = function(e) skip_with_message)

}

Any suggestions on how to solve this?

Thanks a lot!

Recommended answer

Put the download.file call inside tryCatch. For example:

files <- c("http://foo.com/bar.pdf", "http://www.orimi.com/pdf-test.pdf", "http://bar.com/foo.pdf")
oldw <- getOption("warn")
options(warn = -1)
for (file in files) {
    tryCatch(download.file(file, tempfile(), mode = "wb", quiet = FALSE),
        error = function(e) print(paste(file, 'did not work out')))
}
options(warn = oldw)

I turn warnings off at the start using options(warn = -1) to suppress extraneous warning messages, and restore the previous setting at the end. This gives output like:

# trying URL 'http://foo.com/bar.pdf'
# [1] "http://foo.com/bar.pdf did not work out"
# trying URL 'http://www.orimi.com/pdf-test.pdf'
# Content type 'application/pdf' length 20597 bytes (20 KB)
# ==================================================
# downloaded 20 KB

# trying URL 'http://bar.com/foo.pdf'
# [1] "http://bar.com/foo.pdf did not work out"
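
To get the list of failed pages the question asks for, the same idea can be wrapped in a small helper. The following is only a sketch under stated assumptions: the names safe_download and download_all, and the downloader parameter (included so the code can be exercised without network access), are hypothetical and do not come from the original answer. It also uses suppressWarnings around the single call instead of changing the global warn option, which is a substitution for the approach above rather than the answerer's method.

```r
# Hypothetical helper: TRUE if the download succeeded, FALSE otherwise.
# `downloader` is an illustrative parameter; it defaults to download.file.
safe_download <- function(url, destfile, downloader = download.file) {
  tryCatch({
    # suppressWarnings() hides warnings from this call only, so the
    # global `warn` option does not need to be changed and restored
    status <- suppressWarnings(
      downloader(url, destfile, mode = "wb", quiet = TRUE)
    )
    # download.file returns 0 on success and non-zero on failure
    status == 0
  }, error = function(e) {
    # Report the failing URL together with the underlying error message
    message(url, " did not work out: ", conditionMessage(e))
    FALSE
  })
}

# Hypothetical wrapper: tries every URL and returns the ones that failed.
download_all <- function(urls, destfiles, downloader = download.file) {
  ok <- vapply(
    seq_along(urls),
    function(i) safe_download(urls[i], destfiles[i], downloader),
    logical(1)
  )
  urls[!ok]
}
```

Inside the loop from the question, the bare download.file call could then become something like if (!safe_download(extract_pdf_link, paste0(name, "_paper.pdf"))) next, so that a failed download prints a message and the loop moves on to the next page.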

