Download all PDF files from a website using Python


I have followed several online guides in an attempt to build a script that can identify and download all pdfs from a website to save me from doing it manually. Here is my code so far:

from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib

# connect to website and get list of all pdfs
url = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"
response = request.urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(.pdf)'))

# clean the pdf link names
url_list = []
for el in links:
    url_list.append(("http://www.gatsby.ucl.ac.uk/teaching/courses/" + el['href']))

# print(url_list)

# download the pdfs to a specified location
for url in url_list:
    fullfilename = os.path.join(r'E:\webscraping', url.replace("http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/", "").replace(".pdf",""))
    request.urlretrieve(url, fullfilename)

The code appears to find all the PDFs (uncomment the print(url_list) line to see this). However, it fails at the download stage. In particular, I get the following error and cannot work out what has gone wrong:

E:\webscraping>python get_pdfs.py
Traceback (most recent call last):
  File "get_pdfs.py", line 26, in <module>
    request.urlretrieve(url, fullfilename)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
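
A 404 from urlretrieve means the server has nothing at the URL that was requested, so a reasonable first step is to find out which of the constructed URLs is wrong. The sketch below is one way to do that and is not part of the original script: it wraps each download in a try/except and collects the failures instead of stopping at the first one. The helper name `download_all`, the `dest_for` callback, and the injectable `fetch` parameter are all assumptions made for illustration; `fetch` exists only so the function can be exercised without network access.

```python
from urllib import request
from urllib.error import HTTPError


def download_all(urls, dest_for, fetch=request.urlretrieve):
    """Try every URL and return a list of (url, status_code) failures.

    `dest_for` maps a URL to a local file path. `fetch` defaults to
    urllib's urlretrieve. Both names are illustrative, not from the post.
    """
    failures = []
    for url in urls:
        try:
            fetch(url, dest_for(url))
        except HTTPError as err:
            # err.code holds the HTTP status of the failed request, e.g. 404
            print(f"{err.code} for {url}")
            failures.append((url, err.code))
    return failures
```

Printing the failing URL and pasting it into a browser usually shows quickly whether the prefix or the href portion is at fault.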



Check out the following implementation. I've used the requests module instead of urllib to do the download. I've also used the .select() method instead of .find_all() to avoid using re.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"

# If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        # Resolve the href against the page URL, then save the response body
        f.write(requests.get(urljoin(url, link['href'])).content)
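
The urljoin import matters because href values on a page can be relative to the page, root-relative, or already absolute, and concatenating a fixed prefix handles only the first of these. The following sketch shows how urljoin resolves each form against the page URL. The href values are made-up examples chosen to cover the three cases, not links taken from the actual page.

```python
from urllib.parse import urljoin

page_url = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"

# Hypothetical href values, one per common link form
hrefs = [
    "ml1-2016/lect1.pdf",            # relative to the page's directory
    "/files/notes.pdf",              # root-relative
    "http://example.com/paper.pdf",  # already absolute
]

# urljoin resolves each href the way a browser would for this page
resolved = [urljoin(page_url, href) for href in hrefs]
for full_url in resolved:
    print(full_url)

# The local file name is the last path segment of the href
filenames = [href.split('/')[-1] for href in hrefs]
```

Because urljoin leaves absolute URLs untouched, the same call can be applied to every link without checking its form first.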
