Question
I'm working on making a PDF Web Scraper in Python. Essentially, I'm trying to scrape all of the lecture notes from one of my courses, which are in the form of PDFs. I want to enter a url, and then get the PDFs and save them in a directory in my laptop. I've looked at several tutorials, but I'm not entirely sure how to go about doing this. None of the questions on StackOverflow seem to be helping me either.
Here's what I have so far:
import requests
from bs4 import BeautifulSoup
import shutil
bs = BeautifulSoup
url = input("Enter the URL you want to scrape from: ")
print("")
suffix = ".pdf"
link_list = []
def getPDFs():
    # Gets URL from user to scrape
    response = requests.get(url, stream=True)
    soup = bs(response.text)
    #for link in soup.find_all('a'): # Finds all links
    #    if suffix in str(link): # If the link ends in .pdf
    #        link_list.append(link.get('href'))
    #print(link_list)
    with open('CS112.Lecture.09.pdf', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response
    print("PDF Saved")

getPDFs()
Originally, I had gotten all of the links to the PDFs, but did not know how to download them; the code for that is now commented out.
Now I've gotten to the point where I'm trying to download just one PDF; and a PDF does get downloaded, but it's a 0KB file.
If it's of any use, I'm using Python 3.4.2
Recommended answer
If this is something that does not require being logged in, you can use urlretrieve():
from urllib.request import urlretrieve

for link in link_list:
    # Pass a filename, otherwise urlretrieve saves to a temporary file
    urlretrieve(link, link.rsplit('/', 1)[-1])
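Putting the two pieces together, here is a minimal sketch of the whole flow: collect the .pdf links from the page, resolve relative hrefs against the page URL, then download each one with urlretrieve(). The function names and the example URL are made up for illustration; the helpers assume the course page is publicly accessible (no login).

```python
import os
from urllib.parse import urljoin
from urllib.request import urlretrieve

import requests
from bs4 import BeautifulSoup


def find_pdf_links(page_url, html):
    """Return absolute URLs for every <a href> on the page ending in .pdf."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.lower().endswith(".pdf"):
            # Resolve relative hrefs (e.g. "notes/Lecture.01.pdf")
            # against the page URL so urlretrieve gets a full URL.
            links.append(urljoin(page_url, href))
    return links


def download_pdfs(page_url, dest_dir="."):
    """Fetch the page, then download each linked PDF into dest_dir."""
    html = requests.get(page_url).text
    for link in find_pdf_links(page_url, html):
        filename = os.path.join(dest_dir, link.rsplit("/", 1)[-1])
        # urlretrieve fetches the PDF bytes themselves, not the page HTML,
        # which avoids the 0KB-file problem from the question.
        urlretrieve(link, filename)
        print("Saved", filename)
```

Note that the original 0KB file happened because accessing `response.text` consumes the stream, so `response.raw` is already empty by the time `copyfileobj` runs; and in any case that response held the page HTML, not a PDF. Downloading each extracted link separately, as above, sidesteps both issues.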