Question
I am trying to automatically download PDFs from URLs like this to make a library of UN resolutions.
If I use beautiful soup or mechanize to open that URL, I get "Your browser does not support frames" -- and I get the same thing if I use the "Copy as cURL" feature in Chrome dev tools.
The standard advice for "Your browser does not support frames" when using mechanize or beautiful soup is to follow the source of each individual frame and load that frame. But if I do so, I get an error message saying the page is not authorized.
How can I proceed? I guess I could try this in zombie or phantom, but I would prefer not to use those tools as I am not that familiar with them.
Answer
Okay, this was an interesting task to do with requests and BeautifulSoup.
There is a series of underlying calls to un.org and daccess-ods.un.org that set the relevant cookies. This is why you need to maintain a requests.Session() and visit several URLs before you can get at the PDF.
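As background, here is a minimal sketch of why the Session matters (the URL is the one from the question; exactly which cookies the server sets may change over time):

import requests

session = requests.Session()

# Each response's Set-Cookie headers are stored on the session...
session.get('http://www.un.org/en/ga/search/')
print(session.cookies.get_dict())  # whatever cookies the server has set so far

# ...and sent back automatically on every later request through the same
# session, which a series of bare requests.get() calls would not do.
session.get('http://www.un.org/en/ga/search/')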
Here is the complete code:
import re
from urllib.parse import urljoin  # Python 3; on Python 2 use: from urlparse import urljoin

from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://www.un.org/en/ga/search/'
URL = "http://www.un.org/en/ga/search/view_doc.asp?symbol=A/RES/68/278"
BASE_ACCESS_URL = 'http://daccess-ods.un.org'

# start session
session = requests.Session()
response = session.get(URL, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})

# get frame links
soup = BeautifulSoup(response.text, 'html.parser')
frames = soup.find_all('frame')
header_link, document_link = [urljoin(BASE_URL, frame.get('src')) for frame in frames]

# get header
session.get(header_link, headers={'Referer': URL})

# get document html url from the meta refresh tag
response = session.get(document_link, headers={'Referer': URL})
soup = BeautifulSoup(response.text, 'html.parser')
content = soup.find('meta', content=re.compile('URL='))['content']
document_html_link = re.search('URL=(.*)', content).group(1)
document_html_link = urljoin(BASE_ACCESS_URL, document_html_link)

# follow html link and get the pdf link
response = session.get(document_html_link)
soup = BeautifulSoup(response.text, 'html.parser')

# get the real document link, again from a meta refresh tag
content = soup.find('meta', content=re.compile('URL='))['content']
document_link = re.search('URL=(.*)', content).group(1)
document_link = urljoin(BASE_ACCESS_URL, document_link)
print(document_link)

# follow the frame link with login and password first - this sets the important cookie
auth_link = soup.find('frame', {'name': 'footer'})['src']
session.get(auth_link)

# download file in 1024-byte chunks
with open('document.pdf', 'wb') as handle:
    response = session.get(document_link, stream=True)
    for block in response.iter_content(1024):
        if not block:
            break
        handle.write(block)
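A small defensive tweak you might add (my suggestion, not part of the original answer): check that the response really is a PDF before writing it, because when one of the cookie-setting steps fails, the server tends to return an HTML error page instead:

# Replace the plain download above with a version that checks the file signature.
response = session.get(document_link, stream=True)
first_block = next(response.iter_content(1024))
if not first_block.startswith(b'%PDF'):
    raise RuntimeError('Expected a PDF - one of the cookie-setting requests probably failed')
with open('document.pdf', 'wb') as handle:
    handle.write(first_block)
    for block in response.iter_content(1024):
        handle.write(block)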
You should probably extract separate blocks of code into functions to make it more readable and reusable.
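For example, the two meta-refresh hops could collapse into a single helper (a sketch; follow_meta_refresh is a hypothetical name, not from the original answer):

import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def follow_meta_refresh(session, url, base_url):
    """Load a page and return the absolute URL from its meta refresh tag."""
    response = session.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    content = soup.find('meta', content=re.compile('URL='))['content']
    return urljoin(base_url, re.search('URL=(.*)', content).group(1))

With that, getting document_html_link and then the final document_link becomes one call each, e.g. follow_meta_refresh(session, document_html_link, BASE_ACCESS_URL).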
FYI, all of this could be done more easily through a real browser with the help of selenium or Ghost.py.
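If you do go the browser route, here is one possible shape for it (a sketch, assuming Chrome plus chromedriver are installed; handing the browser's cookies to requests for the download is one common pattern, not the only one):

import requests
from selenium import webdriver

URL = 'http://www.un.org/en/ga/search/view_doc.asp?symbol=A/RES/68/278'

# Let a real browser load the page: it follows all the frames
# and collects the session cookies on its own.
driver = webdriver.Chrome()
driver.get(URL)

# Copy the browser's cookies into a requests session for the actual download.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])
driver.quit()

# You would still need to locate the final PDF URL (e.g. from the frame
# sources) before calling session.get() on it, as in the requests-only
# version above.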
Hope that helps.