Problem Description
I'm building a broken-link checker in Python, and writing the logic to correctly identify links that do not resolve when visited in a browser is becoming a chore. I've found a set of links with which I can consistently reproduce a redirect error in my scraper, but which resolve perfectly when visited in a browser. I was hoping to find some insight here.
import urllib
import urllib.request
import html.parser
import requests
from requests.exceptions import HTTPError
from socket import error as SocketError

# Example URL that reproduces the error (see below).
url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'

try:
    # Build a request with browser-like headers.
    req = urllib.request.Request(url, None, {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive'})
    response = urllib.request.urlopen(req)
    raw_response = response.read().decode('utf8', errors='ignore')
    response.close()
except urllib.request.HTTPError as inst:
    output = format(inst)
    print(output)
In this instance, an example of a URL that reliably returns this error is 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'. It resolves perfectly when visited in a browser, but the code above returns the following error:
HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently
Any ideas how I can correctly identify these links as functional without blindly ignoring links from that site (which might miss genuinely broken links)?
Recommended Answer
You get the infinite loop error because the page you want to scrape uses cookies and redirects when the cookie isn't sent by the client. You'll get the same error with most other scraper tools and also with browsers when you disallow cookies.
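A quick way to confirm this behaviour is to make a single request without following redirects and look at the response headers. This is only a minimal diagnostic sketch, not part of the original answer; it assumes the requests library (already imported in the question) and uses the URL from the question:

import requests

# One request, no redirect following, to inspect the server's first response.
url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'
resp = requests.get(url, allow_redirects=False)
print(resp.status_code)                # 301 on the cookie-less first request
print(resp.headers.get('Set-Cookie'))  # cookie the server expects on the retry
print(resp.headers.get('Location'))    # redirect target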
You need an http.cookiejar.CookieJar and a urllib.request.HTTPCookieProcessor to avoid the redirect loop:
import urllib
import urllib.request
import html.parser
import requests
from requests.exceptions import HTTPError
from socket import error as SocketError
from http.cookiejar import CookieJar

url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'

try:
    req = urllib.request.Request(url, None, {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive'})
    # Keep cookies between requests so the server's cookie-based
    # redirect completes instead of looping forever.
    cj = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    response = opener.open(req)
    raw_response = response.read().decode('utf8', errors='ignore')
    response.close()
except urllib.request.HTTPError as inst:
    output = format(inst)
    print(output)
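For what it's worth, requests (which the question already imports) handles this case as well: a Session object stores cookies between requests automatically, so the cookie-triggered redirect resolves on its own. This is only a sketch of an alternative, not part of the recommended answer, and the User-Agent value here is just a placeholder:

import requests

# Alternative sketch: requests.Session keeps cookies automatically,
# so the redirect chain completes instead of looping.
url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'
with requests.Session() as session:
    session.headers.update({'User-Agent': 'Mozilla/5.0'})
    resp = session.get(url, timeout=10)
    print(resp.status_code)  # expected 200 once the cookie round-trip succeeds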