Problem description
I wrote a web crawler that follows every link down to the first level of a page and, from each of those pages, extracts all links with their text, plus image links with their alt text. Here is the whole code:
import urllib
import re
import time
from threading import Thread
import MySQLdb
import mechanize
import readability
from bs4 import BeautifulSoup
from readability.readability import Document
import urlparse

url = ["http://sparkbrowser.com"]

i=0

while i<len(url):

    counterArray = [0]

    levelLinks = []
    linkText = ["homepage"]
    levelLinks = []

    def scraper(root,steps):
        urls = [root]
        visited = [root]
        counter = 0
        while counter < steps:
            step_url = scrapeStep(urls)
            urls = []
            for u in step_url:
                if u not in visited:
                    urls.append(u)
                    visited.append(u)
                    counterArray.append(counter +1)
            counter +=1
        levelLinks.append(visited)
        return visited

    def scrapeStep(root):
        result_urls = []
        br = mechanize.Browser()
        br.set_handle_robots(False)
        br.set_handle_equiv(False)
        br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

        for url in root:
            try:
                br.open(url)
                for link in br.links():
                    newurl = urlparse.urljoin(link.base_url, link.url)
                    result_urls.append(newurl)
                    #levelLinks.append(newurl)
            except:
                print "error"
        return result_urls

    scraperOut = scraper(url[i],1)

    for sl,ca in zip(scraperOut,counterArray):
        print "\n\n",sl," Level - ",ca,"\n"

        #Mechanize
        br = mechanize.Browser()
        page = br.open(sl)
        br.set_handle_robots(False)
        br.set_handle_equiv(False)
        br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

        #BeautifulSoup
        htmlcontent = page.read()
        soup = BeautifulSoup(htmlcontent)

        for linkins in br.links(text_regex=re.compile('^((?!IMG).)*$')):
            newesturl = urlparse.urljoin(linkins.base_url, linkins.url)
            linkTxt = linkins.text
            print newesturl,linkTxt

        for linkwimg in soup.find_all('a', attrs={'href': re.compile("^http://")}):
            imgSource = linkwimg.find('img')
            if linkwimg.find('img',alt=True):
                imgLink = linkwimg['href']
                #imageLinks.append(imgLink)
                imgAlt = linkwimg.img['alt']
                #imageAlt.append(imgAlt)
                print imgLink,imgAlt
            elif linkwimg.find('img',alt=False):
                imgLink = linkwimg['href']
                #imageLinks.append(imgLink)
                imgAlt = ['No Alt']
                #imageAlt.append(imgAlt)
                print imgLink,imgAlt

    i+=1
Everything works fine until the crawler reaches one of the Facebook links, which it cannot read, and it gives me this error:
httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
on line 68, which is: page = br.open(sl)
I don't know why, because as you can see, I have set mechanize's set_handle_robots and addheaders options.
I don't know why that happens, but I noticed that I get this error for Facebook links (in this case facebook.com/sparkbrowser) and for Google too.
Any help or advice is welcome.
Cheers
Recommended answer
OK, so the same problem came up in this question:
Sending all the request headers a normal browser would send, and accepting and sending back the cookies the server sends, should resolve the issue.
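As a rough sketch of that advice, the helper below applies browser-like headers and a cookie jar to a mechanize browser before any page is opened. The header values and the names BROWSER_HEADERS, configure_browser and make_browser are illustrative assumptions, not part of the original answer. It is also worth noting that in the question's main loop br.open(sl) runs before br.set_handle_robots(False), so the robots.txt setting is not yet in effect for that request. Doing all configuration first avoids that ordering problem.

```python
# Illustrative header values (assumptions, not taken from the answer).
BROWSER_HEADERS = [
    ("User-agent",
     "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"),
    ("Accept",
     "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ("Accept-Language", "en-US,en;q=0.5"),
]


def configure_browser(br, cookiejar, headers=BROWSER_HEADERS):
    """Apply every setting to the browser before its first open() call."""
    br.set_handle_robots(False)    # do not fetch or obey robots.txt
    br.set_handle_equiv(False)     # ignore <meta http-equiv> headers
    br.set_cookiejar(cookiejar)    # store and resend cookies from the server
    br.addheaders = list(headers)  # replace the default request headers
    return br


def make_browser():
    """Build a configured mechanize browser (needs the mechanize package)."""
    import mechanize  # third-party; imported here so the helper above stays testable
    return configure_browser(mechanize.Browser(), mechanize.CookieJar())
```

In the question's loop this would mean creating the browser with br = make_browser() and only then calling page = br.open(sl). Whether a given site still refuses the request depends on that site, so this is a starting point rather than a guaranteed fix.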