This article covers how to dynamically add domains to allowed_domains in a Scrapy spider. It should be a useful reference for anyone hitting the same problem; read on with the editor to learn more!
Problem Description
I have a spider that starts with a small list of allowed_domains at the beginning of the spidering. I need to add more domains dynamically to this whitelist as the spidering continues from within a parser, but the following piece of code does not get that accomplished, since subsequent requests are still being filtered. Is there another way of updating allowed_domains within the parser?
from scrapy.spider import BaseSpider
from scrapy.http import Request
from bs4 import BeautifulSoup
import urlparse  # Python 2; use urllib.parse on Python 3

class APSpider(BaseSpider):
    name = "APSpider"

    allowed_domains = ["www.somedomain.com"]

    start_urls = [
        "http://www.somedomain.com/list-of-websites",
    ]

    ...

    def parse(self, response):
        soup = BeautifulSoup(response.body)
        for link_tag in soup.findAll('td', {'class': 'half-width'}):
            _website = link_tag.find('a')['href']
            u = urlparse.urlparse(_website)
            # Appending here does not stop the filtering described above.
            self.allowed_domains.append(u.netloc)
            yield Request(url=_website, callback=self.parse_secondary_site)

    ...
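For background on why the append has no effect: Scrapy's OffsiteMiddleware compiles allowed_domains into a host-matching regex once, when the spider is opened, so changes to the list made later during the crawl are never seen by the middleware. One per-request workaround, sketched here as an aside rather than as part of the original question, is to create the Request with dont_filter=True, which OffsiteMiddleware lets through unchecked:

            # Sketch (not from the original question): dont_filter=True
            # makes OffsiteMiddleware (and the duplicate filter) pass this
            # request through even though its domain is not whitelisted.
            yield Request(url=_website,
                          callback=self.parse_secondary_site,
                          dont_filter=True)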
Recommended Answer
You can try something like the following:
# Imports are the same as in the question above.

class APSpider(BaseSpider):
    name = "APSpider"

    start_urls = [
        "http://www.somedomain.com/list-of-websites",
    ]

    def __init__(self):
        # Start with an empty whitelist: OffsiteMiddleware compiles its
        # host regex only once, when the spider opens, and with no
        # allowed_domains it filters nothing. Filtering is done by hand below.
        self.allowed_domains = None

    def parse(self, response):
        soup = BeautifulSoup(response.body)

        if not self.allowed_domains:
            # First pass: build the whitelist from the listing page.
            self.allowed_domains = []
            for link_tag in soup.findAll('td', {'class': 'half-width'}):
                _website = link_tag.find('a')['href']
                u = urlparse.urlparse(_website)
                self.allowed_domains.append(u.netloc)
                yield Request(url=_website, callback=self.parse_secondary_site)

        # Manual offsite check: compare hosts, not full URLs.
        if urlparse.urlparse(response.url).netloc in self.allowed_domains:
            yield Request(...)

...
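The mechanics behind this answer: when allowed_domains is left empty, the host regex that OffsiteMiddleware builds at spider startup matches every host, so no request is filtered out and the spider is free to enforce its own whitelist manually. A minimal sketch of such a manual check in the secondary callback (the body of parse_secondary_site is an assumption for illustration, not part of the original answer):

    def parse_secondary_site(self, response):
        # Manual offsite check: compare the response host against the
        # whitelist gathered in parse(), since the middleware no longer does.
        netloc = urlparse.urlparse(response.url).netloc
        if self.allowed_domains and netloc not in self.allowed_domains:
            return  # skip pages from hosts that were never whitelisted
        # ... normal parsing of the secondary site would go here ...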
That concludes this article on dynamically adding to allowed_domains in a Scrapy spider. We hope the recommended answer is helpful, and thank you for your support!