Problem description
I am new to Python and Scrapy. I used the method from the blog post Running multiple scrapy spiders programmatically to run my spiders from a Flask app. Here is the code:
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from twisted.internet import reactor
# DmozSpider, EPGDspider and GDSpider are imported from the project's spider modules

# list of crawlers
TO_CRAWL = [DmozSpider, EPGDspider, GDSpider]

# crawlers that are running
RUNNING_CRAWLERS = []


def spider_closing(spider):
    """Activates on spider closed signal"""
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()


# start logger
log.start(loglevel=log.DEBUG)

# set up the crawler and start to crawl one spider at a time
for spider in TO_CRAWL:
    settings = Settings()
    # crawl responsibly
    settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)")
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)
    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

# blocks process; so always keep as the last statement
reactor.run()
Here is my spider code:
import scrapy
from scrapy.http import Request
from scrapy.selector import Selector
# EPGD is the item class defined in the project's items module


class EPGDspider(scrapy.Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery="+term+"&submit=Feeling+Lucky"]
    MONGODB_DB = name + "_" + term
    MONGODB_COLLECTION = name + "_" + term

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url+map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url+map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')
        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i+1 < len(url_list[0]):
                    print url_list[0][i+1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i+1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"
As you can see, there is a parameter term = 'man' in my code, and it is part of my start_urls. I don't want this parameter to be fixed, so how can I set the start URL or the term parameter dynamically from my program? When running a spider from the command line there is a way to pass an argument, like this:
class MySpider(BaseSpider):
    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

And start it like: scrapy crawl my_spider -a start_url="http://some_url"
Can anybody tell me how to deal with this?
Recommended answer
First of all, to run multiple spiders in a script, the recommended way is to use scrapy.crawler.CrawlerProcess, where you pass spider classes and not spider instances.
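A minimal sketch of that approach (assuming DmozSpider, EPGDspider and GDSpider are importable from your project; the user agent string is a placeholder):

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

settings = Settings()
settings.set("USER_AGENT", "my-crawler (+http://example.com)")  # placeholder

# CrawlerProcess manages the Twisted reactor for you
process = CrawlerProcess(settings)

# pass the spider classes, not instances
process.crawl(DmozSpider)
process.crawl(EPGDspider)
process.crawl(GDSpider)

# blocks until all crawls are finished
process.start()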
To pass arguments to your spider with CrawlerProcess, you just have to add the arguments to the .crawl() call, after the spider subclass, e.g.
process.crawl(DmozSpider, term='someterm', someotherterm='anotherterm')
Arguments passed this way are then available as spider attributes (the same as with -a term=someterm on the command line).
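For instance, inside any spider callback the argument can be read as an instance attribute; a minimal sketch (the spider name, example URL and logging call are illustrative, not from the question):

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # self.term was set by process.crawl(DemoSpider, term='someterm')
        # or by `scrapy crawl demo -a term=someterm`
        self.log("term argument: %s" % self.term)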
Finally, instead of building start_urls in __init__, you can achieve the same thing with start_requests, and build the initial requests like this, using self.term:
def start_requests(self):
    yield Request("http://epgd.biosino.org/"
                  "EPGD/search/textsearch.jsp?"
                  "textquery={}"
                  "&submit=Feeling+Lucky".format(self.term))