Problem Description
I am trying to crawl sites in a very basic manner, but Scrapy isn't crawling all the links. I will explain the scenario as follows -
main_page.html -> contains links to a_page.html, b_page.html, c_page.html
a_page.html -> contains links to a1_page.html, a2_page.html
b_page.html -> contains links to b1_page.html, b2_page.html
c_page.html -> contains links to c1_page.html, c2_page.html
a1_page.html -> contains link to b_page.html
a2_page.html -> contains link to c_page.html
b1_page.html -> contains link to a_page.html
b2_page.html -> contains link to c_page.html
c1_page.html -> contains link to a_page.html
c2_page.html -> contains link to main_page.html
I am using the following rule in CrawlSpider -
Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)
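For context, here is a minimal sketch of how that rule might sit inside a CrawlSpider. The spider name, allowed domain, and start URL are assumptions inferred from the crawl log below, and the imports follow the old Scrapy 0.x contrib API that SgmlLinkExtractor belongs to:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestSpider(CrawlSpider):
    # name and URLs are assumptions based on the log output shown below
    name = 'test_spider'
    allowed_domains = ['localhost']
    start_urls = ['http://localhost/main_page.html']

    # Follow every extracted link and hand each response to parse_item
    rules = (
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('Visited %s' % response.url)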
But the crawl result is as follows -
DEBUG: Crawled (200) <http://localhost/main_page.html> (referer: None)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a_page.html> (referer: http://localhost/main_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a1_page.html> (referer: http://localhost/a_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b_page.html> (referer: http://localhost/a1_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b1_page.html> (referer: http://localhost/b_page.html)
2011-12-05 09:56:07+0530 [test_spider] INFO: Closing spider (finished)
It is not crawling all the pages.
NB - I have configured the crawl to run in BFO (breadth-first order), as indicated in the Scrapy documentation.
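(For reference, recent Scrapy documentation describes BFO as a settings change along the following lines; the exact queue module paths vary between versions, so treat this as a sketch rather than the 0.14 syntax:)

# settings.py -- breadth-first crawl order, per current Scrapy docs (module paths differ in older releases)
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'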
What am I missing?
Recommended Answer
I had a similar problem today, although I was using a custom spider. It turned out that the website was limiting my crawl because my user agent was scrapy-bot.
Try changing your user agent and crawling again, setting it to that of a known browser, for example.
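A minimal sketch of that in settings.py; the browser string below is only an example, not something from the original question:

# settings.py -- present the crawler as a regular browser (example UA string)
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0'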
Another thing you might want to try is adding a delay. Some websites block scraping if the time between requests is too short. Try adding a DOWNLOAD_DELAY of 2 and see if that helps.
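As a sketch, the corresponding settings.py entry would be:

# settings.py -- wait 2 seconds between consecutive requests to the same site
DOWNLOAD_DELAY = 2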
More information about DOWNLOAD_DELAY is available at http://doc.scrapy.org/en/0.14/topics/settings.html