Why does Scrapy return an iframe?
Problem description
I want to crawl this site with Python-Scrapy.

I tried this:
import scrapy

class Parik(scrapy.Spider):
    name = "ooshop"
    # allowed_domains should list domain names, not full URLs
    allowed_domains = ["www.ooshop.com"]

    def __init__(self, idcrawl=None, proxy=None, *args, **kwargs):
        super(Parik, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.ooshop.com/courses-en-ligne/Home.aspx']

    def parse(self, response):
        # Dump the <body> of the downloaded page
        print(response.css('body').extract_first())
But I don't get the first page; I get an empty iframe:
2016-09-06 19:09:24 [scrapy] DEBUG: Crawled (200) <GET http://www.ooshop.com/courses-en-ligne/Home.aspx> (referer: None)
<body>
<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>
</body>
2016-09-06 19:09:24 [scrapy] INFO: Closing spider (finished)
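For reference, one way to reproduce this outside a full Scrapy project is to drive the spider with CrawlerProcess; the minimal sketch below assumes the Parik class shown above is defined in the same script (that standalone layout is my assumption, not part of the question):

from scrapy.crawler import CrawlerProcess

# Assumes the Parik spider class shown above is defined in this file
process = CrawlerProcess()
process.crawl(Parik)
process.start()  # blocks until the crawl has finished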
Recommended answer
The website is protected by Incapsula, a website security service. It serves your "browser" a challenge that it must complete before it is handed a special cookie that grants access to the website itself.
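If you want the spider to recognise this situation explicitly, a rough check is to look for the Incapsula test iframe that appeared in the log output above. This is only an illustrative sketch; treating the gaIframe element and the incapsula.com src as a reliable "challenge page" marker is my assumption:

import scrapy

class Parik(scrapy.Spider):
    name = "ooshop"
    start_urls = ['http://www.ooshop.com/courses-en-ligne/Home.aspx']

    def parse(self, response):
        # '#gaIframe' and the incapsula.com src come from the log output above;
        # using them as a challenge-page marker is an illustrative assumption.
        if response.css('iframe#gaIframe[src*="incapsula.com"]'):
            self.logger.warning("Received the Incapsula challenge page, not the real content")
            return
        print(response.css('body').extract_first())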
Fortunately, it's not that hard to bypass. Install incapsula-cracker and enable its downloader middleware:
DOWNLOADER_MIDDLEWARES = {
    'incapsula.IncapsulaMiddleware': 900
}
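As an alternative to putting this in the project-wide settings.py, Scrapy also accepts per-spider settings through the custom_settings class attribute. A minimal sketch, assuming the same Parik spider as in the question and that the incapsula-cracker package has already been installed:

import scrapy

class Parik(scrapy.Spider):
    name = "ooshop"
    start_urls = ['http://www.ooshop.com/courses-en-ligne/Home.aspx']

    # Enable the incapsula-cracker middleware only for this spider
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'incapsula.IncapsulaMiddleware': 900,
        },
    }

    def parse(self, response):
        print(response.css('body').extract_first())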