Why does Scrapy return an iframe?

Problem Description

I want to crawl this site with Python-Scrapy.

I tried this:

import scrapy

class Parik(scrapy.Spider):
    name = "ooshop"
    # allowed_domains should contain domain names, not full URLs
    allowed_domains = ["www.ooshop.com"]

    def __init__(self, idcrawl=None, proxy=None, *args, **kwargs):
        super(Parik, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.ooshop.com/courses-en-ligne/Home.aspx']

    def parse(self, response):
        print(response.css('body').extract_first())

But I don't get the first page; instead I get an empty iframe:

2016-09-06 19:09:24 [scrapy] DEBUG: Crawled (200) <GET http://www.ooshop.com/courses-en-ligne/Home.aspx> (referer: None)
<body>
<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>
</body>
2016-09-06 19:09:24 [scrapy] INFO: Closing spider (finished)


Recommended Answer

The website is protected by Incapsula, a website security service. It presents your browser with a challenge that must be completed before the browser is given a special cookie granting access to the site itself.
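As a quick check (not the recommended fix), you can confirm that this cookie is what unlocks the site by copying the Incapsula cookies from a browser session that has already passed the challenge and attaching them to your Scrapy requests. A minimal sketch, assuming the cookie names and values are taken from the browser's developer tools; the spider name and the cookie names below are placeholders, not exact values:

import scrapy

class OoshopCookieSpider(scrapy.Spider):
    # Hypothetical spider that reuses Incapsula cookies captured from a real browser.
    name = "ooshop_cookies"

    # Placeholder names/values -- copy the real ones from the browser's dev tools
    # after the Incapsula challenge has completed.
    incapsula_cookies = {
        "visid_incap_XXXXXX": "<value-from-browser>",
        "incap_ses_XXX_XXXXXX": "<value-from-browser>",
    }

    def start_requests(self):
        yield scrapy.Request(
            "http://www.ooshop.com/courses-en-ligne/Home.aspx",
            cookies=self.incapsula_cookies,
            callback=self.parse,
        )

    def parse(self, response):
        # With valid cookies the response body should be the real page,
        # not the hidden Incapsula iframe.
        self.logger.info(response.css("title::text").extract_first())

These cookies expire, so this only works as a short-lived test; the middleware described below automates the process.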

Fortunately, it is not that hard to bypass. Install incapsula-cracker and enable its downloader middleware in your Scrapy settings:

DOWNLOADER_MIDDLEWARES = {
    'incapsula.IncapsulaMiddleware': 900
}
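
If you prefer to keep the project-wide settings untouched, the same middleware can be enabled for this spider alone through Scrapy's custom_settings class attribute. A minimal sketch under that assumption, reusing the middleware path from the answer above:

import scrapy

class Parik(scrapy.Spider):
    name = "ooshop"
    start_urls = ["http://www.ooshop.com/courses-en-ligne/Home.aspx"]

    # Enable incapsula-cracker's downloader middleware for this spider only.
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "incapsula.IncapsulaMiddleware": 900,
        }
    }

    def parse(self, response):
        # Once the middleware has solved the Incapsula challenge, the body
        # should contain the real page instead of the hidden iframe.
        print(response.css("body").extract_first()[:200])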
