Basic Concepts of Crawling Web Pages with Scrapy

How do you generate a project with Scrapy?

scrapy startproject xxx

How do you crawl a web page with Scrapy?

import scrapy
from scrapy.spiders import CrawlSpider  # scrapy.contrib was removed in newer Scrapy versions; use scrapy.spiders
from scrapy.http import Request
from scrapy.selector import Selector

data = selector.xpath('...').extract()  # extract the matched nodes as a list of strings
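For a concrete picture, here is a minimal spider sketch. It targets the public demo site http://quotes.toscrape.com, and the spider name, start URL, and XPath expressions are illustrative assumptions rather than part of the original text; adapt them to the site you actually want to crawl.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # hypothetical spider name
    start_urls = ["http://quotes.toscrape.com/"]  # public demo site, used only for illustration

    def parse(self, response):
        # response.xpath() returns a SelectorList; .extract() / .extract_first()
        # turn the matched nodes into plain strings.
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                "text": quote.xpath('./span[@class="text"]/text()').extract_first(),
                "author": quote.xpath('.//small[@class="author"]/text()').extract_first(),
            }

From inside the project directory, running scrapy crawl quotes -o quotes.json starts the spider and writes the yielded items to a JSON file.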

The File Structure of a Scrapy Project

A project contains (a typical layout is sketched after this list):

  • items.py
  • settings.py
  • pipelines.py
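
For reference, the layout generated by scrapy startproject xxx looks roughly like the sketch below; the exact contents vary by Scrapy version (newer versions also generate a middlewares.py).

xxx/
    scrapy.cfg            # deploy/configuration file
    xxx/                  # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares (newer versions)
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # put your spider modules here
            __init__.py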

1. items.py


Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields. (Scrapy official documentation)


items.py defines the data that needs to be scraped and post-processed.
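
A minimal items.py sketch, assuming we want to collect a quote's text and author (the class and field names are illustrative):

import scrapy


class QuoteItem(scrapy.Item):
    # Each Field() declares one piece of data the spider will fill in.
    text = scrapy.Field()
    author = scrapy.Field()

In a spider, yield QuoteItem(text=..., author=...) instead of a plain dict, and Scrapy will pass it through the item pipeline like any other item.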

2. settings.py


The Scrapy settings allows you to customize the behaviour of all Scrapy components, including the core, extensions, pipelines and spiders themselves. (Scrapy official documentation)


settings.py is where you configure Scrapy: override the user-agent, set the delay between requests, configure proxies, enable and order the various middlewares, and so on.
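
An illustrative settings.py fragment; the values below are examples rather than recommendations, and the pipeline path assumes the hypothetical JsonWriterPipeline sketched later in this post.

BOT_NAME = "xxx"

# Override the default user-agent sent with every request.
USER_AGENT = "Mozilla/5.0 (compatible; xxx-crawler/1.0)"

# Wait 2 seconds between requests to the same site.
DOWNLOAD_DELAY = 2

# Respect robots.txt (enabled by default in new projects).
ROBOTSTXT_OBEY = True

# Enable item pipelines; the number controls execution order (lower runs first).
ITEM_PIPELINES = {
    "xxx.pipelines.JsonWriterPipeline": 300,
}

Proxies are usually set per request through request.meta["proxy"] (handled by the built-in HttpProxyMiddleware) rather than through a single setting.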

3. pipelines.py


After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially. (Scrapy official documentation)


pipelines.py holds the post-processing logic for the scraped data, so that crawling and data processing stay separate.
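
A minimal pipeline sketch that writes each item to a JSON-lines file; the class name and output filename are illustrative, and the pipeline must be enabled through ITEM_PIPELINES in settings.py.

import json


class JsonWriterPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields; must return the item
        # (or raise DropItem) so later pipeline components can see it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item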
