Basic Concepts of Crawling Web Pages with Scrapy
How do you generate a project with Scrapy?
scrapy startproject xxx
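For example, running the command with a hypothetical project name tutorial (any name works) creates a skeleton like the following; the exact file list varies slightly between Scrapy versions, with newer releases also adding a middlewares.py:

scrapy startproject tutorial

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/          # directory where the spiders live
            __init__.py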
How do you crawl a web page with Scrapy?
import scrapy
from scrapy.spiders import CrawlSpider   # scrapy.contrib.spiders is deprecated; use scrapy.spiders
from scrapy.http import Request
from scrapy.selector import Selector

# Inside a spider callback, extract data with an XPath expression (xxxxx is a placeholder):
xxx = selector.xpath(xxxxx).extract()
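Putting these pieces together, a minimal spider sketch looks like the following. The spider name, the site quotes.toscrape.com, and the XPath expressions are assumptions chosen for illustration, not part of the original project:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"                                # used when running: scrapy crawl quotes
    start_urls = ["http://quotes.toscrape.com/"]   # Scrapy requests these URLs first

    def parse(self, response):
        # the response supports .xpath() directly; each dict yielded is one scraped item
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                "text": quote.xpath("./span[@class='text']/text()").extract_first(),
                "author": quote.xpath(".//small[@class='author']/text()").extract_first(),
            }

Running scrapy crawl quotes -o quotes.json writes the scraped items to a JSON file.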
Scrapy's file structure
A project contains:
- items.py
- settings.py
- pipelines.py
1. items.py
Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields. (Scrapy official documentation)
items.py defines the data that will be scraped and later processed.
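A minimal items.py sketch, assuming we want to collect a quote's text and author (the field names are examples only):

import scrapy

class QuoteItem(scrapy.Item):
    # each Field() declares one key the item may hold; the Item then behaves like a dict
    text = scrapy.Field()
    author = scrapy.Field()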
2. settings.py
The Scrapy settings allows you to customize the behaviour of all Scrapy components, including the core, extensions, pipelines and spiders themselves. (Scrapy official documentation)
settings.py configures Scrapy itself: changing the user-agent, setting the delay between requests, setting up proxies, enabling and configuring the various middlewares, and so on.
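A few commonly edited entries in settings.py might look like this; the values and the myproject.* paths are illustrative assumptions, not recommendations:

USER_AGENT = "Mozilla/5.0 (compatible; my-crawler)"   # custom user-agent sent with every request
DOWNLOAD_DELAY = 2                                    # wait 2 seconds between requests to the same site
ITEM_PIPELINES = {
    "myproject.pipelines.JsonWriterPipeline": 300,    # enable a pipeline; lower numbers run earlier
}
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 543,     # hypothetical middleware that sets a proxy
}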
3. pipelines.py
After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially. (Scrapy official documentation)
pipelines.py holds the post-processing logic that runs on scraped data, keeping data extraction and data processing separate.
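A minimal pipeline sketch that writes every item to a JSON-lines file; the class name and output file are assumptions, and the pipeline must also be enabled via ITEM_PIPELINES in settings.py:

import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # called for every scraped item; return it so later components can keep processing
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item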