1. What is Scrapy
Scrapy is a crawler framework built on Twisted. By customizing just a few modules, users can implement a complete crawler.
2. Scrapy's advantages
Without Scrapy, writing a crawler by hand means using urllib or Requests to send requests and then building everything else ourselves: a class for HTTP header handling, multithreading, a proxy wrapper, a de-duplication class, a data-storage class, and an exception-handling mechanism.
3. Scrapy architecture
Scrapy Engine: Scrapy's engine. It is responsible for the signals, messages, and communication passed between the Scheduler, Pipeline, Spiders, and Downloader.
Scheduler: Scrapy's scheduler. Put simply, it is a queue: it accepts Requests sent over by the Scrapy Engine and queues them, and when the engine asks for more work, the Scheduler hands requests from the queue back to the engine.
Downloader: Scrapy's downloader. It accepts Requests from the Scrapy Engine, sends them and downloads the data, generates Responses, and returns them to the engine, which passes the Responses on to the Spiders.
Spiders: Scrapy's spiders. This is where the crawling logic is written, e.g. parsing with regular expressions, BeautifulSoup, or XPath. If a Response contains a follow-up request, such as a "next page" link, the spider hands that URL to the Scrapy Engine, which passes it to the Scheduler for queuing (see the sketch after this list).
Pipeline: Scrapy's item pipeline. This is where de-duplication and storage classes live; it is responsible for post-processing such as filtering and storing the data.
Downloader Middlewares: downloader middleware. Custom extension components; this is where we add things like proxy handling and HTTP headers.
Spider Middlewares: spider middleware. It can wrap the Requests sent out by the Spiders and the Responses they receive.
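To make this data flow concrete, here is a minimal spider sketch; the site and XPath expressions are placeholders chosen for illustration, not taken from this post. Items yielded by parse() go to the Pipeline, while yielded Requests go back through the engine to the Scheduler.

import scrapy

class MinimalSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate the data flow described above
    name = 'minimal_example'
    start_urls = ['http://example.com/list?page=1']

    def parse(self, response):
        # Extracted data is yielded as items; the engine hands them to the Pipeline
        for title in response.xpath('//h2/text()').getall():
            yield {'title': title}
        # A "next page" URL is yielded as a Request; the engine hands it
        # to the Scheduler, which queues it for the Downloader
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)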
4. A Scrapy example
4.1 Crawling the Douban Movie Top 250
There are plenty of tutorials online for setting up a Scrapy project, so you can look that part up yourself; the basic commands are sketched below.
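For reference, creating the project with Scrapy's standard command-line tools looks roughly like this (the project name ScrapyTest matches the module paths used later in this post; the spider name is an assumption):

scrapy startproject ScrapyTest
cd ScrapyTest
scrapy genspider douban_movie movie.douban.com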
Custom proxy middleware. A locally hard-coded list of proxy IPs is used here; for large volumes of crawler requests you would need to plug in a third-party proxy service. The crawler's source IP can be disguised as one of the following proxies:
import random

class specified_proxy(object):
    def process_request(self, request, spider):
        # Randomly pick a proxy IP for each request
        PROXIES = [
            'http://183.207.95.27:80', 'http://111.6.100.99:80', 'http://122.72.99.103:80',
            'http://106.46.132.2:80', 'http://112.16.4.99:81', 'http://123.58.166.113:9000',
            'http://118.178.124.33:3128', 'http://116.62.11.138:3128', 'http://121.42.176.133:3128',
            'http://111.13.2.131:80', 'http://111.13.7.117:80', 'http://121.248.112.20:3128',
            'http://112.5.56.108:3128', 'http://42.51.26.79:3128', 'http://183.232.65.201:3128',
            'http://118.190.14.150:3128', 'http://123.57.221.41:3128', 'http://183.232.65.203:3128',
            'http://166.111.77.32:3128', 'http://42.202.130.246:3128', 'http://122.228.25.97:8101',
            'http://61.136.163.245:3128', 'http://121.40.23.227:3128', 'http://123.96.6.216:808',
            'http://59.61.72.202:8080', 'http://114.141.166.242:80', 'http://61.136.163.246:3128',
            'http://60.31.239.166:3128', 'http://114.55.31.115:3128', 'http://202.85.213.220:3128',
        ]
        # random.choice returns a single proxy string (random.sample would return a list)
        request.meta['proxy'] = random.choice(PROXIES)
Custom User-Agent middleware, so that the target server sees requests that appear to come from a real operating system and browser rather than from a bot:
class specified_useragent(object):
    def process_request(self, request, spider):
        # Randomly pick a User-Agent for each request
        USER_AGENT_LIST = [
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
            "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
            "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
            "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
            "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
            "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
            "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
            "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
        ]
        agent = random.choice(USER_AGENT_LIST)
        # Set the standard User-Agent header on the outgoing request
        request.headers['User-Agent'] = agent
After configuring the custom middlewares, register them in settings.py:
# The lower the number, the higher the priority
DOWNLOADER_MIDDLEWARES = {
    'ScrapyTest.middlewares.specified_proxy': 543,
    'ScrapyTest.middlewares.specified_useragent': 544,
}
Define the data fields in items.py:
import scrapy

class ScrapytestItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    serial_number = scrapy.Field()   # movie ranking number
    movie_name = scrapy.Field()      # movie title
    introduce = scrapy.Field()       # movie introduction
    star = scrapy.Field()            # rating
    evaluate = scrapy.Field()        # number of reviews
    describe = scrapy.Field()        # movie description
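The original post does not show the spider itself. A minimal sketch might look like the following; the spider name, start URL, and XPath expressions are assumptions based on the layout of the Douban Top 250 pages, not code from the post.

import scrapy
from ScrapyTest.items import ScrapytestItem

class DoubanMovieSpider(scrapy.Spider):
    # Hypothetical spider; the name and selectors are illustrative assumptions
    name = 'douban_movie'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        for movie in response.xpath('//div[@class="item"]'):
            item = ScrapytestItem()
            item['serial_number'] = movie.xpath('.//em/text()').get()
            item['movie_name'] = movie.xpath('.//span[@class="title"]/text()').get()
            item['introduce'] = movie.xpath('.//div[@class="bd"]/p/text()').get()
            item['star'] = movie.xpath('.//span[@class="rating_num"]/text()').get()
            item['evaluate'] = movie.xpath('.//div[@class="star"]/span[4]/text()').get()
            item['describe'] = movie.xpath('.//span[@class="inq"]/text()').get()
            yield item
        # Follow the "next page" link until the last page
        next_page = response.xpath('//span[@class="next"]/a/@href').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)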
Configure data storage in the pipeline, pipelines.py, which connects to MongoDB:
import pymongo

from ScrapyTest.settings import monodb_host, monodb_port, monodb_db_name, monodb_tb_name

class ScrapytestPipeline(object):
    def __init__(self):
        host = monodb_host
        port = monodb_port
        dbname = monodb_db_name
        sheetname = monodb_tb_name
        # Connect to MongoDB and keep a handle to the target collection
        client = pymongo.MongoClient(host=host, port=port)
        mydb = client[dbname]
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        # insert_one stores each item as a document in the collection
        self.post.insert_one(data)
        return item
The database settings in settings.py:
monodb_host = "127.0.0.1"
monodb_port = 27017
monodb_db_name = "scrapy_test"
monodb_tb_name = "douban_movie"
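For process_item to be called at all, the pipeline also has to be enabled in settings.py. This registration is not shown in the original post, but it is Scrapy's standard mechanism, analogous to DOWNLOADER_MIDDLEWARES above:

ITEM_PIPELINES = {
    'ScrapyTest.pipelines.ScrapytestPipeline': 300,
}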
The result of running main:
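The post does not include main itself. A common pattern is a small runner script that starts the spider programmatically via Scrapy's cmdline helper; the spider name douban_movie here is the hypothetical one from the spider sketch above.

# main.py - run the spider without typing the scrapy CLI command by hand
from scrapy import cmdline

cmdline.execute('scrapy crawl douban_movie'.split())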
In the MongoDB database you can see the data that was inserted:
use scrapy_test;
show collections;
db.douban_movie.find().pretty()