识别User Agent屏蔽一些Web爬虫防采集
from:https://jamesqi.com/%E5%8D%9A%E5%AE%A2/%E8%AF%86%E5%88%ABUser_Agent%E5%B1%8F%E8%94%BD%E4%B8%80%E4%BA%9BWeb%E7%88%AC%E8%99%AB%E9%98%B2%E9%87%87%E9%9B%86
自从做网站以来,大量自动抓取我们内容的爬虫一直是个问题,防范采集是个长期任务,这篇是我5年前的博客文章:《Apache中设置屏蔽IP地址和URL网址来禁止采集》,另外,还可以识别User Agent来辨别和屏蔽一些采集者,在Apache中设置的代码例子如下:
RewriteCond %{HTTP_USER_AGENT} ^(.*)(DTS\sAgent|Creative\sAutoUpdate|HTTrack|YisouSpider|SemrushBot)(.*)$
RewriteRule .* - [F,L]
屏蔽User Agent为空的代码:
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F]
屏蔽Referer和User Agent都为空的代码:
RewriteCond %{HTTP_REFERER} ^$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^$ [NC]
RewriteRule .* - [F]
下面把一些可以屏蔽的常见采集软件或者机器爬虫的User Agent的特征关键词列一下供参考:
- User-Agent
- DTS Agent
- HttpClient
- Owlin
- Kazehakase
- Creative AutoUpdate
- HTTrack
- YisouSpider
- baiduboxapp
- Python-urllib
- python-requests
- SemrushBot
- SearchmetricsBot
- MegaIndex
- Scrapy
- EMail Exractor
- 007ac9
ltx71
其它也可以考虑屏蔽的:
- Mail.RU_Bot:http://go.mail.ru/help/robots
- Feedly
- ZumBot
- Pcore-HTTP
- Daum
- your-server
- Mobile/12A4345d
- PhantomJS/2.1.1
- archive.org_bot
- AcooBrowser
- Go-http-client
- Jakarta Commons-HttpClient
- Apache-HttpClient
- BDCbot
- ECCP
- Nutch
- cr4nk
- MJ12bot
- MOT-MPx220
- Y!OASIS/TEST
- libwww-perl
一般不要屏蔽的主流搜索引擎特征:
- Baidu
- Yahoo
- Slurp
- yandex
- YandexBot
MSN
一些常见浏览器或者通用代码也不要轻易屏蔽:
- FireFox
- Apple
- PC
- Chrome
- Microsoft
- Android
- Windows
- Mozilla
- Safar
- Macintosh