This article looks at how to extract three levels of content from paginated pages using Scrapy.
Problem description
I have a seed url (say DOMAIN/manufacturers.php
) with no pagination that looks like this:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div class="st-text">
<table cellspacing="6" width="600">
<tr>
<td>
<a href="manufacturer1-type-59.php"></a>
</td>
<td>
<a href="manufacturer1-type-59.php">Name 1</a>
</td>
<td>
<a href="manufacturer2-type-5.php"></a>
</td>
<td>
<a href="manufacturer2-type-5.php">Name 2</a>
</td>
</tr>
<tr>
<td>
<a href="manufacturer3-type-88.php"></a>
</td>
<td>
<a href="manufacturer3-type-88.php">Name 3</a>
</td>
<td>
<a href="manufacturer4-type-76.php"></a>
</td>
<td>
<a href="manufacturer4-type-76.php">Name 4</a>
</td>
</tr>
<tr>
<td>
<a href="manufacturer5-type-28.php"></a>
</td>
<td>
<a href="manufacturer5-type-28.php">Name 5</a>
</td>
<td>
<a href="manufacturer6-type-48.php"></a>
</td>
<td>
<a href="manufacturer6-type-48.php">Name 6</a>
</td>
</tr>
</table>
</div>
</body>
</html>
From there I would like to get all a['href']'s, for example: manufacturer1-type-59.php. Note that these links do NOT contain the DOMAIN prefix, so my guess is that I have to add it somehow, or maybe not?
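The missing-prefix question comes down to resolving relative links against a base URL. A minimal sketch with the standard library (the base URL here is assumed from the site in the question; on a live Scrapy response, response.urljoin() does the same thing):

```python
from urllib.parse import urljoin

# Assumed base; substitute your real DOMAIN
BASE_URL = 'http://www.gsmarena.com/'

# Relative hrefs as they appear in the seed page
hrefs = ['manufacturer1-type-59.php', 'manufacturer2-type-5.php']

# urljoin resolves each relative link against the base URL
links = [urljoin(BASE_URL, href) for href in hrefs]
print(links)
```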
Optionally, I would like to keep the links both in memory
(for the very next phase) and also save them to disk
for future reference.
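The "memory plus disk" part can be as simple as keeping the list and appending it to a file; a minimal sketch (the file name is an arbitrary choice for this example):

```python
# The list itself is the in-memory copy for the next phase
links = ['manufacturer1-type-59.php', 'manufacturer2-type-5.php']

# Persist to disk for future reference
with open('level1_links.txt', 'w') as f:
    f.write('\n'.join(links))

# A later phase can reload them from disk
with open('level1_links.txt') as f:
    reloaded = f.read().splitlines()
```

In Scrapy itself, the more idiomatic route for the disk copy is to yield items and use a feed export (e.g. `scrapy crawl spider -o links.json`).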
The content of each of these links, such as manufacturer1-type-59.php
, looks like this:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div class="makers">
<ul>
<li>
<a href="manufacturer1_model1_type1.php"></a>
</li>
<li>
<a href="manufacturer1_model1_type2.php"></a>
</li>
<li>
<a href="manufacturer1_model2_type3.php"></a>
</li>
</ul>
</div>
<div class="nav-band">
<div class="nav-items">
<div class="nav-pages">
<span>Pages:</span><strong>1</strong>
<a href="manufacturer1-type-STRING-59-INT-p2.php">2</a>
<a href="manufacturer1-type-STRING-59-INT-p3.php">3</a>
<a href="manufacturer1-type-STRING-59-INT-p2.php" title="Next page">»</a>
</div>
</div>
</div>
</body>
</html>
Next, I would like to get all a['href']'s, for example manufacturer_model1_type1.php. Again, note that these links do NOT contain the domain prefix. One additional difficulty here is that these pages support pagination, so I would like to go into all of these pages too. As expected, manufacturer-type-59.php redirects to manufacturer-type-STRING-59-INT-p2.php.
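One detail worth noting about the pagination block: the "Next page" arrow repeats the page-2 href, so a naive extraction yields duplicates. A sketch of deduplicating before queueing (hrefs below are made up to match the markup in the question):

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.gsmarena.com/'  # assumed domain

# Pagination hrefs as they appear on page 1; the last entry is the
# "Next page" arrow, which duplicates the page-2 link
page_hrefs = [
    'manufacturer1-type-STRING-59-INT-p2.php',
    'manufacturer1-type-STRING-59-INT-p3.php',
    'manufacturer1-type-STRING-59-INT-p2.php',
]

queue, seen = [], set()
for href in page_hrefs:
    url = urljoin(BASE_URL, href)
    if url not in seen:
        seen.add(url)
        queue.append(url)
```

In practice Scrapy's scheduler already deduplicates requests by URL fingerprint, so yielding both links from a spider is harmless; the sketch just makes the duplication visible.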
Optionally, I would also like to keep the links both in memory
(for the very next phase) and also save them to disk
for future reference.
The third and final step should be to retrieve the content of all pages of type manufacturer_model1_type1.php, extract the title, and save the result in a file in the form (url, title).
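The final (url, title) output could be written as CSV; a minimal sketch (the rows and file name are made up for illustration):

```python
import csv

# Hypothetical results; in the spider these would be yielded as items
rows = [
    ('http://www.gsmarena.com/manufacturer1_model1_type1.php', 'Title 1'),
    ('http://www.gsmarena.com/manufacturer1_model1_type2.php', 'Title 2'),
]

with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'title'])  # header row
    writer.writerows(rows)
```

With Scrapy you would normally skip the manual file handling: yield dicts or items with url and title fields and run the crawl with `-o results.csv`.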
EDIT
This is what I have done so far, but it doesn't seem to work...
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class ArchiveItem(scrapy.Item):
    url = scrapy.Field()


class ArchiveSpider(CrawlSpider):
    name = 'gsmarena'
    allowed_domains = ['gsmarena.com']
    start_urls = ['http://www.gsmarena.com/makers.php3']

    rules = [
        Rule(LinkExtractor(allow=[r'\S+-phones-\d+\.php'])),
        Rule(LinkExtractor(allow=[r'\S+-phones-f-\d+-0-\S+\.php'])),
        Rule(LinkExtractor(allow=[r'\S+_\S+_\S+-\d+\.php']), callback='parse_archive'),
    ]

    def parse_archive(self, response):
        torrent = ArchiveItem()
        torrent['url'] = response.url
        return torrent
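When a CrawlSpider "doesn't seem to work", one quick sanity check is to test the rule regexes against known URLs before crawling. The sample URLs below are my own examples following the URL scheme in the question:

```python
import re

# The three allow= patterns from the spider, as raw strings
patterns = [
    r'\S+-phones-\d+\.php',
    r'\S+-phones-f-\d+-0-\S+\.php',
    r'\S+_\S+_\S+-\d+\.php',
]

# Illustrative URLs following the site's scheme (hypothetical examples)
samples = [
    'samsung-phones-9.php',
    'samsung-phones-f-9-0-p2.php',
    'samsung_galaxy_s5-6033.php',
]

# Each pattern should match its corresponding sample URL
matches = [bool(re.search(p, s)) for p, s in zip(patterns, samples)]
print(matches)
```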
Solution
I think you'd better use BaseSpider instead of CrawlSpider.
This code might help:
from scrapy import Spider, Request


class GsmArenaSpider(Spider):
    name = 'gsmarena'
    start_urls = ['http://www.gsmarena.com/makers.php3']
    allowed_domains = ['gsmarena.com']
    BASE_URL = 'http://www.gsmarena.com/'

    def parse(self, response):
        markers = response.xpath('//div[@id="mid-col"]/div/table/tr/td/a/@href').extract()
        for marker in markers:
            yield Request(url=self.BASE_URL + marker, callback=self.parse_marker)

    def parse_marker(self, response):
        # extracting phone urls
        phones = response.xpath('//div[@class="makers"]/ul/li/a/@href').extract()
        if not phones:
            return
        for phone in phones:
            # change callback function name as parse_events for first crawl
            yield Request(url=self.BASE_URL + phone, callback=self.parse_phone)
        # pagination
        next_page = response.xpath('//a[contains(@title, "Next page")]/@href').extract()
        if next_page:
            yield Request(url=self.BASE_URL + next_page[0], callback=self.parse_marker)

    def parse_phone(self, response):
        # extract whatever stuffs you want and yield items here
        pass
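The parse_phone body mainly needs to pull the page title, i.e. response.xpath('//title/text()').extract_first() in Scrapy. The same extraction can be sketched outside Scrapy with the stdlib HTML parser (the sample HTML below is made up):

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Hypothetical page content for a phone page
html = '<html><head><title>Model 1 Spec</title></head><body></body></html>'
parser = TitleExtractor()
parser.feed(html)
result = ('http://www.gsmarena.com/manufacturer1_model1_type1.php', parser.title)
```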
EDIT
If you want to keep track of where these phone URLs are coming from, you can pass the URL as meta from parse to parse_phone through parse_marker. The requests will then look like:

# in parse:
yield Request(url=self.BASE_URL + marker, callback=self.parse_marker,
              meta={'url_level1': response.url})

# in parse_marker:
yield Request(url=self.BASE_URL + phone, callback=self.parse_phone,
              meta={'url_level2': response.url,
                    'url_level1': response.meta['url_level1']})
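In parse_phone both ancestor URLs are then available on response.meta; a sketch of the final item this enables (meta is simulated as a plain dict here since there is no live response, and the field names are my own choice):

```python
# Simulated response.meta at the parse_phone stage
meta = {
    'url_level1': 'http://www.gsmarena.com/makers.php3',
    'url_level2': 'http://www.gsmarena.com/manufacturer1-type-59.php',
}

# Item carrying the url, the title, and the provenance of the page
item = {
    'url': 'http://www.gsmarena.com/manufacturer1_model1_type1.php',
    'title': 'Some Title',
    'from_marker': meta['url_level2'],
    'from_seed': meta['url_level1'],
}
```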