I'm trying to scrape a category on Amazon, but the links I get while scraping differ from what I see in the browser. I'm now trying to follow the next-page trail; in Scrapy (printing to a txt file) I see these links:
<span class="pagnMore">...</span>
<span class="pagnLink"><a href="/s?ie=UTF8&page=4&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011" >4</a></span>
<span class="pagnCur">5</span>
<span class="pagnLink"><a href="/s?ie=UTF8&page=6&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011" >6</a></span>
<span class="pagnMore">...</span>
<span class="pagnDisabled">20</span>
<span class="pagnRA"> <a title="Next Page"
id="pagnNextLink"
class="pagnNext"
href="/s?ie=UTF8&page=6&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011">
<span id="pagnNextString">Next Page</span>
I want to follow the pagnNextString link, but my spider doesn't even start crawling:
Rule(SgmlLinkExtractor(allow=(r"n\%3A2619533011\%",), restrict_xpaths=('//*[@id="pagnNextLink"]',)), callback="parse_items", follow=True),
If I leave the rule out (or do something similar), it works, but then it follows every link.
What am I doing wrong here?
Best Answer
Try checking only the page parameter:
Rule(SgmlLinkExtractor(allow=r"page=\d+"), callback="parse_items", follow=True),
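Since the allow argument is just a regular expression, you can sanity-check it outside of Scrapy with the stdlib re module before wiring it into a Rule (a minimal sketch; the href is copied from the HTML shown in the question):

```python
import re

# Next-page href copied from the scraped HTML in the question.
href = ("/s?ie=UTF8&page=6&rh=n%3A2619533011%2Ck%3Apet%20supplies"
        "%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute"
        "%3A2661609011")

# The pattern suggested in the answer: match any link carrying a page= parameter.
match = re.search(r"page=\d+", href)
print(match.group())  # → page=6
```

Note that SgmlLinkExtractor has since been deprecated; in current Scrapy versions, scrapy.linkextractors.LinkExtractor accepts the same allow and restrict_xpaths arguments.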