This article explains why Scrapy's LinkExtractor ignores the parameters after the "#" sign and therefore does not follow those links, and how to work around it. It may be a useful reference for anyone hitting the same problem.

Problem description

I am trying to crawl a website with scrapy where the pagination is behind the sign "#". This somehow makes scrapy ignore everything behind that character and it will always only see the first page.

例如:

http://www.rolex.de/de/watches/find-rolex.html#g=1&p=2

If you enter a question mark manually, the site will load page 1

http://www.rolex.de/de/watches/find-rolex.html?p=2

The stats from scrapy tell me it fetched the first page:

DEBUG: Crawled (200) <GET http://www.rolex.de/de/watches/datejust/m126334-0014.html> (referer: http://www.rolex.de/de/watches/find-rolex.html)

My crawler looks like this:

start_urls = [
    'http://www.rolex.de/de/watches/find-rolex.html#g=1',
    'http://www.rolex.de/de/watches/find-rolex.html#g=0&p=2',
    'http://www.rolex.de/de/watches/find-rolex.html#g=0&p=3',
]

rules = (
    Rule(
        LinkExtractor(allow=['.*/de/watches/.*/m\d{3,}.*.\.html']),
        callback='parse_item'
    ),
    Rule(
        LinkExtractor(allow=['.*/de/watches/find-rolex(/.*)?\.html#g=1(&p=\d*)?$']),
        follow=True
    ),
)

How can I make scrapy ignore the # inside the url and visit the given URL?

Recommended answer

Scrapy performs HTTP requests. The data after '#' in a URL is not part of an HTTP request, it is used by JavaScript.
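This can be verified with Python's standard `urllib.parse`: the part after "#" is the URL *fragment*, which is split off from the rest of the URL and never sent to the server.

```python
from urllib.parse import urlsplit, urlunsplit

url = "http://www.rolex.de/de/watches/find-rolex.html#g=1&p=2"
parts = urlsplit(url)

# The fragment ("g=1&p=2") is kept separate from the path and query:
print(parts.path)      # /de/watches/find-rolex.html
print(parts.query)     # empty - there is no "?" query string here
print(parts.fragment)  # g=1&p=2

# What an HTTP client actually requests is the URL without the fragment:
request_target = urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))
print(request_target)  # http://www.rolex.de/de/watches/find-rolex.html
```

This is why all three `start_urls` above resolve to the same HTTP request, and Scrapy only ever sees the first page.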

As suggested in the comments, the site loads the data using AJAX.

Moreover, it does not use pagination in AJAX: the site downloads the whole list of watches as JSON in a single request, and then the pagination is done using JavaScript.
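As a rough sketch of what the site's JavaScript is doing client-side (the page size here is an assumption, not taken from the site):

```python
def paginate(items, page, page_size=12):
    """Return one page of a list that was delivered in full,
    the way the client-side JavaScript slices the JSON watch list."""
    start = (page - 1) * page_size
    return items[start:start + page_size]

# Stand-in for the full JSON list downloaded in a single request:
watches = [f"watch-{i}" for i in range(1, 31)]
print(paginate(watches, 1))  # first 12 entries
print(paginate(watches, 3))  # last 6 entries
```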

So you can simply use the Network tab of your browser's developer tools to find the request that fetches the JSON data, and have your spider perform a similar request instead of requesting the HTML page.

Note, however, that you cannot use LinkExtractor on JSON data. Instead, parse the response with Python's json module and iterate over the URLs it contains.
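A minimal sketch of such a callback helper, assuming the JSON payload contains a list of objects with a url field (the actual key names must be checked in the browser's Network tab, they are not taken from the site):

```python
import json

def extract_watch_urls(json_text):
    """Parse the JSON payload and collect the watch detail-page URLs.
    The 'watches' and 'url' keys are assumptions about the payload layout."""
    data = json.loads(json_text)
    return [entry["url"] for entry in data.get("watches", [])]

# Inside a Scrapy spider this would be used roughly as:
#
#   def parse_json(self, response):
#       for url in extract_watch_urls(response.text):
#           yield scrapy.Request(response.urljoin(url), callback=self.parse_item)

sample = '{"watches": [{"url": "/de/watches/datejust/m126334-0014.html"}]}'
print(extract_watch_urls(sample))  # ['/de/watches/datejust/m126334-0014.html']
```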
