This article looks at handling links containing a hash (#) with BeautifulSoup; it may be a useful reference for anyone hitting the same problem.

Problem description

I'm using BeautifulSoup with Python. I'm trying to get elements from a link containing a hash (#). It's a pagination link; the part after the # is the page number.

It doesn't work. I understand the problem is that urllib2 can't handle this, since the part of the URL after the # is meant for client-side handling and is never sent to the server.
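This behaviour is easy to see with Python 3's `urllib.parse` (the question used Python 2's urllib2, but the split is the same idea). A minimal sketch, using a fragment-pagination URL of the kind discussed in the question:

```python
from urllib.parse import urlparse

# Hash-fragment pagination URL (hypothetical example in the style of the question).
url = "http://www.myserver.com/products#/page-2"
parts = urlparse(url)

# Only scheme, host, path and query end up in the HTTP request;
# the fragment stays on the client for the page's JavaScript to interpret.
print(parts.path)      # -> /products
print(parts.fragment)  # -> /page-2
```

Since the server never sees `/page-2`, every `driver`-less request fetches the same first page, which is why the pagination appears broken.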

So I checked the real URL using the Network tab of Chrome's developer tools, and it gave me this:

http://www.myserver.com/modules/blocklayered/blocklayered-ajax.php?_=1486617675431&id_category_layered=24&layered_weight_slider=0_10&layered_price_slider=21_2991&orderby=position&orderway=desc&n=20&p=3

It looks like the server doesn't like this URL at all, because it returns a blank page containing only this weird, truncated result: {"filtersBlock":"\n\n
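That truncated output starting with `{"filtersBlock":` suggests the endpoint is an AJAX handler returning JSON whose values are HTML fragments, which is why it looks "blank" when opened as a normal page. A hedged sketch of how such a response could be consumed (the payload below is invented for illustration; the real keys and markup may differ):

```python
import json

# Invented stand-in for a blocklayered-ajax.php response; note the
# leading \n\n matching the fragment quoted in the question.
raw = '{"filtersBlock": "\\n\\n<div class=\\"layered_filter\\">...</div>"}'

data = json.loads(raw)
fragment = data["filtersBlock"]

# The value is an HTML snippet, which could then be handed to
# BeautifulSoup for parsing instead of the full page.
print(fragment.startswith("\n\n"))  # -> True
```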

So my question is: is there a way to handle this kind of link with BeautifulSoup?

Recommended answer

I found a way to do this, using BeautifulSoup to crawl the DOM and Selenium to handle the links containing a #. Simply passing the link containing the # to the Selenium driver with driver.get("www.myserver.com/products#/page-2") works.
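A fuller sketch of that approach, assuming Chrome and the `#/page-N` fragment pattern from the answer (the helper names and base URL are hypothetical):

```python
def page_url(base, page):
    # Build the hash-fragment pagination URL (pattern from the answer).
    return f"{base}#/page-{page}"

def scrape_pages(base, last_page):
    # Imported lazily so page_url stays usable without Selenium installed.
    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome()  # assumes chromedriver is on PATH
    try:
        for page in range(1, last_page + 1):
            # The browser evaluates the fragment client-side, so the
            # page's JavaScript loads the right product page before
            # we read the rendered DOM back out.
            driver.get(page_url(base, page))
            yield BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()

# Usage (hypothetical base URL from the answer):
# for soup in scrape_pages("http://www.myserver.com/products", 3):
#     print(soup.title)
```

The division of labour is the point: Selenium's browser executes the JavaScript that reacts to the fragment, and BeautifulSoup then parses `driver.page_source` exactly as it would parse any static HTML.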

