This article covers how to scrape the content of ::before pseudo-elements from a website using Selenium with Python. It may be a useful reference for anyone facing the same problem.

Problem description

I am trying to scrape a phone number from this website using Selenium. I found that the element's class is "tel ttel", but when I try to read it with find_element_by_xpath, I get an empty string.

My code:

from selenium import webdriver

# chrome_path is the path to the chromedriver executable
wd = webdriver.Chrome(chrome_path)
url = 'https://www.justdial.com/Bangalore/Spardha-Mithra-IAS-KAS-Coaching-Centre-Opposite-Maruthi-Medicals-Vijayanagar/080PXX80-XX80-140120184741-R6P8_BZDET?xid=QmFuZ2Fsb3JlIEJhbmsgRXhhbSBUdXRvcmlhbHM='
wd.get(url)
# .text comes back empty for this element
phone = wd.find_element_by_xpath('//a[@class="tel ttel"]').text
print(phone)

Output:

''

The phone number is located over here:

The Inspect element for the phone number is:

Solution

You don't need Selenium. The content values that the ::before pseudo-elements display are defined in the CSS style rules embedded in the page.

In those rules, the two- or three-letter string after .icon- (e.g. acb) identifies the span elements that carry the ::before content. The number after 9d0 in each rule is one greater than the value actually displayed. You can build a dictionary from these pairs (subtracting 1 from each value) and use it to decode the digit behind each span from the span's class value.

Example of how 2/3 letter strings map to content:
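As a rough illustration of that mapping, the sketch below runs the same two regular expressions over a small, made-up set of style rules. The rule text in sample_css, the class suffixes, and the span_classes list are all hypothetical and only mimic the shape described above. The real page's rules will differ.

```python
import re

# Hypothetical style rules, shaped like the ones described in the answer.
sample_css = (
    '.icon-acb:before{content:"\\9d001"}'
    '.icon-yz:before{content:"\\9d002"}'
    '.icon-wx:before{content:"\\9d010"}'
    '.icon-dc:before{content:"\\9d011"}'
)


def build_cipher(css_text):
    """Map each class suffix (e.g. 'acb') to the character it displays."""
    keys = re.findall(r'-(\w+):before', css_text)
    # The digits after '9d0' are one greater than the value displayed.
    values = [int(v) - 1 for v in re.findall(r'9d0(\d+)', css_text)]
    # An adjusted value of 10 stands for '+', as in the answer's code.
    return {k: '+' if v == 10 else str(v) for k, v in zip(keys, values)}


cipher = build_cipher(sample_css)
print(cipher)

# Hypothetical span classes, in the order they would appear on the page.
span_classes = ['icon-dc', 'icon-wx', 'icon-yz', 'icon-acb']
number = ''.join(cipher[c.replace('icon-', '')] for c in span_classes)
print(number)
```

With this made-up input, acb decodes to 0, yz to 1, wx to 9, and dc to the plus sign.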

My method is perhaps a little verbose as I am not that familiar with Python but the logic should be clear.

import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.justdial.com/Bangalore/Spardha-Mithra-IAS-KAS-Coaching-Centre-Opposite-Maruthi-Medicals-Vijayanagar/080PXX80-XX80-140120184741-R6P8_BZDET?xid=QmFuZ2Fsb3JlIEJhbmsgRXhhbSBUdXRvcmlhbHM='
# Send a browser-like User-Agent with the request
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.content, 'lxml')

# The second <style type="text/css"> block is the one searched for the .icon-xxx:before rules
cipherKey = str(soup.select('style[type="text/css"]')[1])
# Class suffixes, e.g. 'acb' from '.icon-acb:before'
keys = re.findall(r'-(\w+):before', cipherKey)
# The digits after '9d0' are one greater than the value displayed
values = [int(item) - 1 for item in re.findall(r'9d0(\d+)', cipherKey)]
cipherDict = dict(zip(keys, values))
# The key whose adjusted value is 10 stands for '+'
cipherDict[list(cipherDict.keys())[list(cipherDict.values()).index(10)]] = '+'
# Second class of each icon span, with the 'icon-' prefix removed
decodeElements = [item['class'][1].replace('icon-', '') for item in soup.select('.telCntct span[class*="icon"]')]

telephoneNumber = ''.join([str(cipherDict.get(i)) for i in decodeElements])
print(telephoneNumber)
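If you would rather stay with Selenium, as in the question, one possible route is to ask the browser for the computed style of the pseudo-element, since ::before content is not part of the element's text. The helper below is a minimal sketch, not part of the original answer: before_content is a hypothetical name, and it only assumes that the driver object exposes execute_script, which Selenium WebDriver instances do. On this site the returned value is likely to be the raw icon-font character rather than a readable digit, so the class-to-digit mapping above would still be needed.

```python
def before_content(driver, element):
    """Return the computed CSS `content` of the element's ::before pseudo-element.

    `driver` is expected to be a Selenium WebDriver (anything that offers
    execute_script), and `element` a WebElement located beforehand.
    """
    script = (
        "return window.getComputedStyle(arguments[0], '::before')"
        ".getPropertyValue('content');"
    )
    return driver.execute_script(script, element)
```

For example, before_content(wd, span) could be called for each icon span found inside the phone-number link.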


07-31 06:22