Problem description
I am trying to scrape a phone number from this website using Selenium. I found that the element's class is "tel ttel", but when I try to read it with find_element_by_xpath, I get an empty string.
My code:
from selenium import webdriver

# chrome_path is the path to the chromedriver executable, defined elsewhere
wd = webdriver.Chrome(chrome_path)
url = 'https://www.justdial.com/Bangalore/Spardha-Mithra-IAS-KAS-Coaching-Centre-Opposite-Maruthi-Medicals-Vijayanagar/080PXX80-XX80-140120184741-R6P8_BZDET?xid=QmFuZ2Fsb3JlIEJhbmsgRXhhbSBUdXRvcmlhbHM='
wd.get(url)
# .text comes back empty for this element
phone = wd.find_element_by_xpath('//a[@class="tel ttel"]').text
print(phone)
Output:
''
The phone number is located over here:
The Inspect element for the phone number is:
You don't need Selenium. The rules that give the ::before pseudo-elements their content values are carried in the page's CSS style block:
Here, the 2/3 letter strings after the .icon- prefix (e.g. acb) map to the span elements which house your ::before content. The values after 9d0 are 1 greater than the actual value shown. You can create a dictionary from these pairs of values (with the adjustment) and use it to decode the digit at each ::before from the span's class value.
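The mapping described above can be sketched offline on a made-up style block. Everything in sample_css, the class suffixes (acb, dc, fe, ji) and the list of span classes is invented for illustration and is not copied from the actual page; the sketch only assumes the rules have the shape described in this answer. build_cipher and decode are hypothetical helper names.

```python
import re

# Hypothetical CSS shaped like the rules described above (not taken from the site)
sample_css = """
.icon-acb:before{content:"\\9d001"}
.icon-dc:before{content:"\\9d002"}
.icon-fe:before{content:"\\9d010"}
.icon-ji:before{content:"\\9d011"}
"""


def build_cipher(css_text):
    """Map each class suffix after 'icon-' to the character it displays."""
    keys = re.findall(r'-(\w+):before', css_text)
    # The digits after 9d0 are one greater than the value shown, so subtract 1
    values = [int(v) - 1 for v in re.findall(r'9d0(\d+)', css_text)]
    cipher = {}
    for key, value in zip(keys, values):
        # Following this answer's convention, the entry that decodes to 10 is '+'
        cipher[key] = '+' if value == 10 else str(value)
    return cipher


def decode(class_suffixes, cipher):
    """Join the decoded characters in document order; unknown suffixes become '?'."""
    return ''.join(cipher.get(suffix, '?') for suffix in class_suffixes)


cipher = build_cipher(sample_css)
print(cipher)
print(decode(['ji', 'fe', 'dc', 'acb'], cipher))
```

On the real page the suffix list would come from the span class attributes inside the phone number container rather than from a hard-coded list.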
Example of how 2/3 letter strings map to content:
My method is perhaps a little verbose, as I am not that familiar with Python, but the logic should be clear.
import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.justdial.com/Bangalore/Spardha-Mithra-IAS-KAS-Coaching-Centre-Opposite-Maruthi-Medicals-Vijayanagar/080PXX80-XX80-140120184741-R6P8_BZDET?xid=QmFuZ2Fsb3JlIEJhbmsgRXhhbSBUdXRvcmlhbHM='
# Send a browser-like User-Agent header with the request
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.content, 'lxml')

# Take the second <style type="text/css"> element and read the :before rules from it
cipherKey = str(soup.select('style[type="text/css"]')[1])
# Class suffixes after "icon-" and the digits that follow 9d0 in each content value
keys = re.findall(r'-(\w+):before', cipherKey)
values = [int(item) - 1 for item in re.findall(r'9d0(\d+)', cipherKey)]
cipherDict = dict(zip(keys, values))
# The key whose adjusted value is 10 stands for the '+' sign
cipherDict[list(cipherDict.keys())[list(cipherDict.values()).index(10)]] = '+'

# Collect the icon-xxx suffix of every span in the phone number container, in order
decodeElements = [item['class'][1].replace('icon-', '') for item in soup.select('.telCntct span[class*="icon"]')]
telephoneNumber = ''.join(str(cipherDict.get(i)) for i in decodeElements)
print(telephoneNumber)
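If you would rather stay with Selenium, one possible alternative (not part of the approach above) is to ask the browser for the computed content of the ::before pseudo-element through JavaScript, since .text only returns the element's own text. The sketch below relies on execute_script and window.getComputedStyle, which are standard; before_content is a hypothetical helper name. On this page the value returned would presumably be the icon-font character rather than a readable digit, so a mapping like the one above would likely still be needed.

```python
# JavaScript run in the browser: read the computed `content` of ::before
_BEFORE_CONTENT_JS = (
    "return window.getComputedStyle(arguments[0], '::before')"
    ".getPropertyValue('content');"
)


def before_content(driver, element):
    """Return the ::before content of `element` without the surrounding quotes."""
    raw = driver.execute_script(_BEFORE_CONTENT_JS, element)
    # Browsers report 'none' or 'normal' when no content is generated
    if raw is None or raw in ('none', 'normal'):
        return ''
    return raw.strip('"\'')


# Usage with a real driver (not run here), reusing the names from the question:
# element = wd.find_element_by_xpath('//a[@class="tel ttel"]')
# print(repr(before_content(wd, element)))
```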