傻姑娘家的李先生

I. Images whose address starts with https/http

1. Using Baidu as an example, we download the Baidu logo image to a local file.

2. Locate the img tag of the image

from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.baidu.com")  # open the Baidu home page
driver.implicitly_wait(10)  # set a global implicit wait of 10 seconds
element = driver.find_element(by=By.XPATH, value="//img[@id='s_lg_img']")  # locate the img element

3. Get the image address

url = element.get_attribute("src")  # read the image URL from the src attribute

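4. Save the image to a local file in binary mode

import requests

with open("./img/baidu.png", mode="wb") as f:  # the ./img directory must already exist
    f.write(requests.get(url).content)  # write the downloaded bytes to the file

5. Full code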

from selenium import webdriver
from selenium.webdriver.common.by import By
import requests

driver = webdriver.Chrome()
driver.get("https://www.baidu.com")
driver.implicitly_wait(10)
element = driver.find_element(by=By.XPATH, value="//img[@id='s_lg_img']")
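url = element.get_attribute("src")  # read the image URL from the src attribute

with open("./img/baidu.png", mode="wb") as f:  # the ./img directory must already exist
    f.write(requests.get(url).content)  # write the downloaded bytes to the file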

driver.quit() 
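
The `open("./img/baidu.png", ...)` call above raises `FileNotFoundError` if the `./img` directory does not exist yet, and the file name is hard-coded. Below is a minimal sketch of one way to handle both with the standard library only. The helper names `filename_from_url` and `save_image` are hypothetical; they are not part of Selenium or requests.

```python
import posixpath
from pathlib import Path
from urllib.parse import urlparse


def filename_from_url(url, default="image.png"):
    """Return the last path segment of url, or default if the path has none."""
    name = posixpath.basename(urlparse(url).path)
    return name or default


def save_image(content, directory, filename):
    """Write raw image bytes to directory/filename, creating the directory first."""
    target = Path(directory) / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return target
```

With the variables from the full code above, this could be called as `save_image(requests.get(url).content, "./img", filename_from_url(url))`.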

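II. Base64-encoded images

1. On a web page, a base64-encoded image has a src value that starts with data:image. There is no image address to download from, so the data has to be decoded first and then saved.

2. Copy the src value of an img (the base64-encoded one) straight from the browser and keep only the part that needs decoding, that is, everything after "base64,".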

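url_path = ''  # the full src value copied from the browser
url = url_path[22:]  # skip the 22-character prefix "data:image/png;base64,"

3. Decode the base64 text into binary data

import base64

url_b64 = base64.b64decode(url)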

4. Save to a local file

with open("./img/base64.png", mode="wb") as f:
    f.write(url_b64)

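Note that the extension of the saved file must match the original image type, which is png here.

5. Full code

import base64

url_path = ''  # the full src value copied from the browser
url = url_path[22:]  # skip the 22-character prefix "data:image/png;base64,"
url_b64 = base64.b64decode(url)
with open("./img/base64.png", mode="wb") as f:
    f.write(url_b64)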
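
The fixed slice `[22:]` only fits the prefix `data:image/png;base64,`; for another type such as `data:image/jpeg;base64,` the prefix has a different length. Below is a minimal sketch that splits on the comma instead and reads the file extension from the header, so the saved suffix follows the original type. The function name `decode_data_uri` is hypothetical, and the extension is simply taken to be the MIME subtype.

```python
import base64


def decode_data_uri(data_uri):
    """Split a base64 data URI into (file extension, decoded bytes)."""
    header, sep, payload = data_uri.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    # header looks like "data:image/png;base64"; keep the MIME type only
    mime_type = header[len("data:"):].split(";", 1)[0]
    extension = mime_type.split("/", 1)[-1]  # "image/png" -> "png"
    return extension, base64.b64decode(payload)
```

The returned extension can then be used to build the output path, for example `f"./img/base64.{extension}"`.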

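III. Images embedded in a CSS background-image

1. Using the slider-captcha background image on the Douban login page as an example, first look at how the background image is specified.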

2. First navigate to the page that shows the background image

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get("https://www.douban.com/")
iframe = driver.find_element(by=By.XPATH, value="//body/div[@id='anony-reg-new']/div[1]/div[1]/iframe[1]")
driver.switch_to.frame(iframe)
driver.find_element(by=By.XPATH, value="//*[contains(text(),'密码登录')]").click()
driver.find_element(by=By.ID, value="username").send_keys("13633989873")
driver.find_element(by=By.ID, value="password").send_keys("13633989873")
driver.find_element(by=By.XPATH, value="/html/body/div[1]/div[2]/div[1]/div[5]/a").click()

driver.switch_to.frame("tcaptcha_iframe_dy")
time.sleep(3)

3. Write a script that returns the background image address

Method 1: css = "return $('[id=slideBg]').css('background-image')"

Method 2: js = 'return document.getElementById("slideBg").style.backgroundImage'

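Printing both results shows that Method 1 returns the address with the domain included, so I personally recommend Method 1: there is no need to join the domain onto the path afterwards.

4. Slice out the URL and write the image in binary mode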

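css_01 = driver.execute_script(css)  # run the Method 1 script; the result has the form url("...")
url_path = css_01[5:-2]  # strip the leading url(" and the trailing ")
with open("./img/tu.png", mode="wb") as f:
    f.write(requests.get(url_path).content)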
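
The slice `[5:-2]` assumes the returned value is exactly `url("...")` with double quotes. Below is a minimal sketch of a more tolerant alternative that also accepts single quotes or no quotes, and that can join a relative path (as Method 2 may return) onto the page address. The name `extract_css_url` and its `page_url` parameter are hypothetical.

```python
import re
from urllib.parse import urljoin

# matches url(...), url("...") and url('...')
_CSS_URL = re.compile(r"""url\(\s*["']?(.*?)["']?\s*\)""")


def extract_css_url(value, page_url=None):
    """Return the URL inside a CSS url(...) value, made absolute if page_url is given."""
    match = _CSS_URL.search(value)
    if match is None:
        raise ValueError(f"no url(...) found in {value!r}")
    url = match.group(1)
    return urljoin(page_url, url) if page_url else url
```

In the script above this could replace the slice, for example `url_path = extract_css_url(css_01, driver.current_url)`.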

5. Full code

import time
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get("https://www.douban.com/")
iframe = driver.find_element(by=By.XPATH, value="//body/div[@id='anony-reg-new']/div[1]/div[1]/iframe[1]")
driver.switch_to.frame(iframe)
driver.find_element(by=By.XPATH, value="//*[contains(text(),'密码登录')]").click()
driver.find_element(by=By.ID, value="username").send_keys("13633989873")
driver.find_element(by=By.ID, value="password").send_keys("13633989873")
driver.find_element(by=By.XPATH, value="/html/body/div[1]/div[2]/div[1]/div[5]/a").click()
driver.switch_to.frame("tcaptcha_iframe_dy")
time.sleep(3)
css = "return $('[id=slideBg]').css('background-image')"
# js = 'return document.getElementById("slideBg").style.backgroundImage'
css_01 = driver.execute_script(css)
# js_01 = driver.execute_script(js)
url_path = css_01[5:-2]  # strip the leading url(" and the trailing ")
with open("./img/dou.png", mode="wb") as f:
    f.write(requests.get(url_path).content)
driver.quit()

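Note that you need to wait a few seconds before executing the JS, otherwise it may fail with an "is not defined" error. Run the script only after the page has finished loading.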
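
A fixed `time.sleep(3)` can be too short on a slow connection and wastes time on a fast one. Below is a minimal sketch of polling until a condition holds, using only the standard library. The helper `wait_until` is hypothetical, and the commented usage assumes the page exposes jQuery as `window.$`. Selenium's own `WebDriverWait` provides similar polling and may be the more natural choice in a real script.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Example with the driver from the code above (assumption: jQuery is available as window.$):
# wait_until(lambda: driver.execute_script("return typeof window.$ !== 'undefined'"))
```

Once `wait_until` returns, the `execute_script` call for the background image can run without the fixed sleep.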

09-17 15:15