Scraping JavaScript-injected images in Python with Selenium

Question

I'm trying to make a web scraper in Python on Mac OS X, and an example I'm testing with is loading tags and images from a MyFonts page (e.g. here). Originally I was using BeautifulSoup, but I noticed that the site initially loads with a 'blank.png' in place of the font images I'm trying to grab, which are then replaced with the 'real' ones by JavaScript. I'm now trying Selenium: can I use a WebDriverWait to listen for the change in the img src, similar to the example below, but without relying on an ID or class?

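from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

ff = webdriver.Firefox()
ff.get("http://www.myfonts.com/fonts/fort-foundry/gin/")
try:
    # Waits up to 10 seconds for an element with this (placeholder) ID to appear
    element = WebDriverWait(ff, 10).until(EC.presence_of_element_located((By.ID, "myDynamicElement")))
finally:
    ff.quit()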

Ideally this should wait until the img src is no longer "*/blank.png", since the element doesn't change class or get a consistent name. Or should I just wait until the page finishes loading entirely? The scraper has to go through a lot of these pages, so I'm trying to keep it fairly quick.

I'm very new to Python, so any help would be greatly appreciated.

Recommended answer

I second what Alex said regarding legality, but you could also get the font images by mimicking the page's Ajax request with requests and bs4:

In [16]: import requests

In [17]: from bs4 import BeautifulSoup

In [18]: data = {
   ....:     'seed': '24',
   ....:     "text": "Pangrams",
   ....:     "src": "pangram.auto",
   ....:     "size": "72",
   ....:     "fg": "000000",
   ....:     "bg": "ffffff",
   ....:     "goodies": "_2x:0",
   ....:     "w": "720",
   ....:     "i[]": ["fort-foundry/gin/regular,,720", "fort-foundry/gin/oblique,,720", "fort-foundry/gin/rough,,720",
   ....:             "fort-foundry/gin/rough-oblique,,720", "fort-foundry/gin/round,,720","fort-foundry/gin/round-oblique,,720",
   ....:             "fort-foundry/gin/lines,,720", "fort-foundry/gin/lines-oblique,,720"],
   ....:     "showimgs": "true"}

In [19]: js = requests.post("http://www.myfonts.com/ajax-server/testdrive_new-ajax.php", data=data).json()

In [20]: urls = [img["src"] for img in BeautifulSoup("".join(js.values()),"lxml").find_all("img")]

In [21]: from pprint import pprint as pp

In [22]: pp(urls)
['//samples.myfonts.net/a_91/u/af/5e840d069d35f2c8e5f7077bae7b1e.gif',
 '//samples.myfonts.net/e_91/u/d6/1d63ad993299d182ae19eddb2c41e1.gif',
 '//samples.myfonts.net/e_92/u/7c/15b8e24e4b077ae3b1c7a614afa8b5.gif',
 '//samples.myfonts.net/b_92/u/ce/63dffdda8581fc83f6fe20874714e7.gif',
 '//samples.myfonts.net/e_91/u/51/e8b7a0b5cccb2abf530b05e1d3fb04.gif',
 '//samples.myfonts.net/b_91/u/6f/a5f870c719dcf9961e753b9f4afd7e.gif',
 '//samples.myfonts.net/b_92/u/7c/94d652e4f146801e3c81f694898e07.gif',
 '//samples.myfonts.net/b_91/u/47/39fa3ab779cabd1068abbca7ce98c5.gif']
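
The src values printed above are protocol-relative (they start with `//`), so they need a scheme before they can be fetched. A minimal sketch of one way to resolve them, using the standard library's `urljoin`; the helper name `absolutize` and the `PAGE_URL` constant are illustrative and not part of the original answer:

```python
from urllib.parse import urljoin

# Page the srcs were scraped from; used as the base for resolution
PAGE_URL = "http://www.myfonts.com/fonts/fort-foundry/gin/"


def absolutize(srcs, base=PAGE_URL):
    """Resolve protocol-relative or relative srcs against the page URL."""
    return [urljoin(base, src) for src in srcs]


print(absolutize(["//samples.myfonts.net/a_91/u/af/5e840d069d35f2c8e5f7077bae7b1e.gif"]))
# ['http://samples.myfonts.net/a_91/u/af/5e840d069d35f2c8e5f7077bae7b1e.gif']
```

The resolved URLs inherit the scheme of the base URL, so a page fetched over https would yield https image URLs.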

The only field you need to pass is i[]; the rest can be used to change the size, background colour and so on.
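
The form data can be assembled by a small helper, so the optional fields are only sent when you want to override them. This is a sketch under assumptions: `build_payload` is a hypothetical name, and treating the trailing `720` in each i[] entry as a width that mirrors the `w` field is a guess based on the values shown in the request.

```python
def build_payload(font_paths, width=720, **overrides):
    """Build the form data for the preview request.

    font_paths: strings such as "fort-foundry/gin/regular".
    overrides:  optional fields such as size="72" or bg="ffffff".
    """
    # Each i[] entry has the shape "<foundry>/<family>/<style>,,<width>"
    payload = {"i[]": ["{},,{}".format(path, width) for path in font_paths]}
    payload.update(overrides)
    return payload


print(build_payload(["fort-foundry/gin/regular"], size="72"))
# {'i[]': ['fort-foundry/gin/regular,,720'], 'size': '72'}
```

The resulting dict can be passed as `data=` to `requests.post`, which sends a list value as repeated form fields with the same key.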

So if you don't care about changing bg, fg, size and so on, and want to get all the names using just bs4 and requests, you can take the font names from the search-result-item class and construct the Ajax request from those:

In [1]: import requests

In [2]: from bs4 import BeautifulSoup

In [3]: r = requests.get("http://www.myfonts.com/fonts/fort-foundry/gin/")

In [4]: soup = BeautifulSoup(r.content, "lxml")

# builds "fort-foundry/gin/regular,,720" etc.
In [5]: fonts = ["{},,720".format(a["href"].strip("/").split("/", 1)[1])
                   for a in soup.select("div .search-result-item h4 a[href]")]

In [6]: data = {
   ...:     "i[]": fonts
   ...:      }

In [7]: js = requests.post("http://www.myfonts.com/ajax-server/testdrive_new-ajax.php", data=data).json()

In [8]: urls = [img["src"] for img in BeautifulSoup("".join(js.values()),"lxml").select("img[src]")]

In [9]: from pprint import pprint as pp

In [10]: pp(urls)
['//samples.myfonts.net/b_91/u/06/64bdafe9368dd401df4193a7608028.gif',
 '//samples.myfonts.net/b_92/u/06/b8ad49c563d310a97147d8220f55ab.gif',
 '//samples.myfonts.net/a_91/u/e7/8f84ce98f19e3f91ddc15304d636e7.gif',
 '//samples.myfonts.net/e_91/u/71/9769a1ab626429d63d3c779fcaa3d7.gif',
 '//samples.myfonts.net/b_92/u/65/fe416f15ea94b1f8603ddc675fd638.gif',
 '//samples.myfonts.net/b_91/u/5d/3ced9e71910bc411a0d76316d18df1.gif',
 '//samples.myfonts.net/e_92/u/cd/0df987a72bb0a43cba29b38c16b7a5.gif',
 '//samples.myfonts.net/e_91/u/88/3f80a1108fd0a075c69b09e9c21a8d.gif']
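
For the Selenium route the question asks about, `WebDriverWait.until` accepts any callable that takes the driver, so the wait does not have to be tied to an ID or class. The sketch below polls until no img src ends in the placeholder name. The function names are hypothetical, and it assumes every `<img>` on the page gets replaced; on a real page you would probably narrow the CSS selector to the preview images.

```python
def real_image_srcs(driver, placeholder="blank.png"):
    """Return every <img> src once none is the placeholder, else False."""
    # "css selector" is the string value behind selenium's By.CSS_SELECTOR
    images = driver.find_elements("css selector", "img[src]")
    srcs = [img.get_attribute("src") for img in images]
    if not srcs or any(src.endswith(placeholder) for src in srcs):
        return False  # WebDriverWait keeps polling while the result is falsy
    return srcs


def scrape_image_srcs(url, timeout=10):
    """Open url in Firefox and return the img srcs after the swap."""
    # Imported here so real_image_srcs can be exercised without a browser
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # until() returns the predicate's first truthy result
        return WebDriverWait(driver, timeout).until(real_image_srcs)
    finally:
        driver.quit()
```

Calling `scrape_image_srcs("http://www.myfonts.com/fonts/fort-foundry/gin/")` would raise `TimeoutException` if placeholders remain after the timeout. Starting a browser per page is slow, so for a large crawl the requests-based approach above is likely to be quicker.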


08-24 06:32