Python selenium multiprocessing

Question

I've written a script in python in combination with selenium to scrape the links of different posts from a landing page and then get the title of each post by following the url that leads to its inner page. Although the content I parse here is static, I used selenium to see how it works with multiprocessing.

However, my intention is to do the scraping using multiprocessing. As far as I knew, selenium doesn't support multiprocessing, but it seems I was wrong.

My question: how can I reduce the execution time using selenium when it is run with multiprocessing?

This is my attempt (it's a working one):

import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver

def get_links(link):
  # Collect the absolute URL of every question listed on the landing page.
  res = requests.get(link)
  soup = BeautifulSoup(res.text,"lxml")
  titles = [urljoin(link,items.get("href")) for items in soup.select(".summary .question-hyperlink")]
  return titles

def get_title(url):
  # A new headless Chrome instance is launched for every single URL.
  chromeOptions = webdriver.ChromeOptions()
  chromeOptions.add_argument("--headless")
  driver = webdriver.Chrome(options=chromeOptions)
  try:
    driver.get(url)
    sauce = BeautifulSoup(driver.page_source,"lxml")
    item = sauce.select_one("h1 a").text
    print(item)
  finally:
    driver.quit()  # close the browser so Chrome processes do not pile up

if __name__ == '__main__':
  url = "https://stackoverflow.com/questions/tagged/web-scraping"
  ThreadPool(5).map(get_title,get_links(url))

Recommended answer

A lot of time in your solution is spent on launching the webdriver for each URL. You can reduce this time by launching the driver only once per thread:

(... skipped for brevity ...)

import threading  # needed for threading.local(); the other imports are in the skipped part

threadLocal = threading.local()

def get_driver():
  # Reuse the driver already created in this thread, if there is one.
  driver = getattr(threadLocal, 'driver', None)
  if driver is None:
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(options=chromeOptions)
    setattr(threadLocal, 'driver', driver)
  return driver


def get_title(url):
  driver = get_driver()
  driver.get(url)
  (...)

(...)
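
To see why this helps without launching a browser, here is a minimal sketch of the same thread-local pattern. `FakeDriver` is a hypothetical stand-in for `webdriver.Chrome` that only counts how often it is created, and `visit`, `run` and `all_drivers` are illustrative names that are not part of the answer. The sketch also keeps a list of the drivers so they can be closed at the end, since a driver stored in thread-local storage is not shut down automatically.

```python
import threading
from multiprocessing.pool import ThreadPool


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome; it only counts creations."""
    created = 0
    _lock = threading.Lock()

    def __init__(self):
        with FakeDriver._lock:
            FakeDriver.created += 1
        self.current_url = None

    def get(self, url):
        self.current_url = url

    def quit(self):
        pass


thread_local = threading.local()
all_drivers = []                      # every driver created, for cleanup
all_drivers_lock = threading.Lock()


def get_driver():
    # One driver per worker thread, created on first use.
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        driver = FakeDriver()         # real script: webdriver.Chrome(options=...)
        thread_local.driver = driver
        with all_drivers_lock:
            all_drivers.append(driver)
    return driver


def visit(url):
    driver = get_driver()
    driver.get(url)
    return driver.current_url


def run(urls, workers=5):
    with ThreadPool(workers) as pool:
        visited = pool.map(visit, urls)
    for driver in all_drivers:        # close every driver once the work is done
        driver.quit()
    return visited
```

Calling `run` with 20 URLs and 5 workers creates at most 5 drivers instead of 20, which is where the saving in startup time comes from.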

On my system this reduces the time from 1m7s to just 24.895s, a reduction of about 63% (roughly 2.7 times faster). To test it yourself, download the full script.
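
As a quick check on the two timings quoted above, the relative saving and the speedup can be worked out directly. The variable names are only illustrative.

```python
before = 67.0     # 1m7s expressed in seconds
after = 24.895    # seconds with one driver per thread

reduction = (before - after) / before   # fraction of the run time saved
speedup = before / after                # how many times faster the run is

print(f"time saved: {reduction:.1%}, speedup: {speedup:.2f}x")
# prints "time saved: 62.8%, speedup: 2.69x"
```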

Note: ThreadPool uses threads, which are constrained by the Python GIL. That is fine as long as the task is mostly I/O bound. Depending on the post-processing you do with the scraped results, you may want to use a multiprocessing.Pool instead. This launches parallel processes which, as a group, are not constrained by the GIL. The rest of the code stays the same.
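
One possible way to arrange the process-based variant is sketched below. It is not taken from the answer: it uses the `initializer` argument of `multiprocessing.Pool` to create one driver per worker process, where the answer keeps its thread-local helper. `make_driver`, `init_worker`, `fetch_title` and `scrape_all` are hypothetical names, and `fetch_title` returns the page's `<title>` rather than the "h1 a" text to keep the sketch short.

```python
from multiprocessing import Pool

_driver = None  # one driver per worker process, created by init_worker()


def make_driver():
    """Build a headless Chrome driver (selenium is imported lazily)."""
    from selenium import webdriver
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def init_worker(factory=make_driver):
    """Runs once in each worker process, so each process owns one driver."""
    global _driver
    _driver = factory()


def fetch_title(url):
    _driver.get(url)
    return _driver.title  # simplification: page <title>, not the "h1 a" text


def scrape_all(urls, workers=5):
    # Each of the worker processes calls init_worker() once before taking tasks.
    with Pool(workers, initializer=init_worker) as pool:
        return pool.map(fetch_title, urls)
```

`scrape_all(get_links(url))` would be called from under the `if __name__ == '__main__':` guard, as in the original script. The sketch does not shut the browsers down when the pool exits, so a real script would still need its own cleanup step.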

08-04 18:34