Python web scraping using Selenium | Parallel execution (multi-threading)



I have a use case for which I have been unable to work out the logic. I am posting it here for recommendations from experts.

Quick context:
I have a list of 2,500 URLs. I am able to scrape them sequentially using Python and Selenium.
The run time for 1,000 URLs is approximately 1.5 hours.


What I am trying to achieve:
I am trying to reduce the run time through parallel execution. I have reviewed various posts on Stack Overflow, but I have not been able to find the missing pieces of the puzzle.


  1. I need to reuse the drivers instead of closing and reopening them for every URL. I came across a post, "Python selenium multiprocessing", that leverages threading.local(). However, if I rerun the same code, the number of drivers opened exceeds the number of threads specified.

Please note that the website requires the user to log in with a user name and password. My objective is to launch the drivers (say, 5 drivers) once and log in. I would then like to keep reusing the same drivers for all subsequent URLs without having to close them and log in again.

Also, I am new to Selenium web scraping and am just getting familiar with the basics. Multi-threading is uncharted territory for me. I would really appreciate your help here.


Sharing my code snippet below:

import threading  # needed for threading.local() below

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool

threadLocal = threading.local()

# Function to open web driver
def get_driver():
    options = Options()
    driver = webdriver.Chrome(<Location to chrome driver>, options = options)
    return driver

# Function to login to website & scrape from website
def parse_url(url):
    driver = get_driver()
    login_url = "https://..."

    # Enter user ID
    # Enter password
    # Click on Login button

    # Open web page of interest & scrape
    htmltext = driver.page_source
    htmltext1 = htmltext[0:100]
    return [url, htmltext1]

# Function for multi-threading
def main():
    urls = ["url1",
            # remaining URLs elided in the original
            ]

    pool = ThreadPool(2)
    records = pool.map(parse_url, urls)

    return records

if __name__ == "__main__":
    result = pd.DataFrame(columns=["url", "html_text"], data=main())


  1. I end up reusing my drivers
  2. I log in to the website only once and scrape multiple URLs in parallel



I believe that starting the browsers in separate processes and communicating with them via queues is a good approach (and a more scalable one). A process can easily be killed and respawned if something goes wrong. The pseudo-code might look like this:

# worker.py
def entrypoint(in_queue, out_queue):  # runs in a separate process
    crawler = Crawler()
    browser = Browser()  # init, log in, etc. (done once per process)
    while not stop:
        command = in_queue.get()
        result = crawler.handle(command, browser)
        out_queue.put(result)  # send the result back to the main process

# main.py
import worker

in_queue, out_queue = create_queues()
create_process(worker.entrypoint, args=(in_queue, out_queue))
# (commands, e.g. URLs, are put on in_queue here)
while not stop:
    result = out_queue.get()