Problem Description
I'm scraping data from Bodybuilding.com for a course project, and my goal is to scrape member information. I successfully scraped information for the 20 members on the 1st page. The problem occurs when I go to the 2nd page: the output for indexes 21 to 40 repeats the information from indexes 1 to 20, and I don't know why.
I thought line 28 (bolded below) would update the metrics variable and the information it stores, but it doesn't seem to change. Does this have to do with the website's structure?
I would appreciate any help, thanks.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException
import time
import json

data = {}
browser = webdriver.Chrome()
url = "https://bodyspace.bodybuilding.com/member-search"
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")

# Going through pagination
pages_remaining = True
counter = 1
index = 0

while pages_remaining:
    if counter == 60:
        pages_remaining = False
    # FETCH AGE, HEIGHT, WEIGHT, & FITNESS GOAL
    **metrics = soup.findAll("div", {"class": "bbcHeadMetrics"})**
    for x in range(0, len(metrics)):
        metrics_children = metrics[index].findChildren()
        details = soup.findAll("div", {"class": "bbcDetails"})
        individual_details = details[index].findChildren()
        if len(individual_details) > 16:
            print("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[18].text)
        else:
            print("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[15].text)
        index = index + 1
        counter = counter + 1
    try:
        # Go to page 2
        next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
        next_link.click()
        index = 0
        time.sleep(30)
    except NoSuchElementException:
        rows_remaining = False
It's necessary to update the variables html and soup after clicking through to the next page; otherwise soup keeps holding the first page's HTML.
try:
    # Go to page 2
    next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
    next_link.click()
    index = 0
    # give the JavaScript time to render the new page before re-reading it
    time.sleep(30)
    # update html and soup so the next loop iteration parses the new page
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
except NoSuchElementException:
    rows_remaining = False
I believe you have to do it this way because the URL doesn't change and the HTML is generated dynamically with JavaScript.
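For what it's worth, here is a minimal sketch (not from the original answer) that swaps the fixed time.sleep(30) for one of Selenium's explicit waits. It assumes the pagination replaces the member list in place via JavaScript, so waiting for an element from the old page to go stale signals that the new HTML is ready to parse; the class name bbcHeadMetrics is taken from the question's code.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
browser.get("https://bodyspace.bodybuilding.com/member-search")

# Remember an element from the current page so we can tell when it is replaced
old_metric = browser.find_element_by_class_name("bbcHeadMetrics")

next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
next_link.click()

# Wait (up to 30s) until the old element goes stale, i.e. the JavaScript has
# swapped in the next page's member list, then re-parse the fresh HTML
WebDriverWait(browser, 30).until(EC.staleness_of(old_metric))
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")

Whether staleness is the right signal depends on how the site actually rebuilds the list, but it avoids sleeping a fixed 30 seconds per page when the content loads faster.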