Problem Description
I'm scraping data from Bodybuilding.com for a course project, and my goal is to scrape member information. I successfully scraped information for the 20 members on the 1st page. The problem occurs when I go to the 2nd page: the output for indexes 21 to 40 repeats the information from indexes 1 to 20, and I don't know why.
I thought line 28 (bolded below) would update the metrics variable and the information it stores, but it doesn't seem to change. Does this have to do with the website's structure?
I would appreciate any help, thanks.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException
import time
import json

data = {}
browser = webdriver.Chrome()
url = "https://bodyspace.bodybuilding.com/member-search"
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")

# Going through pagination
pages_remaining = True
counter = 1
index = 0

while pages_remaining:
    if counter == 60:
        pages_remaining = False
    # FETCH AGE, HEIGHT, WEIGHT, & FITNESS GOAL
    **metrics = soup.findAll("div", {"class": "bbcHeadMetrics"})**
    for x in range(0, len(metrics)):
        metrics_children = metrics[index].findChildren()
        details = soup.findAll("div", {"class": "bbcDetails"})
        individual_details = details[index].findChildren()
        if len(individual_details) > 16:
            print("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[18].text)
        else:
            print("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[15].text)
        index = index + 1
        counter = counter + 1
    try:
        # Go to page 2
        next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
        next_link.click()
        index = 0
        time.sleep(30)
    except NoSuchElementException:
        rows_remaining = False
It's necessary to update the variables html and soup after clicking through to the next page; otherwise soup keeps holding the first page's HTML.
try:
    # Go to page 2
    next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
    next_link.click()
    index = 0
    # give the JavaScript time to render the new page before re-reading it
    time.sleep(30)
    # update html and soup so the next loop iteration parses the new page
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
except NoSuchElementException:
    rows_remaining = False
I believe you have to do it this way because the URL doesn't change and the HTML is generated dynamically with JavaScript.
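For what it's worth, here is a minimal sketch (not from the original answer) that swaps the fixed time.sleep(30) for one of Selenium's explicit waits. It assumes the pagination replaces the member list in place via JavaScript, so waiting for an element from the old page to go stale signals that the new HTML is ready to parse; the class name bbcHeadMetrics is taken from the question's code.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
browser.get("https://bodyspace.bodybuilding.com/member-search")

# Remember an element from the current page so we can tell when it is replaced
old_metric = browser.find_element_by_class_name("bbcHeadMetrics")

next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
next_link.click()

# Wait (up to 30s) until the old element goes stale, i.e. the JavaScript has
# swapped in the next page's member list, then re-parse the fresh HTML
WebDriverWait(browser, 30).until(EC.staleness_of(old_metric))
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")

Whether staleness is the right signal depends on how the site actually rebuilds the list, but it avoids sleeping a fixed 30 seconds per page when the content loads faster.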