Problem Description
I am using https://www.realtor.com/realestateagents/phoenix_az//pg-2 as my starting point. I want to go from page 2 to page 5, and every page in between, while collecting names and numbers. I am collecting the information on page 2 perfectly, but I cannot get it to go to the next page without plugging in a new URL. I am trying to set up a loop to do this automatically; however, after coding what I thought would be a loop, I am still only getting the information from page 2 (the starting point) before the scraper stops. I am new to loops and have tried multiple approaches, but none of them work.
Below is the complete code for now.
import requests
from requests import get
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import numpy as np
from numpy import arange
import pandas as pd
from time import sleep
from random import randint

headers = {'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)'
                          'AppleWebKit/537.36 (KHTML, like Gecko)'
                          'Chrome/45.0.2454.101 Safari/537.36'),
           'referer': 'https://www.realtor.com/realestateagents/phoenix_az//pg-2'}

my_url = 'https://www.realtor.com/realestateagents/phoenix_az//pg-2'

# opening up connection, grabbing the page
uClient = uReq(my_url)
# read page
page_html = uClient.read()
# close page
uClient.close()

pages = np.arange(2, 3, 1)
for page in pages:
    page = requests.get("https://www.realtor.com/realestateagents/phoenix_az//pg-", headers=headers)

    # html parsing
    page_soup = soup(page_html, "html.parser")
    # finds all realtors on page
    containers = page_soup.findAll("div", {"class": "agent-list-card clearfix"})

    # creating csv file
    filename = "phoenix.csv"
    f = open(filename, "w")
    headers = "agent_name, agent_number\n"
    f.write(headers)

    # controlling scrape speed
    for container in containers:
        try:
            name = container.find('div', class_='agent-name text-bold')
            agent_name = name.a.text.strip()
        except AttributeError:
            print("-")
        try:
            number = container.find('div', class_='agent-phone hidden-xs hidden-xxs')
            agent_number = number.text.strip()
        except AttributeError:
            print("-")
        except NameError:
            print("-")
        try:
            print("name: " + agent_name)
            print("number: " + agent_number)
        except NameError:
            print("-")
        try:
            f.write(agent_name + "," + agent_number + "\n")
        except NameError:
            print("-")

    f.close()
Recommended Answer
Not sure if that's what you need, but here is a working (and simplified) version of your code that scrapes the first five pages.
If you take a close look, I'm using a for loop to "move" through the pages by appending the page number to the URL. Then I get the HTML, parse it for the agent divs, grab the name and number (if the number is None, I add N/A instead), and finally dump the list to a csv file.
To match the comments, I've added a city column (Pheonix) and a wait_for feature that pauses the script for anywhere between 1 and 10 seconds, adjustable.
import csv
import random
import time

import requests
from bs4 import BeautifulSoup

realtor_data = []

for page in range(1, 6):
    print(f"Scraping page {page}...")
    url = f"https://www.realtor.com/realestateagents/phoenix_az/pg-{page}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # each agent card holds the realtor's name, phone number, etc.
    for agent_card in soup.find_all("div", {"class": "agent-list-card clearfix"}):
        name = agent_card.find("div", {"class": "agent-name text-bold"}).find("a")
        number = agent_card.find("div", {"itemprop": "telephone"})
        realtor_data.append(
            [
                name.getText().strip(),
                number.getText().strip() if number is not None else "N/A",
                "Pheonix",
            ],
        )

    # controlling scrape speed: pause for 1-10 seconds between pages
    wait_for = random.randint(1, 10)
    print(f"Sleeping for {wait_for} seconds...")
    time.sleep(wait_for)

with open("data.csv", "w", newline="") as output:
    w = csv.writer(output)
    w.writerow(["NAME:", "PHONE NUMBER:", "CITY:"])
    w.writerows(realtor_data)
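One caveat: this version drops the headers dict from the original question, and some sites reject requests made with the default requests User-Agent. If that happens, you can pass browser-like headers back in. Below is a minimal sketch; the header values are simply the ones from the question, not anything realtor.com is known to require:

import requests
from bs4 import BeautifulSoup

# Browser-like headers copied from the question; assumption: sending a
# realistic user-agent makes the request less likely to be rejected.
headers = {
    "user-agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/45.0.2454.101 Safari/537.36"),
    "referer": "https://www.realtor.com/realestateagents/phoenix_az/pg-1",
}

url = "https://www.realtor.com/realestateagents/phoenix_az/pg-2"
response = requests.get(url, headers=headers)
response.raise_for_status()  # fail loudly instead of silently parsing an error page
soup = BeautifulSoup(response.text, "html.parser")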
Output:
A .csv file with each realtor's name and phone number.
NAME: PHONE NUMBER: CITY:
------------------------ --------------- -------
Shawn Rogers (480) 313-7031 Pheonix
The Jason Mitchell Group (480) 470-1993 Pheonix
Kyle Caldwell (602) 390-2245 Pheonix
THE VALENTINE GROUP N/A Pheonix
Nancy Wolfe (602) 418-1010 Pheonix
Rhonda DuBois (623) 418-2970 Pheonix
Sabrina Hurley (602) 410-1985 Pheonix
Bryan Adams (480) 375-1292 Pheonix
DeAnn Fry (623) 748-3818 Pheonix
Esther P Goh (480) 703-3836 Pheonix
...
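If you want to work with the results afterwards (the original question already imports pandas), the file can be read back into a DataFrame. A minimal sketch, assuming the data.csv written above:

import pandas as pd

# Load the scraped CSV; the column names come from the header row the scraper wrote.
df = pd.read_csv("data.csv")
print(df.head())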