I'm currently scraping all the pages of a specific website by presetting a variable called number_of_pages. This works until a new page is added that I don't know about. For example, the code below is for 3 pages, but the website now has 4 pages.
base_url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
number_of_pages = 3
# range() excludes its stop value, so add 1 to reach the last page
for i in range(1, number_of_pages + 1):
    url_to_scrape = base_url + str(i)
I would like to use BeautifulSoup to find all the next links on the website to scrape. The code below finds the second URL, but not the third or fourth. How do I build a list of all the pages prior to scraping them?
import requests
from bs4 import BeautifulSoup

base_url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
CrawlRequest = requests.get(base_url)
raw_html = CrawlRequest.text
linkSoupParser = BeautifulSoup(raw_html, 'html.parser')
page = linkSoupParser.find('div', {'class': 'pagination'})
for list_of_links in page.find('a', href=True, text='next'):
    nextURL = 'https://securityadvisories.paloaltonetworks.com' + list_of_links.parent['href']
    print(nextURL)
There are several ways to approach pagination. Here is one of them: keep following the "next" link until a page no longer has one.
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests
with requests.Session() as session:
    page_number = 1
    url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
    while True:
        print("Processing page: #{page_number}; url: {url}".format(page_number=page_number, url=url))
        response = session.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # check if there is next page, break if not
        next_link = soup.find("a", text="next")
        if next_link is None:
            break
        # resolve the (possibly relative) href against the current URL
        url = urljoin(url, next_link["href"])
        page_number += 1
If you execute it, you will see the following messages printed:
Processing page: #1; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=
Processing page: #2; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=2
Processing page: #3; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=3
Processing page: #4; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=4
Note that, to improve performance and persist cookies across requests, we maintain a web-scraping session with requests.Session.
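If you really do want the list of page URLs before scraping, as the question asks, the same loop can collect the URLs instead of printing them. Below is a minimal sketch, not a definitive implementation. The names collect_page_urls, fetch_html, and max_pages are made up for illustration: fetch_html is assumed to be any callable that takes a URL and returns the page's HTML, and max_pages is an assumed safety cap in case the "next" links ever loop. It uses string=, which recent BeautifulSoup releases prefer over text=.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_page_urls(start_url, fetch_html, max_pages=100):
    """Follow "next" links from start_url and return the page URLs in order.

    fetch_html: hypothetical callable, URL -> HTML text.
    max_pages: assumed safety cap on the number of pages visited.
    """
    urls = []
    seen = set()
    url = start_url
    # stop on a repeated URL or when the cap is reached
    while url not in seen and len(urls) < max_pages:
        seen.add(url)
        urls.append(url)
        soup = BeautifulSoup(fetch_html(url), 'html.parser')
        # look for an anchor whose text is exactly "next"
        next_link = soup.find('a', href=True, string='next')
        if next_link is None:
            break  # last page reached
        # resolve the (possibly relative) href against the current URL
        url = urljoin(url, next_link['href'])
    return urls
```

With requests, fetch_html could be `lambda u: session.get(u).text` inside the `with requests.Session() as session:` block. Note that this still downloads every page once to discover the links, so if you go on to scrape the same pages it is cheaper to do the scraping inside the loop.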