Problem Description
I want to parse some info from a website that has its data spread across several pages.
The problem is I don't know how many pages there are. There might be 2, there might be 4, or there might be just one.
How can I loop over the pages when I don't know how many there will be?
I do know the URL pattern, however, which looks something like the one in the code below.
Also, the page names are not plain numbers: the suffix is 'pe2' for page 2, 'pe4' for page 3, and so on, so I can't just loop over range(number).
This is the dummy code for the loop I am trying to fix:
import requests
from bs4 import BeautifulSoup

pages = ['', 'pe2', 'pe4', 'pe6', 'pe8']

for i in pages:
    url = "http://www.website.com/somecode/dummy?page={}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    # rest of the scraping code
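For what it's worth, the suffixes follow a regular pattern ('' for page 1, then 'pe2', 'pe4', ...), so they could be generated instead of hard-coded. A minimal sketch, assuming the pattern holds for every page; page_suffix is a made-up helper, not part of the original code:

def page_suffix(n):
    # '' for page 1, 'pe2' for page 2, 'pe4' for page 3, ...
    return '' if n == 1 else 'pe{}'.format(2 * (n - 1))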
Recommended Answer
You can use a while loop that stops running when it encounters an exception.
Code:
from bs4 import BeautifulSoup
from time import sleep
import requests

i = 0
while True:
    try:
        if i == 0:
            url = "http://www.website.com/somecode/dummy?page=pe"
        else:
            url = "http://www.website.com/somecode/dummy?page=pe{}".format(i)
        r = requests.get(url)
        # raise an exception on HTTP error codes (e.g. 404 on the first
        # missing page), since requests does not raise one by itself
        r.raise_for_status()
        soup = BeautifulSoup(r.content, 'html.parser')
        # print page url
        print(url)
        # rest of the scraping code
        # don't overload the website
        sleep(2)
        # increase page number
        i += 2
    except Exception:
        # any failure (HTTP error, connection problem, parsing error) ends the loop
        break
Output:
http://www.website.com/somecode/dummy?page=pe
http://www.website.com/somecode/dummy?page=pe2
http://www.website.com/somecode/dummy?page=pe4
http://www.website.com/somecode/dummy?page=pe6
http://www.website.com/somecode/dummy?page=pe8
...
... and so on, until an exception is raised.
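One caveat worth noting: some sites answer out-of-range page numbers with a normal 200 page that simply contains no results, in which case no exception is ever raised and the loop above never terminates. A minimal sketch of an alternative stop condition under that assumption; 'div.result' is a hypothetical selector standing in for whatever element actually holds the scraped data:

from time import sleep

import requests
from bs4 import BeautifulSoup

i = 0
while True:
    suffix = "pe" if i == 0 else "pe{}".format(i)
    url = "http://www.website.com/somecode/dummy?page={}".format(suffix)
    r = requests.get(url)
    if r.status_code != 200:
        # missing page reported via HTTP status: stop paginating
        break
    soup = BeautifulSoup(r.content, 'html.parser')
    results = soup.select("div.result")  # hypothetical selector for the items
    if not results:
        # the page exists but holds no data: also stop
        break
    # rest of the scraping code
    sleep(2)
    i += 2

Stopping on page content rather than on a broad except also avoids silently swallowing unrelated bugs in the scraping code.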