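I've been away on a training trip lately, and in my spare time I've been playing with Python. Reading syntax books doesn't stick; I forget everything as soon as I finish, so I'd rather write small programs, which is both practical and fun.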

My environment is Ubuntu 17. Before starting, install a couple of dependency packages for fetching and parsing HTML pages.

sudo apt install python-lxml python-requests

The program crawls Douban Books (豆瓣读书) for books rated 8 or higher by at least 800 raters. It starts from a single seed: Liu Zhenyun's 《一句顶一万句》.

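from lxml import html
import requests

urlPrefix = 'https://book.douban.com/subject/'
# Queue of book IDs still to visit, starting from the seed book
candidateBookNums = ['3633461']
# IDs already fetched, so a rejected book is not downloaded again
visitedBookNums = set()

selectedBooks = {}

# Optional cap on the number of iterations
# i = 1

while candidateBookNums:

    bookNum = candidateBookNums.pop(0)
    visitedBookNums.add(bookNum)
    bookUrl = urlPrefix + str(bookNum)

    # Fetch the page
    page = requests.get(bookUrl)

    # Parse the page into an element tree
    tree = html.fromstring(page.text)

    # Book title
    bookName = tree.xpath('//title/text()')
    # Average rating
    rating_num = tree.xpath('//strong[@property="v:average"]/text()')
    # Number of raters
    rating_people = tree.xpath('//a/span[@property="v:votes"]/text()')

    # Skip pages without usable rating data (empty or blank results)
    if not bookName or not rating_people or not rating_num or not rating_num[0].strip():
        continue

    # XPath returns strings, so convert before comparing with numbers
    rating_num = float(rating_num[0])
    rating_people = int(rating_people[0])

    if rating_num < 8 or rating_people < 800:
        continue

    stars = tree.xpath('//span[@class="rating_per"]/text()')
    # Share of 5-star ratings
    stars5 = stars[0]
    # Share of 4-star ratings
    stars4 = stars[1]
    # Share of 3-star ratings
    stars3 = stars[2]
    # Share of 2-star ratings
    stars2 = stars[3]
    # Share of 1-star ratings
    stars1 = stars[4]
    # Links on the Douban Books page that point to other books
    links = tree.xpath('//div[@class="content clearfix"]/dl/dd/a/@href')

    # Strip surrounding whitespace such as newlines, spaces and indentation
    bookName = bookName[0].strip()

    # Collect the book's rating information
    book = {
        'name': bookName,
        'score': rating_num,
        'rating_people': rating_people,
        'stars5': stars5,
        'stars4': stars4,
        'stars3': stars3,
        'stars2': stars2,
        'stars1': stars1,
    }
    selectedBooks[bookNum] = book
    print('%s %s' % (bookName, book))

    for link in links:
        # The book ID is the last path segment before the trailing slash
        linkedNum = link.split('/')[-2]
        if linkedNum not in visitedBookNums and linkedNum not in candidateBookNums:
            candidateBookNums.append(linkedNum)

    # i += 1
    # if i > 100:
    #   break

print(selectedBooks)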
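
The loop above is essentially a breadth-first traversal with a filter: take an ID from the queue, check its rating, and queue the books it links to. Here is a minimal, network-free sketch of just that control flow. It assumes a hypothetical in-memory table `FAKE_PAGES` in place of real Douban pages (every ID and number in it is invented), and the helper name `crawl` and the `seen` set are my own additions, used so that each ID is looked at only once.

```python
from collections import deque

# Hypothetical stand-in for the site: book ID -> (score, raters, linked IDs).
# All values are invented for illustration.
FAKE_PAGES = {
    'a': (8.5, 1000, ['b', 'c']),
    'b': (7.9, 5000, ['a', 'c', 'd']),
    'c': (9.1, 3000, ['a', 'd']),
    'd': (8.0, 800, ['a', 'b', 'e']),
    'e': (9.5, 100, []),
}


def crawl(seed, pages, min_score=8, min_raters=800):
    """Return (selected IDs in visit order, number of page lookups)."""
    queue = deque([seed])
    seen = {seed}          # every ID that has ever been queued
    selected = []
    fetches = 0
    while queue:
        book_id = queue.popleft()
        score, raters, links = pages[book_id]
        fetches += 1
        # Rejected books are dropped and their links are not followed
        if score < min_score or raters < min_raters:
            continue
        selected.append(book_id)
        for linked in links:
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return selected, fetches


selected, fetches = crawl('a', FAKE_PAGES)
print(selected, fetches)
```

Because `seen` also remembers rejected IDs, a book that fails the filter is never looked up a second time, however many other pages link to it.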

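OK, that completes a simple program that scrapes books meeting the criteria from Douban. The implementation itself is actually secondary; the main work is finding where the relevant information sits in the Douban Books page source and extracting it.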
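
To make the "locate it, then extract it" step concrete, here is a small sketch that runs on a hand-written fragment instead of a live page. The fragment only imitates the elements the XPath expressions target (its values are invented), and it uses the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset, rather than lxml. That is an assumption made so the example runs without extra packages; a real page would still need an HTML parser such as lxml. The helper name `extract_rating` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed stand-in for the rating area of a book page.
# Tag and attribute names mirror the XPath expressions used in the program;
# the numbers are invented.
SAMPLE = """
<div>
  <strong property="v:average"> 8.6 </strong>
  <a href="#"><span property="v:votes">12345</span></a>
  <span class="rating_per">40.0%</span>
  <span class="rating_per">35.0%</span>
  <span class="rating_per">20.0%</span>
  <span class="rating_per">4.0%</span>
  <span class="rating_per">1.0%</span>
</div>
"""


def extract_rating(fragment):
    """Return (average score, number of raters, star shares from 5 to 1)."""
    root = ET.fromstring(fragment.strip())
    average = root.find(".//strong[@property='v:average']").text
    votes = root.find(".//a/span[@property='v:votes']").text
    shares = [s.text for s in root.findall(".//span[@class='rating_per']")]
    # The extracted text is a string; convert it before any numeric comparison
    return float(average), int(votes), shares


score, raters, shares = extract_rating(SAMPLE)
print(score, raters, shares)
```

In practice I find these attribute names by opening the page source in the browser and searching for the value I can see on screen, then writing the shortest path expression that matches only that element.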

12-23 16:50