1. Task goal:
Crawl the articles on Sina Sports' AFCCL (AFC Champions League) topic page: article title, time, source, body content, comment count, and related information.
2. Target page:
http://sports.sina.com.cn/z/AFCCL/
3. Page analysis
In the browser inspector, each headline on the list page sits in a .news-item node that holds an h2 title, an a link and a .time stamp; each article page exposes #j_title, .article-a__time, .article-a__source and .article-a__content; the comment count is not part of the article HTML but is loaded asynchronously from comment5.news.sina.com.cn.
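A minimal sketch of that check, assuming the list page still uses these class names (the selectors are the same ones used in the source code in section 4):

import requests
from bs4 import BeautifulSoup

res = requests.get('http://sports.sina.com.cn/z/AFCCL/')
soup = BeautifulSoup(res.content, 'html.parser')

# Print what the first few .news-item blocks contain, to confirm the selectors
for news in soup.select('.news-item')[:3]:
    h2 = news.select('h2')
    a = news.select('a')
    t = news.select('.time')
    print(h2[0].text if h2 else None,
          a[0]['href'] if a else None,
          t[0].text if t else None)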
4. Source code:
#!/usr/bin/env python
# coding:utf-8
import json
import re

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'http://sports.sina.com.cn/z/AFCCL/'
    res = requests.get(url)
    html_doc = res.content
    soup = BeautifulSoup(html_doc, 'html.parser')
    a_list = []
    # Scrape the news time, title and link from the list page
    for news in soup.select('.news-item'):
        if len(news.select('h2')) > 0:
            h2 = news.select('h2')[0].text
            a = news.select('a')[0]['href']
            time = news.select('.time')[0].text
            # print(time, h2, a)
            a_list.append(a)
    # Scrape the body of each article
    for i in range(len(a_list)):
        url = a_list[i]
        res = requests.get(url)
        html_doc = res.content
        soup = BeautifulSoup(html_doc, 'html.parser')
        # Extract the article title, time, source, content and comment count
        title = soup.select('#j_title')
        if title:
            title = soup.select('#j_title')[0].text.strip()
            time = soup.select('.article-a__time')[0].text.strip()
            source = soup.select('.article-a__source')[0].text.strip()
            content = soup.select('.article-a__content')[0].text.strip()
            # Build the Ajax URL that returns the comments, e.g.
            # 'http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=ty&newsid=comos-fykiuaz1429964&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20&jsvar=loader_1504416797470_64712661'
            # print(url)
            pattern_id = r'(fyk\w*)\.s?html'
            # print(re.search(pattern_id, url).group(1))
            news_id = re.search(pattern_id, url).group(1)
            url = 'http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=ty&newsid=comos-' + news_id + '&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20'
            comments = requests.get(url)
            # The response body is prefixed with "var data="; remove it before parsing the JSON
            jd = json.loads(comments.text.strip('var data='))
            commentCount = jd['result']['count']['total']  # comment count
            print(time, title, source, content)
            print(commentCount)
5. Run results:
6. Summary:
Resources that arrive with the initial page request are straightforward to scrape. For resources loaded by asynchronous requests, you need to open the browser inspector, locate the request that actually carries the data, and scrape that request in a targeted way, e.g. the comments and comment count here.
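When the jsvar parameter is kept in the comment URL, the endpoint answers with a small piece of JavaScript such as "var loader_... = {...};" rather than bare JSON, so the wrapper has to be stripped before json.loads. A minimal sketch under that assumption (the newsid is the example id from the URL shown in the source code; only the result.count.total field used above is relied on):

import json
import re
import requests

def load_js_wrapped_json(text):
    # Remove a leading "var xxx =" JavaScript wrapper and a trailing ";",
    # then parse the remainder as JSON (assumed response shape).
    body = re.sub(r'^\s*var\s+[\w$]+\s*=\s*', '', text.strip())
    return json.loads(body.rstrip(';'))

api = ('http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=ty'
       '&newsid=comos-fykiuaz1429964&group=&compress=0&ie=utf-8&oe=utf-8'
       '&page=1&page_size=20&jsvar=loader_1504416797470_64712661')
jd = load_js_wrapped_json(requests.get(api).text)
print(jd['result']['count']['total'])  # comment count, same field as in section 4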