This article covers how to deal with BeautifulSoup dropping nodes from a parsed document; it should be a useful reference for anyone hitting the same problem.

Problem description


I am using Python and BeautifulSoup to parse HTML data and extract the p tags from RSS feeds. However, some URLs cause problems because the parsed soup object does not include all of the document's nodes.

For example, I tried to parse http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm

But after comparing the parsed object with the page's source code, I noticed that all nodes after ul class="nextgen-left" are missing.

Here is how I parse the documents:

from bs4 import BeautifulSoup as bs
import cookielib
import urllib2

url = 'http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm'

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
request = urllib2.Request(url)

response = opener.open(request)

soup = bs(response, 'lxml')
print soup
Solution

The input HTML is not quite conformant, so you'll have to use a different parser here. The html5lib parser handles this page correctly:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm')
>>> soup = BeautifulSoup(r.text, 'lxml')
>>> soup.find('div', id='story-body') is not None
False
>>> soup = BeautifulSoup(r.text, 'html5lib')
>>> soup.find('div', id='story-body') is not None
True
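As a defensive pattern (my own sketch, not part of the original answer), you can try the faster lxml parser first and fall back to html5lib whenever a node you rely on is missing from the tree. The required_id parameter and the sample markup below are hypothetical, chosen only to illustrate the idea:

```python
from bs4 import BeautifulSoup

def parse_with_fallback(html, required_id):
    """Parse with lxml first; if the node we rely on is missing
    (a sign the parser dropped part of the document), re-parse
    with the slower but more forgiving html5lib."""
    soup = BeautifulSoup(html, 'lxml')
    if soup.find(id=required_id) is None:
        soup = BeautifulSoup(html, 'html5lib')
    return soup

# With the tree complete, the original goal -- collecting p tags --
# is a plain find_all over the relevant container:
html = '<div id="story-body"><p>First.</p><p>Second.</p></div>'
soup = parse_with_fallback(html, 'story-body')
paragraphs = [p.get_text() for p in soup.find('div', id='story-body').find_all('p')]
print(paragraphs)  # -> ['First.', 'Second.']
```

Note that html5lib is noticeably slower than lxml, so re-parsing only when a required node is absent keeps the common case fast.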
