Parsing this sample document with bs4 under Python 2.7.6:
<p>HTML allows omitting P end-tags.
<p>Like that and this.
<p>And this, too.
<p>What happened?</p>
<p>And can we <p>nest a paragraph, too?</p></p>
from bs4 import BeautifulSoup as BS
# fh is an open file handle for the sample document above
tree = BS(fh)
HTML has, for ages, allowed omitted end-tags for various element types, including P (check the schema, or a parser). However, bs4's prettify() on this document shows that it doesn't end any of those paragraphs until it sees </body>:
HTML allows omitting P end-tags.
Like that and this.
And this, too.
What happened?
And can we
nest a paragraph, too?
It's not prettify()'s fault, because traversing the tree manually I get the same structure:
HTML allows omitting P end-tags.␊␊
Like that and this.␊␊
And this, too.␊␊
What happened?
And can we
nest a paragraph, too?
Now, this would be the right result for XML (at least up to </body>, at which point it should report a WF error). But this ain't XML. What gives?
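To make the expected behaviour concrete, here is a minimal sketch of the rule in question: a new `<p>` start tag implicitly closes a `<p>` that is still open. It is not how bs4 or any of its parsers is implemented. It is a toy built on the standard library's event-based `html.parser` (Python 3 naming; the module is called `HTMLParser` in Python 2), and the names `ParagraphCloser` and `split_paragraphs` are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3; named HTMLParser in Python 2


class ParagraphCloser(HTMLParser):
    """Toy tree builder: a new <p> implicitly closes a <p> that is still open."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # text of each completed paragraph
        self._current = None   # text pieces of the open paragraph, or None

    def _close_open_p(self):
        # Finish the open paragraph, if there is one.
        if self._current is not None:
            self.paragraphs.append("".join(self._current).strip())
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._close_open_p()   # the implied </p>
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            # Explicit </p>. A stray </p> with no open paragraph is simply
            # ignored here (a simplification; real HTML parsers may differ).
            self._close_open_p()

    def handle_data(self, data):
        # Text outside any paragraph (e.g. newlines between them) is dropped.
        if self._current is not None:
            self._current.append(data)


def split_paragraphs(html):
    """Return the text of each paragraph, treating <p> as non-nesting."""
    parser = ParagraphCloser()
    parser.feed(html)
    parser.close()
    parser._close_open_p()  # close a paragraph left open at end of input
    return parser.paragraphs
```

Fed the sample document above, this yields six sibling paragraphs rather than one deeply nested chain, which is the structure I expected bs4 to produce.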
The doc at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser explains how to get BS4 to use a different parser. Apparently the default is html.parser, which the BS4 doc says is broken before Python 2.7.3, but it evidently still has the problem described above in 2.7.6.
Switching to "lxml" was unsuccessful for me, but switching to "html5lib" produces the correct result:
tree = BS(htmSource, "html5lib")