python - Malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04

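I just installed python, mplayer, beautifulsoup, and sipie to run Sirius on my Ubuntu 10.04 machine. I followed some seemingly straightforward documentation but ran into problems. I'm not very familiar with Python, so this may be beyond me.

I was able to install everything, but running sipie produces this: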
/usr/bin/Sipie/Sipie/Config.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
Traceback (most recent call last):
  File "/usr/bin/Sipie/sipie.py", line 22, in <module>
    Sipie.cliPlayer()
  File "/usr/bin/Sipie/Sipie/cliPlayer.py", line 74, in cliPlayer
    completer = Completer(sipie.getStreams())
  File "/usr/bin/Sipie/Sipie/Factory.py", line 374, in getStreams
    streams = self.tryGetStreams()
  File "/usr/bin/Sipie/Sipie/Factory.py", line 298, in tryGetStreams
    soup = BeautifulSoup(data)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 100, column 3
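I looked through these files and line numbers, but since I'm not familiar with Python, they didn't mean much to me. Any suggestions on what to do next?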

Best answer

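The problem you're seeing is fairly common, and it comes specifically from malformed HTML. In my case, one HTML element had its attribute value double-quoted twice. I actually ran into this issue today, which is how I came across your post. I was finally able to resolve it by parsing the HTML with html5lib before handing it off to BeautifulSoup 4.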
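Note that the position at the end of the traceback ("at line 100, column 3") refers to the HTML being parsed, not to any of the Python files listed above it. To see which markup trips the parser, it can help to print the lines around that position. The following is a minimal sketch, assuming Python 3 and that you have saved the HTML into a string; the helper name `show_error_context` and the sample markup are hypothetical and are not part of Sipie or BeautifulSoup. It relies on the convention of `HTMLParser.getpos()`, where the line number is 1-based and the column offset is 0-based.

```python
def show_error_context(markup, lineno, offset, radius=2):
    """Return the lines around (lineno, offset) with a caret under the column.

    lineno is 1-based and offset is 0-based, matching HTMLParser.getpos().
    """
    lines = markup.splitlines()
    start = max(lineno - 1 - radius, 0)
    end = min(lineno + radius, len(lines))
    out = []
    for index in range(start, end):
        # "%4d: " is a 6-character prefix, so the caret is shifted by 6.
        out.append("%4d: %s" % (index + 1, lines[index]))
        if index + 1 == lineno:
            out.append(" " * (6 + offset) + "^")
    return "\n".join(out)


# Hypothetical sample with a doubled-quote attribute on line 3, column 2.
sample = '<html>\n<body>\n  <a href=""x"">link</a>\n</body>\n</html>'
print(show_error_context(sample, 3, 2, radius=1))
```

For the error in the question, the call would be `show_error_context(data, 100, 3)`, where `data` stands for whatever string was passed to `BeautifulSoup(data)`.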

First, you'll need to install:

sudo easy_install bs4
sudo apt-get install python-html5lib

Then, run the following example code:

from bs4 import BeautifulSoup
import html5lib
from html5lib import sanitizer
from html5lib import treebuilders
import urllib

url = 'http://the-url-to-scrape'
fp = urllib.urlopen(url)

# Create an html5lib parser. Not sure if the sanitizer is required.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer)
# Load the source file's HTML into html5lib
html5lib_object = parser.parse(fp)
# In theory we shouldn't need to convert this to a string before passing to BS. Didn't work passing directly to BS for me however.
html_string = str(html5lib_object)

# Load the string into BeautifulSoup for parsing.
soup = BeautifulSoup(html_string)

for content in soup.findAll('div'):
    print content
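On a current Beautiful Soup 4 under Python 3, the intermediate html5lib objects may not be needed, because `BeautifulSoup` accepts `"html5lib"` as a parser name. The sketch below is one way to write that, not the method used above: the sample markup and the `make_soup` helper are hypothetical, and the fallback to the standard library's `html.parser` when html5lib is not installed is my own assumption about what is convenient.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample: an attribute value with doubled quotes.
MALFORMED_HTML = (
    '<html><body><div id=""main"">one</div><div>two</div></body></html>'
)


def make_soup(markup):
    """Parse markup with html5lib if available, else the stdlib parser."""
    try:
        return BeautifulSoup(markup, "html5lib")
    except FeatureNotFound:
        # html5lib is not installed; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(MALFORMED_HTML)
div_texts = [div.get_text() for div in soup.find_all("div")]
print(div_texts)
```

To scrape a live page, fetch the HTML first (for example with `urllib.request.urlopen` in Python 3) and pass the response body to `make_soup`.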

If you have any questions about this code or need more specific guidance, let me know. :)

Regarding python - Malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/3198874/