Problem description
It fails with the following error when I run it from Eclipse or when I run my script in IPython:
'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128)
I don't know why, but when I simply execute the feedparser.parse(url) statement on its own with the same URL, no error is thrown. This is stumping me big time.
The code is simple:
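import logging
import feedparser

# formatExceptionInfo and formatExceptionInfo1 are my own helpers (not shown here)
try:
    d = feedparser.parse(url)
except Exception as e:
    logging.error('Error while retrieving feed.')
    logging.error(e)
    logging.error(formatExceptionInfo(None))
    logging.error(formatExceptionInfo1())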
Here is the stack trace:
    d = feedparser.parse(url)
  File "C:\Python26\lib\site-packages\feedparser.py", line 2623, in parse
    feedparser.feed(data)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "C:\Python26\lib\sgmllib.py", line 104, in feed
    self.goahead(0)
  File "C:\Python26\lib\sgmllib.py", line 143, in goahead
    k = self.parse_endtag(i)
  File "C:\Python26\lib\sgmllib.py", line 320, in parse_endtag
    self.finish_endtag(tag)
  File "C:\Python26\lib\sgmllib.py", line 360, in finish_endtag
    self.unknown_endtag(tag)
  File "C:\Python26\lib\site-packages\feedparser.py", line 476, in unknown_endtag
    method()
  File "C:\Python26\lib\site-packages\feedparser.py", line 1318, in _end_content
    value = self.popContent('content')
  File "C:\Python26\lib\site-packages\feedparser.py", line 700, in popContent
    value = self.pop(tag)
  File "C:\Python26\lib\site-packages\feedparser.py", line 641, in pop
    output = _resolveRelativeURIs(output, self.baseuri, self.encoding)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1594, in _resolveRelativeURIs
    p.feed(htmlSource)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "C:\Python26\lib\sgmllib.py", line 104, in feed
    self.goahead(0)
  File "C:\Python26\lib\sgmllib.py", line 138, in goahead
    k = self.parse_starttag(i)
  File "C:\Python26\lib\sgmllib.py", line 296, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "C:\Python26\lib\sgmllib.py", line 338, in finish_starttag
    self.unknown_starttag(tag, attrs)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1588, in unknown_starttag
    attrs = [(key, ((tag, key) in self.relative_uris) and self.resolveURI(value) or value) for key, value in attrs]
  File "C:\Python26\lib\site-packages\feedparser.py", line 1584, in resolveURI
    return _urljoin(self.baseuri, uri)
  File "C:\Python26\lib\site-packages\feedparser.py", line 286, in _urljoin
    return urlparse.urljoin(base, uri)
  File "C:\Python26\lib\urlparse.py", line 215, in urljoin
    params, query, fragment))
  File "C:\Python26\lib\urlparse.py", line 184, in urlunparse
    return urlunsplit((scheme, netloc, url, query, fragment))
  File "C:\Python26\lib\urlparse.py", line 192, in urlunsplit
    url = scheme + ':' + url
  File "C:\Python26\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
Partially solved:
This is reproducible when the URL passed to feedparser.parse() is a unicode object. It does not reproduce when the URL is a plain ASCII string. And for the record, you need a feed that contains some non-ASCII ("high") characters. I am not sure why this is.
Recommended answer
It looks like the URL that is giving you problems returns text in some encoding (such as latin-1, where 0xe2 would be "lowercase a with a circumflex on top", aka â) without a proper content-type header (it should have a charset= parameter in Content-Type: but doesn't).
If that is the case, feedparser cannot guess the encoding, tries the default (ascii), and fails.
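To see that failure mode in isolation, here is a minimal sketch. The byte string is made up for illustration (32 ASCII bytes followed by 0xe2); it is not the actual feed data, and the snippet only shows how the ascii codec reacts to such a byte, not what feedparser does internally.

```python
# Hypothetical data: 32 ASCII bytes followed by the single byte 0xe2.
raw = b"x" * 32 + b"\xe2"

# The ascii codec only accepts bytes in range(128), so 0xe2 is rejected.
try:
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# The same bytes decode fine as latin-1, where 0xe2 maps to U+00E2 (â).
text = raw.decode("latin-1")
print(repr(text[-1]))
```

The message printed by the first part has the same shape as the error in the question, which is consistent with a non-ASCII byte reaching an ascii decode step.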
Unfortunately, there is no "magic bullet" for this general problem (it is caused by bozos who break the XML rules). You could try catching this exception and, in the handler, reading the URL's contents separately (with urllib2) and trying to decode them with various plausible encodings. Once you finally get a usable unicode object this way, feed that to feedparser.parse (whose first argument can be a URL, a file stream, or a unicode string containing the data).
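The fallback described above could be sketched roughly as follows. This is an illustration under assumptions, not a tested fix for the feed in the question: the helper names decode_with_fallbacks and parse_feed_with_fallback and the list of candidate encodings are made up here, and the sketch uses Python 3's urllib.request where the Python 2.6 setup in the question would use urllib2.urlopen.

```python
import urllib.request  # on Python 2, urllib2.urlopen plays the same role


def decode_with_fallbacks(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper. The candidate list is a guess; note that latin-1
    accepts every byte, so keeping it last means the default list never fails.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def parse_feed_with_fallback(url):
    """Hypothetical wrapper: parse normally, re-fetch and decode on failure."""
    import feedparser  # third-party; imported here so the helper above stands alone

    try:
        return feedparser.parse(url)
    except UnicodeDecodeError:
        # Fetch the raw bytes ourselves and guess the encoding.
        raw = urllib.request.urlopen(url).read()
        text, _ = decode_with_fallbacks(raw)
        # Pass the decoded string instead of the URL.
        return feedparser.parse(text)
```

Ordering the candidates from strictest to most permissive matters: a permissive codec placed first would "succeed" on almost anything and could silently produce the wrong characters.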