Problem description
I am working through the Django RSS reader project here.
The RSS feed will read something like "OKLAHOMA CITY (AP) — James Harden let". The feed declares encoding="UTF-8", so I believe I am passing UTF-8 to markdown in the code snippet below. The em dash is where it chokes.
I get the Django error "'ascii' codec can't encode character u'\u2014' in position 109: ordinal not in range(128)", which is a UnicodeEncodeError. In the variables being passed I see "OKLAHOMA CITY (AP) \u2014 James Harden". The line of code that fails is:
content = content.encode(parsed_feed.encoding, "xmlcharrefreplace")
I am using Markdown 2.0, Django 1.1, and Python 2.4.
What is the magic sequence of encoding and decoding that I need to do to make this work?
(In response to Prometheus' request. I agree the formatting helps)
So in views I add a smart_unicode line above the parsed_feed encoding line...
from django.utils.encoding import smart_unicode
content = smart_unicode(content, encoding='utf-8', strings_only=False, errors='strict')
content = content.encode(parsed_feed.encoding, "xmlcharrefreplace")
This pushes the problem to my models.py, where I have:
def save(self, force_insert=False, force_update=False):
    if self.excerpt:
        self.excerpt_html = markdown(self.excerpt)
    # super save after this
If I change the save method to have...
def save(self, force_insert=False, force_update=False):
    if self.excerpt:
        encoded_excerpt_html = (self.excerpt).encode('utf-8')
        self.excerpt_html = markdown(encoded_excerpt_html)
I get the error "'ascii' codec can't decode byte 0xe2 in position 141: ordinal not in range(128)" because now it reads "\xe2\x80\x94" where the em dash was.
Recommended answer
If the data that you are receiving is, in fact, encoded in UTF-8, then it should be a sequence of bytes: a Python 'str' object, in Python 2.x.
You can verify this with an assertion:
assert isinstance(content, str)
Once you know that's true, you can move on to the actual encoding. Python doesn't transcode directly (from UTF-8 to ASCII, for instance). You first need to turn your sequence of bytes into a Unicode string by decoding it:
unicode_content = content.decode('utf-8')
(If you can trust parsed_feed.encoding, then use that instead of the literal 'utf-8'. Either way, be prepared for errors.)
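As a rough sketch of what "be prepared for errors" could look like, the helper below tries the declared encoding first, then UTF-8, and only then decodes with replacement characters. The function name `decode_feed_bytes` and the fallback order are assumptions for illustration, not part of the original project. It is written in Python 3 syntax, where `bytes` plays the role of the Python 2 `str` discussed here and `str` plays the role of `unicode`.

```python
def decode_feed_bytes(raw, declared_encoding=None):
    """Decode raw feed bytes to text (hypothetical helper).

    Tries the declared encoding, then UTF-8; as a last resort decodes
    as UTF-8 and substitutes U+FFFD for undecodable bytes.
    """
    candidates = []
    if declared_encoding:
        candidates.append(declared_encoding)
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: try the next candidate.
            continue
    return raw.decode("utf-8", "replace")


raw = b"OKLAHOMA CITY (AP) \xe2\x80\x94 James Harden"
text = decode_feed_bytes(raw, "utf-8")
```

Whether a silent fallback is acceptable depends on the application; logging the failure before falling back may be the safer choice.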
You can then take that string and encode it as ASCII, replacing non-ASCII characters with their XML numeric character references:
xml_content = unicode_content.encode('ascii', 'xmlcharrefreplace')
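To make the effect of 'xmlcharrefreplace' concrete, here is a small worked example using the headline from the question. It is a sketch in Python 3 syntax (so `str` below corresponds to Python 2's `unicode`); the variable names are only illustrative.

```python
# Python 3 syntax: str here is what Python 2 calls unicode.
text = "OKLAHOMA CITY (AP) \u2014 James Harden"

# A strict ASCII encode fails on the em dash (U+2014).
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# xmlcharrefreplace swaps each non-ASCII character for a &#NNNN; reference.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

print(strict_failed)  # True
print(ascii_bytes)    # b'OKLAHOMA CITY (AP) &#8212; James Harden'
```

The em dash is code point U+2014, which is 8212 in decimal, hence the `&#8212;` in the result.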
The full method, then, would look something like this:
try:
    content = content.decode(parsed_feed.encoding).encode('ascii', 'xmlcharrefreplace')
except UnicodeDecodeError:
    # Couldn't decode the incoming string -- possibly not encoded in utf-8
    # Do something here to report the error; re-raise until that is in place
    raise
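The variable dump in the question shows "\u2014", which suggests the content may already be a Unicode string by the time it reaches this line. A possible variation, sketched below in Python 3 syntax (`bytes` standing in for Python 2 `str`, `str` for `unicode`), decodes only when it is handed bytes. The function name `to_ascii_with_char_refs` is hypothetical.

```python
def to_ascii_with_char_refs(content, encoding="utf-8"):
    """Return ASCII bytes with non-ASCII characters as XML character references.

    Accepts either encoded bytes or already-decoded text (hypothetical helper).
    """
    if isinstance(content, bytes):
        # Raises UnicodeDecodeError if the bytes do not match `encoding`.
        content = content.decode(encoding)
    return content.encode("ascii", "xmlcharrefreplace")


from_bytes = to_ascii_with_char_refs(b"(AP) \xe2\x80\x94 James Harden")
from_text = to_ascii_with_char_refs("(AP) \u2014 James Harden")
```

Both calls produce the same ASCII result, so the caller does not need to know which form the feed parser returned.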