Problem description
I am writing a Python (Python 3.3) program that sends some data to a web page using the POST method. Mostly for debugging, I fetch the page result and display it on the screen using the print() function.
The code looks like this:
conn.request("POST", resource, params, headers)
response = conn.getresponse()
print(response.status, response.reason)
data = response.read()
print(data.decode('utf-8'))
The HTTPResponse .read() method returns a bytes object encoding the page (which is a well-formed UTF-8 document). It seemed okay until I stopped using the IDLE GUI for Windows and used the Windows console instead. The returned page has a U+2014 character (em-dash), which the print function handles well in the Windows GUI (I presume code page 1252) but not in the Windows console (code page 850). Given the default strict error handling, I get the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 10248: character maps to <undefined>
I could fix it using this rather ugly code:
print(data.decode('utf-8').encode('cp850','replace').decode('cp850'))
Now it replaces the offending character "—" with a ?. Not ideal (a hyphen would be a better replacement), but good enough for my purpose.
There are several things I do not like about my solution.
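- The whole decode, encode, decode chain is ugly.
- It solves the problem only for this one case. If I port the program to a system that uses another encoding (latin-1, cp437, back to cp1252, etc.), the code should recognize the target encoding. It does not. (For example, when I go back to the IDLE GUI, the em-dash is now lost as well, which did not happen before.)
- It would be nicer if the em-dash were translated to a hyphen instead of a question mark.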
The problem is not the em-dash (I can think of several ways to solve that particular problem); I need to write robust code. I am feeding the page with data from a database, and that data can come back. I can anticipate many other conflicting cases: an 'Á' U+00C1 (which is possible in my database) can be encoded in CP-850 (the DOS/Windows console encoding for Western European languages) but not in CP-437 (the encoding for US English, which is the default in many Windows installations).
So, the question:
Is there a nicer solution that makes my code agnostic of the output interface encoding?
Recommended answer
I see three solutions to this:
Change the output encoding so that it always outputs UTF-8. See e.g. "Setting the correct encoding when piping stdout in Python", but I could not get these examples to work.
The following example code (Python 2 syntax) makes the output aware of your target charset.
# -*- coding: utf-8 -*-
import sys
print sys.stdout.encoding
print u"Stöcker".encode(sys.stdout.encoding, errors='replace')
print u"Стоескер".encode(sys.stdout.encoding, errors='replace')
This example properly replaces any non-printable character in my name with a question mark.
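The question also asks for a hyphen instead of a question mark where that makes sense. One possible way to get that in Python 3 is a custom codec error handler registered with codecs.register_error. The sketch below is only an illustration: the handler name fallback_replace and the FALLBACKS table are made up for this example and are not part of the original answer.

```python
import codecs

# Hypothetical table of "nicer" ASCII substitutes; extend it as needed.
FALLBACKS = {
    "\u2014": "-",   # em-dash
    "\u2013": "-",   # en-dash
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
}

def fallback_replace(error):
    """Codec error handler: use a table entry if there is one, else '?'."""
    if not isinstance(error, UnicodeEncodeError):
        raise error  # this sketch only handles encoding errors
    bad = error.object[error.start:error.end]
    replacement = "".join(FALLBACKS.get(ch, "?") for ch in bad)
    # Return the substitute text and the position where encoding resumes.
    return replacement, error.end

# Make the handler available under the (hypothetical) name "fallback_replace".
codecs.register_error("fallback_replace", fallback_replace)

encoded = "before\u2014after".encode("cp850", errors="fallback_replace")
```

Once registered, the handler name can be passed anywhere an errors argument is accepted, for example in str.encode or when constructing a text stream.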
If you create a custom print function, e.g. called myprint, that uses this mechanism to encode output properly, you can simply replace print with myprint wherever necessary without making the whole code look ugly.
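A minimal Python 3 sketch of such a myprint helper could look like this. The signature, the "ascii" fallback when a stream reports no encoding, and the default errors="replace" are assumptions made for illustration; the answer itself only names the function.

```python
import sys

def myprint(*args, sep=" ", end="\n", file=None, errors="replace"):
    """Print like print(), but substitute characters the target cannot encode."""
    stream = sys.stdout if file is None else file
    # Some streams (e.g. redirected ones) may report no encoding at all;
    # falling back to ASCII is an assumption of this sketch.
    encoding = getattr(stream, "encoding", None) or "ascii"
    text = sep.join(str(arg) for arg in args) + end
    # Round-trip through the target encoding so that unencodable characters
    # are replaced instead of raising UnicodeEncodeError.
    safe_text = text.encode(encoding, errors=errors).decode(encoding)
    stream.write(safe_text)
```

With this in place, the line from the question would become myprint(data.decode('utf-8')), and the target encoding is read from the stream instead of being hard-coded.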
Reset the output encoding globally at the beginning of the program:
The page http://www.macfreek.nl/memory/Encoding_of_Python_stdout has a good summary of how to change the output encoding. The section "StreamWriter Wrapper around Stdout" is especially interesting. Essentially, it says to change the I/O encoding function like this:
In Python 2:
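import codecs
import sys

# Wrap the streams only if they are not already using the desired encoding.
if sys.stdout.encoding != 'cp850':
    sys.stdout = codecs.getwriter('cp850')(sys.stdout, 'strict')
if sys.stderr.encoding != 'cp850':
    sys.stderr = codecs.getwriter('cp850')(sys.stderr, 'strict')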
In Python 3:
import codecs
import sys

# In Python 3 the writer must wrap the underlying byte stream (.buffer).
if sys.stdout.encoding != 'cp850':
    sys.stdout = codecs.getwriter('cp850')(sys.stdout.buffer, 'strict')
if sys.stderr.encoding != 'cp850':
    sys.stderr = codecs.getwriter('cp850')(sys.stderr.buffer, 'strict')
If this is used in a CGI script that outputs HTML, you can replace 'strict' with 'xmlcharrefreplace' to get numeric character references (such as &#8212;) for characters that cannot be encoded.
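The snippets above hard-code 'cp850'. Since the question asks for code that does not depend on the output encoding, a variant that keeps whatever encoding the stream already has and only relaxes the error policy may be closer to the goal. The sketch below assumes Python 3 and a stream that is an io.TextIOWrapper (as sys.stdout normally is); the helper name with_error_policy is hypothetical. TextIOWrapper.reconfigure is available from Python 3.7; on older versions the sketch falls back to re-wrapping the byte buffer.

```python
import io
import sys

def with_error_policy(stream, errors="replace"):
    """Return a text stream with the same encoding but a lenient error policy."""
    if hasattr(stream, "reconfigure"):
        # Python 3.7+: change the error handler of the existing wrapper in place.
        stream.reconfigure(errors=errors)
        return stream
    # Older Python 3: wrap the underlying byte buffer again.
    return io.TextIOWrapper(stream.buffer, encoding=stream.encoding,
                            errors=errors, line_buffering=True)

# Typical use at program start (left as a comment so that importing this
# snippet has no side effects):
#     sys.stdout = with_error_policy(sys.stdout)
#     sys.stderr = with_error_policy(sys.stderr, "backslashreplace")
```

Passing "xmlcharrefreplace" as the errors argument gives the HTML-friendly behaviour mentioned above.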
Feel free to modify these approaches, set different encodings, and so on. Note that this still won't work for data whose encoding is not specified. So any data, input, or text must be correctly convertible to unicode:
# -*- coding: utf-8 -*-
import sys
import codecs
sys.stdout = codecs.getwriter("iso-8859-1")(sys.stdout, 'xmlcharrefreplace')
print u"Stöcker" # works
print "Stöcker".decode("utf-8") # works
print "Stöcker" # fails
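The example above is Python 2. In Python 3 the same rule applies at the point where bytes enter the program, such as the result of response.read() in the question: decode them with their known encoding before printing. A small sketch, where the helper name to_text and the default of UTF-8 are assumptions for illustration:

```python
def to_text(data, encoding="utf-8"):
    """Return str for either bytes (decoded with `encoding`) or other objects."""
    if isinstance(data, bytes):
        # Raises UnicodeDecodeError if the bytes are not valid in `encoding`.
        return data.decode(encoding)
    return str(data)

name_bytes = "Stöcker".encode("utf-8")   # bytes as they might arrive from a socket
name_text = to_text(name_bytes)          # now a str that can be printed safely
```

If the bytes are not in the stated encoding, decoding fails early with UnicodeDecodeError instead of producing garbled output later.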