Problem description
I have a column in a spreadsheet whose header contains non-ASCII characters, thus:
'ï»¿Campaign'
If I pop this string into the interpreter, I get:
'\xc3\xaf\xc2\xbb\xc2\xbfCampaign'
The string is one of the keys in the rows of a csv.DictReader().
When I try to populate a new dict with the value of this key:
spends['ï»¿Campaign'] = 2
I get:
KeyError: '\xc3\xaf\xc2\xbb\xc2\xbfCampaign'
If I print the keys of row, I can see that the key is '\xef\xbb\xbfCampaign'.
Obviously, then, I can just update my program to access this key thus:
spends['\xef\xbb\xbfCampaign']
But is there a "better" way of doing this in Python? Indeed, if the value of this key ever changes to contain other non-ASCII characters, what is an all-encompassing way of handling any non-ASCII characters that may arise?
Recommended answer
In general, you should decode a bytestring into Unicode text using the corresponding character encoding as soon as possible on input. Conversely, encode Unicode text into a bytestring as late as possible on output. Some APIs, such as io.open(), can do this implicitly so that your code sees only Unicode.
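As a rough illustration of "decode early, encode late": the bytes \xef\xbb\xbf in the question are the UTF-8 byte order mark (BOM), and Python's "utf-8-sig" codec decodes UTF-8 while dropping a leading BOM. The sketch below assumes the file is UTF-8; the file name and contents are made up for the example.

```python
import io
import os
import tempfile

# Hypothetical sample file: UTF-8 text that starts with a BOM, similar to
# what some spreadsheet programs export.
path = os.path.join(tempfile.mkdtemp(), "spend.csv")
with io.open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfCampaign,Spend\nSpring,2\n")

# Decode on input: io.open() returns Unicode text when given an encoding.
# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
with io.open(path, encoding="utf-8-sig") as f:
    header = f.readline().rstrip("\n").split(",")

# Encode on output: convert back to bytes only at the boundary.
encoded = header[0].encode("utf-8")
```

With the BOM removed during decoding, the first header is plain Campaign, so no escape sequences are needed in the lookup key.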
Unfortunately, the csv module does not support Unicode directly on Python 2. See UnicodeReader and UnicodeWriter in the doc examples. You could create an analog of them for csv.DictReader or, as an alternative, just pass UTF-8 encoded bytestrings to the csv module.
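One possible shape for such a csv.DictReader analog is sketched below. It is not the UnicodeReader from the docs: unicode_dict_reader and _to_text are hypothetical names, the file is assumed to be UTF-8, and the sample file is made up. On Python 2 it passes UTF-8 bytes to csv and decodes each key and value afterwards; on Python 3, where csv works on text, it simply opens the file with an encoding.

```python
import codecs
import csv
import io
import os
import sys
import tempfile


def _to_text(value):
    # Decode UTF-8 bytes; leave None (missing fields) and anything else as-is.
    return value.decode("utf-8") if isinstance(value, bytes) else value


def unicode_dict_reader(path, **kwargs):
    """Yield rows of a UTF-8 CSV file as dicts with Unicode keys and values."""
    if sys.version_info[0] >= 3:
        # Python 3: csv works on text, so decoding happens in io.open().
        with io.open(path, encoding="utf-8-sig", newline="") as f:
            for row in csv.DictReader(f, **kwargs):
                yield dict(row)
    else:
        # Python 2: csv works on bytes, so feed it UTF-8 and decode afterwards.
        with open(path, "rb") as f:
            if f.read(3) != codecs.BOM_UTF8:
                f.seek(0)  # no BOM, rewind to the start
            for row in csv.DictReader(f, **kwargs):
                yield dict((_to_text(k), _to_text(v)) for k, v in row.items())


# Hypothetical sample file that starts with a UTF-8 BOM.
path = os.path.join(tempfile.mkdtemp(), "spend.csv")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfCampaign,Spend\nSpring,2\n")

rows = list(unicode_dict_reader(path))
spends = {rows[0]["Campaign"]: int(rows[0]["Spend"])}
```

Skipping the BOM once at the start of the file, rather than stripping it from every cell, leaves any later cell content untouched.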