Problem description
I am using argparse to read in arguments for my Python code. One of those inputs is the title of a file (title), which can contain Unicode characters. I have been using 22少女時代22 as a test string.
I need to write the value of the input title to a file, but when I try to convert the string to UTF-8 it always throws an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position 2: ordinal not in range(128)
I have been looking around and see that I need my string to be in the form u"foo" to call .encode() on it.
When I run type() on my input from argparse I see:
<type 'str'>
I am looking to get a response of:
<type 'unicode'>
How can I get it in the right form?
Idea:
Modify argparse to take in a str but store it as a unicode string u"foo":
parser.add_argument(u'title', metavar='T', type=unicode, help='this will be unicode encoded.')
This approach is not working at all. Thoughts?
Edit 1:
Some sample code where title is 22少女時代22:
inputs = vars(parser.parse_args())
title = inputs["title"]
print type(title)
print type(u'foo')
title = title.encode('utf8') # This line throws the error
print title
It looks like your input data is in SJIS encoding (a legacy encoding for Japanese), which produces the byte 0x8f at position 2 in the bytestring:
>>> '22少女時代22'.encode('sjis')
b'22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322'
(At Python 3 prompt)
Now, to "convert the string to UTF-8", you used something like
title.encode('utf8')
The problem is that title is actually a bytestring containing the SJIS-encoded string. Due to a design flaw in Python 2, bytestrings can be directly encoded, and Python assumes the bytestring is ASCII-encoded. So what you have is conceptually equivalent to
title.decode('ascii').encode('utf8')
and of course the decode call fails.
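The hidden decode step can be reproduced in isolation. The following is a minimal sketch, written for Python 3 so that the bytes and text types are explicit; the variable names are made up for illustration, and it assumes the console delivered the title as SJIS bytes:

```python
# Bytes as the console is assumed to have delivered them (SJIS-encoded).
raw = "22少女時代22".encode("shift_jis")

try:
    # This is the step Python 2 runs implicitly before str.encode('utf8').
    raw.decode("ascii")
except UnicodeDecodeError as err:
    failure = err  # keep a reference; 'err' is cleared after the except block
    print("position:", err.start, "byte:", hex(raw[err.start]))
```

The reported position and byte should match the ones in the traceback from the question.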
You should instead explicitly decode from SJIS to a Unicode string, before encoding to UTF-8:
title.decode('sjis').encode('utf8')
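As a rough sketch of the same two-step conversion, the helper below decodes with an explicit source encoding and then re-encodes as UTF-8. The function name to_utf8 and its source_encoding parameter are hypothetical, and the sample bytes are the ones shown at the Python 3 prompt above:

```python
def to_utf8(raw, source_encoding="shift_jis"):
    """Decode bytes from source_encoding, then re-encode them as UTF-8."""
    text = raw.decode(source_encoding)  # bytes -> Unicode text
    return text.encode("utf-8")         # Unicode text -> UTF-8 bytes


# SJIS bytes for the test title, copied from the prompt output above.
raw = b"22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322"
utf8_title = to_utf8(raw)
```

Naming the source encoding explicitly is the point: nothing is left for Python to guess.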
As Mark Tolonen pointed out, you're probably typing the characters into your console, and your console uses a non-Unicode encoding.
So it turns out your sys.stdin.encoding is cp932, which is Microsoft's variant of SJIS. For this, use
title.decode('cp932').encode('utf8')
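If the decoding should happen while the arguments are parsed, one possible approach is to pass argparse a converter through its type parameter. This is only a sketch: console_text is a hypothetical helper, it assumes the argument bytes use the same encoding as sys.stdin, and the cp932 fallback is an assumption taken from this particular case:

```python
import argparse
import sys


def console_text(value):
    """Hypothetical converter: turn a command-line argument into Unicode text."""
    if isinstance(value, bytes):
        # Python 2 hands over bytes; assume they use the console encoding,
        # falling back to cp932 (an assumption specific to this question).
        return value.decode(sys.stdin.encoding or "cp932")
    return value  # Python 3 already supplies decoded text


parser = argparse.ArgumentParser()
parser.add_argument("title", metavar="T", type=console_text,
                    help="title of the file, decoded to Unicode")

# Parse an explicit argument list here instead of the real sys.argv.
args = parser.parse_args(["22少女時代22"])
```

With this in place, title.encode('utf8') operates on Unicode text and no implicit ASCII decode is involved.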
You really should set your console encoding to the standard UTF-8, but I'm not sure if that's possible on Windows. If you do, you can skip the decoding/encoding step and just write your input bytestring to the file.
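Once the title is Unicode text, an alternative to encoding by hand is to let the file object do it. The sketch below uses io.open with an explicit encoding, which is a different technique from the decode/encode calls above; the temporary output path is hypothetical:

```python
import io
import os
import tempfile

title = u"22少女時代22"  # already-decoded Unicode text

# Hypothetical output location, created only for this example.
path = os.path.join(tempfile.mkdtemp(), "title.txt")

# The file object encodes the text as UTF-8 when writing.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(title)

# Read the raw bytes back to see what actually landed on disk.
with io.open(path, "rb") as handle:
    written = handle.read()
```

The bytes on disk are then UTF-8 regardless of the console encoding.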