Problem description
In Python 2.7 I can successfully convert the Unicode string "abc\udc34xyz" to UTF-8 (the result is "abc\xed\xb0\xb4xyz"). But when I pass that UTF-8 string to e.g. pango_parse_markup() or g_convert_with_fallback(), I get errors like "Invalid byte sequence in conversion input". Apparently the GTK/Pango functions detect the unpaired surrogate in the string and (correctly?) reject it.
Python 3 doesn't even allow conversion of the Unicode string to UTF-8 (error: "'utf-8' codec can't encode character '\udc34' in position 3: surrogates not allowed"), but I can run "abc\udc34xyz".encode("utf8", "replace") to get a valid UTF-8 string with the lone surrogate replaced by some other character. That's fine for me, but I need a solution for Python 2.
So the question is: in Python 2.7, how can I convert that Unicode string to UTF-8 while replacing the lone surrogate with some replacement character like U+FFFD? Preferably only standard Python functions and GTK/GLib/G... functions should be used.
By the way, iconv can convert the string to UTF-8, but it simply removes the bad character instead of replacing it with U+FFFD.
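For reference, the Python 3 behavior described above can be reproduced with a short sketch. It is Python 3 only and is not part of the original question; the variable names are illustrative. Note that when encoding, the "replace" error handler substitutes "?" rather than U+FFFD.

```python
# Python 3 only: a str may hold a lone surrogate, but strict UTF-8 encoding rejects it.
text = 'abc\udc34xyz'

try:
    text.encode('utf-8')
    strict_error = None
except UnicodeEncodeError as exc:
    # exc.reason carries the "surrogates not allowed" part of the message
    strict_error = exc.reason

# With the "replace" error handler the lone surrogate becomes "?"
replaced = text.encode('utf-8', 'replace')
print(strict_error)
print(replaced)
```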
Solution
You can do the replacements yourself before encoding:
import re
lone = re.compile(
    ur'''(?x)            # verbose expression (allows comments)
    (                    # begin group
    [\ud800-\udbff]      #   match leading surrogate
    (?![\udc00-\udfff])  #   but only if not followed by trailing surrogate
    )                    # end group
    |                    #  OR
    (                    # begin group
    (?<![\ud800-\udbff]) #   if not preceded by leading surrogate
    [\udc00-\udfff]      #   match trailing surrogate
    )                    # end group
    ''')
u = u'abc\ud834\ud82a\udfcdxyz'
print repr(u)
b = lone.sub(ur'\ufffd',u).encode('utf8')
print repr(b)
print repr(b.decode('utf8'))
Output:
u'abc\ud834\U0001abcdxyz'
'abc\xef\xbf\xbd\xf0\x9a\xaf\x8dxyz'
u'abc\ufffd\U0001abcdxyz'
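The same idea can be wrapped in a small reusable helper. The following is only a sketch, not part of the original answer: the names _LONE and to_valid_utf8 are hypothetical, and the branch on sys.maxunicode is an added assumption. On narrow builds, where strings are sequences of UTF-16 code units, it keeps properly paired surrogates as the answer's regex does. On wide builds and Python 3.3+, it treats every surrogate code point as lone, because non-BMP characters are stored there as single code points.

```python
import re
import sys

# Hypothetical helper, written to run on both Python 2.7 and Python 3.
if sys.maxunicode == 0xFFFF:
    # Narrow build: leave correctly paired surrogates alone and match only
    # a leading surrogate with no trailing one, or a trailing surrogate
    # with no leading one.
    _LONE = re.compile(
        u'[\ud800-\udbff](?![\udc00-\udfff])'
        u'|(?<![\ud800-\udbff])[\udc00-\udfff]')
else:
    # Wide build or Python 3.3+ (assumption: any surrogate code point in
    # such a string is unpaired, so all of them are replaced).
    _LONE = re.compile(u'[\ud800-\udfff]')


def to_valid_utf8(text, replacement=u'\ufffd'):
    """Return UTF-8 bytes for text, with lone surrogates replaced."""
    return _LONE.sub(replacement, text).encode('utf-8')


print(repr(to_valid_utf8(u'abc\udc34xyz')))
```

The resulting byte string contains no encoded surrogates, so it should be acceptable to strict UTF-8 consumers such as the GLib and Pango functions mentioned in the question.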