Detecting/removing unpaired surrogate characters in Python 2 + GTK


Problem description

In Python 2.7 I can successfully convert the Unicode string u"abc\udc34xyz" to UTF-8 (the result is "abc\xed\xb0\xb4xyz"). But when I pass that UTF-8 string to e.g. pango_parse_markup() or g_convert_with_fallback(), I get errors like "Invalid byte sequence in conversion input". Apparently the GTK/Pango functions detect the unpaired surrogate in the string and (correctly?) reject it.
Python 3 doesn't even allow conversion of the Unicode string to UTF-8 (error: "'utf-8' codec can't encode character '\udc34' in position 3: surrogates not allowed"), but I can run "abc\udc34xyz".encode("utf8", "replace") to get a valid UTF-8 string with the lone surrogate replaced by some other character. That's fine for me, but I need a solution for Python 2.
So the question is: in Python 2.7, how can I convert that Unicode string to UTF-8 while replacing the lone surrogate with a replacement character such as U+FFFD? Preferably only standard Python functions and GTK/GLib/G... functions should be used.
By the way, iconv can convert the string to UTF-8, but it simply removes the bad character instead of replacing it with U+FFFD.
Solution
You can do the replacements yourself before encoding:
import re

lone = re.compile(
    ur'''(?x)            # verbose expression (allows comments)
    (                    # begin group
    [\ud800-\udbff]      #   match leading surrogate
    (?![\udc00-\udfff])  #   but only if not followed by trailing surrogate
    )                    # end group
    |                    #  OR
    (                    # begin group
    (?<![\ud800-\udbff]) #   if not preceded by leading surrogate
    [\udc00-\udfff]      #   match trailing surrogate
    )                    # end group
    ''')

u = u'abc\ud834\ud82a\udfcdxyz'
print repr(u)
b = lone.sub(ur'\ufffd',u).encode('utf8')
print repr(b)
print repr(b.decode('utf8'))
Output:
u'abc\ud834\U0001abcdxyz'
'abc\xef\xbf\xbd\xf0\x9a\xaf\x8dxyz'
u'abc\ufffd\U0001abcdxyz'
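If this is needed in more than one place, the same pattern can be wrapped in small helpers. The following is only a sketch: the names LONE_SURROGATE, has_lone_surrogate, replace_lone_surrogates and to_clean_utf8 are made up for illustration and do not come from Python, GLib or Pango. It assumes a narrow Python 2.7 build, where characters outside the BMP are stored as surrogate pairs. The pattern is written with plain u'' literals instead of ur'' so that the file also parses on Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Same pattern as above, split into two commented pieces.
# Assumption: a narrow Python 2.7 build (non-BMP characters stored as pairs).
LONE_SURROGATE = re.compile(
    u'[\ud800-\udbff](?![\udc00-\udfff])'    # leading surrogate not followed by a trailing one
    u'|'
    u'(?<![\ud800-\udbff])[\udc00-\udfff]'   # trailing surrogate not preceded by a leading one
)


def has_lone_surrogate(text):
    """Return True if the unicode string contains an unpaired surrogate."""
    return LONE_SURROGATE.search(text) is not None


def replace_lone_surrogates(text, replacement=u'\ufffd'):
    """Return text with each unpaired surrogate replaced (default: U+FFFD).

    Pass replacement=u'' to remove the lone surrogates instead.
    """
    return LONE_SURROGATE.sub(replacement, text)


def to_clean_utf8(text):
    """Replace lone surrogates with U+FFFD, then encode to UTF-8 bytes."""
    return replace_lone_surrogates(text).encode('utf-8')
```

For the string from the question, to_clean_utf8(u'abc\udc34xyz') gives 'abc\xef\xbf\xbdxyz', which no longer contains the surrogate bytes that the GLib/Pango functions rejected. Passing replacement=u'' drops the lone surrogate entirely, which matches what iconv was described as doing.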
                        