python - urllib.urlencode doesn't like unicode values: how about this workaround?

If I have an object like this:

d = {'a':1, 'en': 'hello'}

...then I can pass it to urllib.urlencode, no problem:

from urllib import urlencode  # Python 2 location of urlencode
percent_escaped = urlencode(d)
print percent_escaped

But if I try to pass an object with a value of type unicode, it's game over:

d2 = {'a':1, 'en': 'hello', 'pt': u'olá'}
percent_escaped = urlencode(d2)
print percent_escaped # This fails with a UnicodeEncodeError

So my question is about a reliable way to prepare an object to be passed to urlencode.

I came up with this function, where I simply iterate through the object and encode values of type str or unicode:

def encode_object(object):
  for k,v in object.items():
    if type(v) in (str, unicode):
      object[k] = v.encode('utf-8')
  return object

This seems to work:

d2 = {'a':1, 'en': 'hello', 'pt': u'olá'}
percent_escaped = urlencode(encode_object(d2))
print percent_escaped

That outputs a=1&en=hello&pt=ol%C3%A1, ready to pass to a POST call or whatever.

But my encode_object function looks really shaky to me. For one thing, it doesn't handle nested objects.

For another, I'm nervous about that if statement. Are there any other types I should be taking into account?

And is it good practice to compare the type() of something against the native types like this?

type(v) in (str, unicode) # not so sure about this...

Thanks!

Best answer

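You should indeed be nervous. The whole idea that you might have a mixture of bytes and text in some data structure is horrifying. It violates the fundamental principle of working with string data: decode at input time, work exclusively in unicode, encode at output time.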
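A minimal sketch of that decode / work / encode pattern, for illustration only. The helper name shout and the sample bytes are assumptions, not part of the question; the point is just that the function touches bytes only at its two edges.

```python
# Illustrative only: `shout` is a hypothetical helper, not from the question.

def shout(raw_bytes, encoding='utf-8'):
    """Upper-case a byte string by way of text."""
    text = raw_bytes.decode(encoding)   # 1. decode at input time
    text = text.upper()                 # 2. work exclusively on text
    return text.encode(encoding)        # 3. encode at output time

# b'ol\xc3\xa1' is the UTF-8 encoding of u'ol\xe1'
result = shout(b'ol\xc3\xa1')
```

Everything between step 1 and step 3 sees only text, so there is never a moment where bytes and text can get mixed.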

Update in response to comment:

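You are about to output some sort of HTTP request. This needs to be prepared as a byte string. The fact that urllib.urlencode can't properly prepare that byte string when your dict contains unicode characters with ordinal >= 128 is indeed unfortunate. If you have a mixture of byte strings and unicode strings in your dict, you need to be careful. Let's examine just what urlencode() does: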

>>> import urllib
>>> tests = ['\x80', '\xe2\x82\xac', 1, '1', u'1', u'\x80', u'\u20ac']
>>> for test in tests:
...     print repr(test), repr(urllib.urlencode({'a':test}))
...
'\x80' 'a=%80'
'\xe2\x82\xac' 'a=%E2%82%AC'
1 'a=1'
'1' 'a=1'
u'1' 'a=1'
u'\x80'
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\python27\lib\urllib.py", line 1282, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 0: ordinal not in range(128)

The last two tests demonstrate the problem with urlencode(). Now let's look at the str tests.

If you insist on having a mixture, then you should at the very least ensure that the str objects are encoded in UTF-8.

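'\x80' is suspicious: it is not the result of any_valid_unicode_string.encode('utf8').
'\xe2\x82\xac' is OK; it's the result of u'\u20ac'.encode('utf8').
'1' is OK: all ASCII characters are OK on input to urlencode(), which will percent-encode characters such as '%' where necessary.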
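One way to run that check mechanically is to attempt a decode and see whether it raises. This is a sketch under stated assumptions: the helper name is_valid_utf8 is hypothetical, and the three inputs are the str tests from the list above.

```python
# Hypothetical helper: reports whether a byte string is well-formed UTF-8.

def is_valid_utf8(data):
    """Return True if `data` (a byte string) decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

# The three str tests discussed above, in the same order.
checks = [is_valid_utf8(b) for b in (b'\x80', b'\xe2\x82\xac', b'1')]
```

A lone 0x80 byte is a continuation byte with no lead byte in front of it, which is why it can never come out of a UTF-8 encoder.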

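Here's a suggested converter function. It doesn't mutate the input dict and return it (as yours does); it returns a new dict. It forces an exception if a value is a str object but is not a valid UTF-8 string. By the way, your concern about it not handling nested objects is a bit misdirected: your code works only with dicts, and the concept of nested dicts doesn't really fly.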

def encoded_dict(in_dict):
    out_dict = {}
    for k, v in in_dict.iteritems():
        if isinstance(v, unicode):
            v = v.encode('utf8')
        elif isinstance(v, str):
            # Must be encoded in UTF-8
            v.decode('utf8')
        out_dict[k] = v
    return out_dict
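On the type() question: the function above tests with isinstance() rather than comparing type() against a tuple. A small sketch of the practical difference follows; the Label subclass is made up purely for this illustration.

```python
class Label(str):
    """Hypothetical str subclass, used only to show the difference."""

value = Label('hello')

# An exact type comparison misses subclasses: type(value) is Label, not str.
exact_match = type(value) in (str,)

# isinstance() follows the inheritance chain, so subclasses are accepted.
subclass_ok = isinstance(value, str)
```

With the type() comparison, a value like this would be silently skipped by the if statement; with isinstance() it is treated as the string it is.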

Here's the output, using the same tests in reverse order (because the nasty one is at the front this time):

>>> for test in tests[::-1]:
...     print repr(test), repr(urllib.urlencode(encoded_dict({'a':test})))
...
u'\u20ac' 'a=%E2%82%AC'
u'\x80' 'a=%C2%80'
u'1' 'a=1'
'1' 'a=1'
1 'a=1'
'\xe2\x82\xac' 'a=%E2%82%AC'
'\x80'
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 8, in encoded_dict
  File "C:\python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
>>>
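If the same idea has to run on both Python 2 and Python 3, a rough port might look like the sketch below. The name encoded_dict_portable and the text_type / bytes_type aliases are assumptions made for this example, not something from the session above.

```python
# Sketch of a version-agnostic variant; all names here are illustrative.
try:
    from urllib import urlencode            # Python 2
    text_type, bytes_type = unicode, str
except ImportError:
    from urllib.parse import urlencode      # Python 3
    text_type, bytes_type = str, bytes

def encoded_dict_portable(in_dict):
    """Return a new dict whose text values are UTF-8 encoded byte strings."""
    out_dict = {}
    for k, v in in_dict.items():
        if isinstance(v, text_type):
            v = v.encode('utf-8')
        elif isinstance(v, bytes_type):
            v.decode('utf-8')   # validation only: raises if not valid UTF-8
        out_dict[k] = v
    return out_dict

query = urlencode(encoded_dict_portable({'pt': u'ol\xe1'}))
```

As in the original function, non-string values such as integers pass through untouched and are left for urlencode() to stringify.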

Does that help?

Source: the Stack Overflow question "python - urllib.urlencode doesn't like unicode values: how about this workaround?": https://stackoverflow.com/questions/6480723/