python - Convert a unicode string whose content is UTF-8 bytes to str

I'm using pyquery to parse a page:

dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()

But what I get in content is a unicode string whose content is UTF-8 encoded bytes:

u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'

How can I convert it to str without losing the content?

To clarify:

I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

Best answer

If you have a unicode value that contains UTF-8 bytes, encode it to Latin-1 to preserve the "bytes":

content = content.encode('latin1')

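This works because the Unicode code points U+0000 through U+00FF map one-to-one onto the Latin-1 encoding, so encoding to Latin-1 turns each character of your data back into the literal byte with the same value.

For your example, this gives me: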
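As a quick check of that one-to-one mapping, here is a minimal sketch that round-trips every possible byte value through Latin-1. The variable names are illustrative only, and the snippet is written to run on both Python 2 and Python 3:

```python
# All 256 possible byte values, in order.
raw = bytes(bytearray(range(256)))

# Decoding as Latin-1 yields code points U+0000..U+00FF, one per byte.
text = raw.decode('latin-1')
code_points = [ord(ch) for ch in text]

# Encoding back as Latin-1 restores the exact original bytes.
round_trip = text.encode('latin-1')
```

Because no byte is lost or altered in either direction, Latin-1 can serve as a lossless bridge between a mis-decoded unicode string and the bytes it came from.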

>>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1')
'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1').decode('utf8')
u'\u5c42\u53e0\u6837\u5f0f\u8868'
>>> print content.encode('latin1').decode('utf8')
层叠样式表
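If you cannot be sure whether a given string was mis-decoded, the round trip above could be wrapped in a small helper. This is only a sketch: repair_mojibake is a hypothetical name, not part of PyQuery or requests, and the fallback is a heuristic that assumes text which fails the round trip was never damaged.

```python
def repair_mojibake(text):
    """Undo UTF-8-decoded-as-Latin-1 damage if possible, else return text."""
    try:
        # Latin-1 turns each code point U+0000..U+00FF back into one byte;
        # those bytes are then decoded with the encoding they really use.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters above U+00FF, or the recovered
        # bytes are not valid UTF-8: assume it was never mis-decoded.
        return text


garbled = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
fixed = repair_mojibake(garbled)
already_ok = repair_mojibake(u'\u5c42\u53e0\u6837\u5f0f\u8868')
```

Note that the helper returns decoded text rather than a byte string; if you need the raw bytes the question asks for, stop after the encode('latin-1') step.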

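PyQuery uses either requests or urllib to retrieve the HTML; with requests, it uses the response's .text attribute. That attribute auto-decodes the response data based solely on the encoding set in the Content-Type header or, if the header specifies no encoding, falls back to Latin-1 (this applies to text responses, and HTML is a text response). You can override this by passing in an encoding argument: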

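# The positional dict must come before the keyword argument;
# a positional argument after a keyword argument is a SyntaxError.
dom = PyQuery('http://zh.wikipedia.org/w/index.php',
              {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'},
              encoding='utf8')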

At that point, you no longer need to re-encode at all.
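To see how the garbled string arises in the first place, the decoding step can be simulated without any network access. This sketch uses only built-in string methods; the body variable stands in for the raw response bytes and is an assumption for illustration, not something PyQuery exposes under that name:

```python
# UTF-8 bytes as a server would send them for the page text.
body = u'\u5c42\u53e0\u6837\u5f0f\u8868'.encode('utf-8')

# Decoding with the Latin-1 fallback: one character per byte (mojibake).
wrong = body.decode('latin-1')

# Decoding with the real encoding: five CJK characters.
right = body.decode('utf-8')

# 15 bytes, 15 mis-decoded characters, 5 correctly decoded characters.
lengths = (len(body), len(wrong), len(right))
```

The wrong value is exactly the string from the question, which is why telling the library the correct encoding up front removes the need for any repair step.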

Regarding python - converting a unicode string whose content is UTF-8 bytes to str, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/14539807/