BeautifulSoup is not giving me Unicode

Question

I'm using Beautiful Soup to scrape data. The BS documentation states that BS should always return Unicode, but I can't seem to get Unicode. Here's a code snippet:

import urllib2
from libs.BeautifulSoup import BeautifulSoup

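# Fetch and parse the data (Python 2, BeautifulSoup 3)
url = 'http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2007?skin=print.pattern'

data = urllib2.urlopen(url).read()
# read() returns a byte string (str), not unicode
print 'Type of fetched HTML : %s' % type(data)

soup = BeautifulSoup(data)
# originalEncoding is the encoding BS detected for the input
print 'Encoding of souped up HTML : %s' % soup.originalEncoding

table = soup.table
# renderContents() returns an encoded byte string by default
print type(table.renderContents())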

The original data returned from the page is a string. BS shows the original encoding as ISO-8859-1. I thought that BS automatically converted everything to Unicode so why is it that when I do this:

table = soup.table
print type(table.renderContents())

...it gives me a string object and not Unicode?

How can I get a Unicode object from BS?

I'm really, really lost with this. Any help? Thanks in advance.

Recommended answer

As you may have noticed, renderContents returns (by default) a string encoded in UTF-8. If you really want a Unicode string representing the entire document, you can call unicode(soup), or decode the output of renderContents/prettify with unicode(soup.prettify(), "utf-8").
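
The decode step can be tried without Beautiful Soup at all. The sketch below is only an illustration under stated assumptions: the byte string `rendered` is a made-up stand-in for what renderContents() would return, not data from the page in the question, and the variable names are hypothetical.

```python
# Stand-in for the UTF-8 byte string that renderContents() returns.
# The markup is invented for this example.
rendered = u"<td>Caf\u00e9 meeting</td>".encode("utf-8")

# Decoding with the same codec turns the bytes back into a Unicode string.
text = rendered.decode("utf-8")

# type(u"") is `unicode` on Python 2 and `str` on Python 3,
# so these checks work on either version.
is_byte_string = not isinstance(rendered, type(u""))
is_unicode = isinstance(text, type(u""))
```

On Python 2, unicode(rendered, "utf-8") is equivalent to rendered.decode("utf-8"), which is the form used in the answer above.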

Related

  • How to render contents of a tag in unicode in BeautifulSoup?
