Problem description
I'm using Beautiful Soup to scrape data. The BS documentation states that BS should always return Unicode, but I can't seem to get Unicode. Here's a code snippet:
import urllib2
from libs.BeautifulSoup import BeautifulSoup
# Fetch and parse the data
url = 'http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2007?skin=print.pattern'
data = urllib2.urlopen(url).read()
print 'Encoding of fetched HTML : %s' % type(data)
soup = BeautifulSoup(data)
print 'Encoding of souped up HTML : %s' % soup.originalEncoding
table = soup.table
print type(table.renderContents())
The original data returned from the page is a string. BS reports the original encoding as ISO-8859-1. I thought that BS automatically converted everything to Unicode, so why is it that when I do this:
table = soup.table
print type(table.renderContents())
...it gives me a string object and not Unicode?
How can I get a Unicode object from BS?
I'm really, really lost with this. Any help? Thanks in advance.
Recommended answer
As you may have noticed, renderContents returns (by default) a byte string encoded in UTF-8. If you really want a Unicode string representing the entire document, you can either call unicode(soup) or decode the output of renderContents/prettify, for example unicode(soup.prettify(), "utf-8").
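To make the bytes-versus-Unicode distinction concrete, here is a minimal sketch of the round trip described above. It is written for Python 3 and deliberately does not import Beautiful Soup, so it only imitates the behaviour: raw_page is a hypothetical stand-in for the bytes returned by urlopen(url).read(), and the encode step stands in for what renderContents is described as doing by default. Treat it as an illustration of the principle, not as the library's implementation.

```python
# Python 3 sketch of the encode/decode round trip; no Beautiful Soup needed.

# Hypothetical stand-in for the ISO-8859-1 bytes fetched from the page.
raw_page = "<table><tr><td>caf\u00e9</td></tr></table>".encode("iso-8859-1")

# Step 1: a parser decodes the source bytes into Unicode text internally.
text = raw_page.decode("iso-8859-1")

# Step 2: a renderer that defaults to UTF-8 hands back encoded bytes,
# which is why the question sees a plain string rather than Unicode.
rendered = text.encode("utf-8")

# Step 3: decoding the rendered bytes yields Unicode text again.
# (The Python 2 spelling from the answer is unicode(rendered, "utf-8").)
recovered = rendered.decode("utf-8")

print(type(rendered).__name__, type(recovered).__name__)
```

The point to take away is that the rendered output and the parsed document hold the same characters; only the final decode step is missing in the question's code.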
Related
- How to render contents of a tag in unicode in BeautifulSoup?