I'm using BeautifulSoup to scrape a website. The website's page renders fine in my browser:

In particular, the single and double quotes look fine. They look like HTML symbols rather than ASCII, though strangely, when I view source in FF3, they appear to be normal ASCII.

Unfortunately, when I scrape I get something like this:

u'Oxfam International\xe2€™s report entitled \xe2€œOffside!

The page's meta data indicates 'iso-8859-1' encoding. I've tried different encodings, played with unicode->ascii and html->ascii third-party functions, and looked at the MS/iso-8859-1 discrepancy, but the fact of the matter is that ™ has nothing to do with a single quote, and with my limited knowledge I can't seem to turn the unicode+htmlsymbol combo into the right ASCII or HTML symbol, which is why I'm seeking help.

I'd be happy with an ASCII double quote, either " or &quot;

The problem with replacing these sequences by hand is that I'm concerned there are other funny symbols being decoded incorrectly.


Below is some Python to reproduce what I'm seeing, followed by the things I've tried.

import twill
from twill import get_browser
from twill.commands import go

from BeautifulSoup import BeautifulSoup as BSoup

url = 'http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271'
# fetch the page first so that get_html() has something to return
twill.commands.go(url)
soup = BSoup(twill.commands.get_browser().get_html())
ps = soup.body("p")
p = ps[52]

>>> p
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 22: ordinal not in range(128)

>>> p.string
u'Oxfam International\xe2€™s report entitled \xe2€œOffside!<elided>





>>> AsciiDammit.asciiDammit(p.decode())
u'<p>Oxfam International\xe2€™s report entitled \xe2€œOffside!

>>> handle_html_entities(p.decode())
u'<p>Oxfam International\xe2\u20ac\u2122s report entitled \xe2\u20ac\u0153Offside!

>>> unicodedata.normalize('NFKC', p.decode()).encode('ascii','ignore')
'<p>Oxfam International€™s report entitled €œOffside!

>>> htmlStripEscapes(p.string)
u'Oxfam International\xe2TMs report entitled \xe2Offside!


I've tried using a different BS parser:

import html5lib
bsoup_parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("beautifulsoup"))
soup = bsoup_parser.parse(twill.commands.get_browser().get_html())
ps = soup.body("p")

which gives me this:

u'<p>Oxfam International\xe2\u20ac\u2122s report entitled \xe2\u20ac\u0153Offside!

The best-case decode seems to give me the same results:

unicodedata.normalize('NFKC', p.decode()).encode('ascii','ignore')
'<p>Oxfam InternationalTMs report entitled Offside!
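
For what it's worth, that "TMs" result is what NFKC normalization followed by an ASCII encode with 'ignore' would be expected to produce from the garbled text, so this step is hiding the problem rather than fixing it. The sketch below is written for Python 3 (an assumption; the session above is Python 2.5), and the `garbled` string is retyped from the output shown earlier, not fetched from the page.

```python
import unicodedata

# Mis-decoded text as shown above: a-circumflex, euro sign, then
# trade mark sign (or the oe ligature for the opening double quote).
garbled = 'Oxfam International\xe2\u20ac\u2122s report entitled \xe2\u20ac\u0153Offside!'

# NFKC replaces U+2122 (trade mark sign) with the two ASCII letters "TM".
normalized = unicodedata.normalize('NFKC', garbled)

# errors='ignore' then silently drops every remaining non-ASCII character.
ascii_only = normalized.encode('ascii', 'ignore').decode('ascii')

print(ascii_only)
```

The quote characters are lost either way, which suggests the fix has to happen at decode time, before any normalization.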


I am running Mac OS X 4 with FF 3.0.7 and Firebug

Python 2.5 (wow, can't believe I didn't state this from the beginning)


That's one seriously messed up page, encoding-wise :-)

There's nothing really wrong with your approach at all. I would probably tend to do the conversion before passing it to BeautifulSoup, just because I'm persnickety:

import urllib
from BeautifulSoup import BeautifulSoup

html = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271').read()
# decode using the encoding the page's meta tag claims
h = html.decode('iso-8859-1')
soup = BeautifulSoup(h)

In this case, the page's meta tag is lying about the encoding. The page is actually in UTF-8. Firefox's page info reveals the real encoding, and you can see the same charset in the response headers returned by the server:

curl -i 'http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271'
HTTP/1.1 200 OK
Connection: close
Date: Tue, 10 Mar 2009 13:14:29 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Set-Cookie: COMPANYID=271;path=/
Content-Language: en-US
Content-Type: text/html; charset=UTF-8
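
If you would rather not hard-code the encoding, one option is to read the charset out of that Content-Type header and trust it over the meta tag. This is only a sketch for Python 3 using the standard library's `email.message.Message`; the helper name `charset_from_header` and the 'iso-8859-1' fallback are my own assumptions, not anything BeautifulSoup or twill provides.

```python
from email.message import Message


def charset_from_header(content_type, default='iso-8859-1'):
    """Return the charset parameter of a Content-Type value, lowercased.

    Falls back to `default` (a hypothetical choice) when the header
    carries no charset parameter.
    """
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset(default)


# Header value copied from the curl output above.
header_value = 'text/html; charset=UTF-8'
print(charset_from_header(header_value))
```

The returned name can be passed straight to `bytes.decode()` on the raw response body.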

If you do the decode using 'utf-8', it will work for you (or, at least, it did for me):

import urllib
from BeautifulSoup import BeautifulSoup

html = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271').read()
# decode using the encoding the server actually sends
h = html.decode('utf-8')
soup = BeautifulSoup(h)
ps = soup.body("p")
p = ps[52]
print p
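
To see why the wrong codec produces exactly the â€™ sequence from the question, here is a small offline sketch in Python 3. The `raw` bytes are my reconstruction of the relevant fragment (an assumption based on the output in the question, not bytes captured from the server). I'm also assuming the parser treated the declared iso-8859-1 as Windows-1252, which is how browsers and HTML5-style parsers handle that label; the output in the question is consistent with that.

```python
# UTF-8 bytes for the right single quote (U+2019) and the left
# double quote (U+201C), embedded in the sentence from the question.
raw = b'Oxfam International\xe2\x80\x99s report entitled \xe2\x80\x9cOffside!'

# Correct: decode with the encoding the server actually sends.
as_utf8 = raw.decode('utf-8')

# Wrong: each UTF-8 byte becomes its own Windows-1252 character,
# so E2 80 99 turns into a-circumflex, euro sign, trade mark sign.
as_cp1252 = raw.decode('cp1252')

# If all you have is the garbled text, reversing the bad decode
# recovers the original bytes, which can then be decoded properly.
repaired = as_cp1252.encode('cp1252').decode('utf-8')

# Optional: map typographic quotes to plain ASCII quotes.
ASCII_QUOTES = {0x2018: "'", 0x2019: "'", 0x201C: '"', 0x201D: '"'}
plain = as_utf8.translate(ASCII_QUOTES)

print(repr(as_cp1252))
print(plain)
```

The round trip through 'cp1252' only works when every garbled character maps back to a single byte, so decoding correctly in the first place is the more robust fix.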

