I'm using BeautifulSoup to scrape a website. The website's page renders fine in my browser: Oxfam International's report entitled "Offside!" at http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271
In particular, the single and double quotes look fine. They look like html symbols rather than ascii, though strangely, when I view source in FF3, they appear to be normal ascii.
Unfortunately, when I scrape I get something like this
u'Oxfam International\xe2€™s report entitled \xe2€œOffside!
oops, I mean this:
u'Oxfam International\xe2€™s report entitled \xe2€œOffside!
The page's meta data indicates 'iso-8859-1' encoding. I've tried different encodings, played with unicode->ascii and html->ascii third-party functions, and looked at the MS/iso-8859-1 discrepancy. But the fact of the matter is that ™ has nothing to do with a single quote, and I can't seem to turn the unicode+htmlsymbol combo into the right ascii or html symbol--in my limited knowledge, which is why I'm seeking help.
I'd be happy with an ascii double quote, " or "
The problem with the following is that I'm concerned there are other funny symbols being decoded incorrectly.
\xe2€™
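For what it's worth, that pattern is the classic signature of UTF-8 text decoded as cp1252: the curly apostrophe U+2019 is sent as the three UTF-8 bytes 0xE2 0x80 0x99, which cp1252 reads back as the three characters â, €, and ™. A minimal sketch demonstrating this (Python 2, to match the code below):
# U+2019 RIGHT SINGLE QUOTATION MARK, the apostrophe in "International's"
s = u'\u2019'
utf8_bytes = s.encode('utf-8')           # '\xe2\x80\x99'
print repr(utf8_bytes.decode('cp1252'))  # u'\xe2\u20ac\u2122', i.e. "â€™"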
Below is some python to reproduce what I'm seeing, followed by the things I've tried.
import twill
from BeautifulSoup import BeautifulSoup as BSoup

# fetch the page with twill, then parse the raw html with BeautifulSoup
url = 'http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271'
twill.commands.go(url)
soup = BSoup(twill.commands.get_browser().get_html())
ps = soup.body("p")
p = ps[52]
>>> p
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 22: ordinal not in range(128)
>>> p.string
u'Oxfam International\xe2€™s report entitled \xe2€œOffside!<elided>
'
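That traceback, by the way, is the usual Python 2 symptom of unicode data being coerced to an ascii str; encoding explicitly sidesteps it (a sketch; this still prints the mojibake, just without the error):
print p.string.encode('utf-8')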
http://www.fourmilab.ch/webtools/demoroniser/
http://www.crummy.com/software/BeautifulSoup/documentation.html
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
>>> AsciiDammit.asciiDammit(p.decode())
u'<p>Oxfam International\xe2€™s report entitled \xe2€œOffside!
>>> handle_html_entities(p.decode())
u'<p>Oxfam International\xe2\u20ac\u2122s report entitled \xe2\u20ac\u0153Offside!
>>> unicodedata.normalize('NFKC', p.decode()).encode('ascii','ignore')
'<p>Oxfam InternationalTMs report entitled Offside!
>>> htmlStripEscapes(p.string)
u'Oxfam International\xe2TMs report entitled \xe2Offside!
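Incidentally, if the string really is UTF-8 mis-decoded as cp1252 (an assumption, but the \xe2\u20ac\u2122 escapes above point exactly that way), the damage is reversible by round-tripping in the other direction, a sketch:
# re-encode the mojibake as cp1252 to recover the original UTF-8
# bytes, then decode them properly
bad = u'Oxfam International\xe2\u20ac\u2122s report'
print repr(bad.encode('cp1252').decode('utf-8'))  # u'Oxfam International\u2019s report'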
EDIT:
I've tried using a different BS parser:
import html5lib
bsoup_parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("beautifulsoup"))
soup = bsoup_parser.parse(twill.commands.get_browser().get_html())
ps = soup.body("p")
ps[55].decode()
which gives me this
u'<p>Oxfam Internationalxe2u20acu2122s report entitled xe2u20acu0153Offside!
the best case decode seems to give me the same results:
unicodedata.normalize('NFKC', p.decode()).encode('ascii','ignore')
'<p>Oxfam InternationalTMs report entitled Offside!
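The 'InternationalTMs' artifact is NFKC doing its job: U+2122 (™) compatibility-decomposes to the two letters 'TM', while â, €, and œ have no ascii decomposition and are simply dropped by encode('ascii', 'ignore'). A quick check:
import unicodedata
print unicodedata.normalize('NFKC', u'\u2122')  # u'TM'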
EDIT 2:
I am running Mac OS X 10.4 with FF 3.0.7 and Firebug
Python 2.5 (wow, can't believe I didn't state this from the beginning)
That's one seriously messed up page, encoding-wise :-)
There's nothing really wrong with your approach at all. I would probably tend to do the conversion before passing it to BeautifulSoup, just because I'm persnickety:
import urllib
html = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271').read()
h = html.decode('iso-8859-1')
soup = BeautifulSoup(h)
In this case, the page's meta tag is lying about the encoding. The page is actually in utf-8... Firefox's page info reveals the real encoding, and you can actually see this charset in the response headers returned by the server:
curl -i http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271
HTTP/1.1 200 OK
Connection: close
Date: Tue, 10 Mar 2009 13:14:29 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Set-Cookie: COMPANYID=271;path=/
Content-Language: en-US
Content-Type: text/html; charset=UTF-8
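(If you'd rather check from Python than shell out to curl, the charset parameter can be read off the response object; a small sketch using the Python 2 urllib API, where info() returns a mimetools-style message:)
import urllib
resp = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271')
print resp.info().getparam('charset')  # 'UTF-8'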
If you do the decode using 'utf-8', it will work for you (or, at least, it did for me):
import urllib
html = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271').read()
h = html.decode('utf-8')
soup = BeautifulSoup(h)
ps = soup.body("p")
p = ps[52]
print p
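As an aside, BeautifulSoup 3 (which the imports above suggest you're using) ships a UnicodeDammit class that makes this guess for you; a hedged sketch:
from BeautifulSoup import UnicodeDammit
dammit = UnicodeDammit(html)          # html is the raw bytes read above
print dammit.originalEncoding         # what BS detected, e.g. 'utf-8'
soup = BeautifulSoup(dammit.unicode)  # parse the properly decoded markup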