How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?

Problem description

I'm running a Python program which fetches a UTF-8-encoded web page, and I extract some text from the HTML using BeautifulSoup. However, when I write this text to a file (or print it on the console), it gets written in an unexpected encoding.

Sample program:

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)

# Parse with BeautifulSoup
soup = BeautifulSoup(response)

# Print title attribute of a <div> which uses umlauts (e.g. können)
print repr(soup.find('div', id='navbutton_account')['title'])

Running this gives the result:

u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!'

But I would expect a Python Unicode string to render the ö in the word können as \xf6:

u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'

I've tried passing the 'fromEncoding' parameter to BeautifulSoup, and trying to read() and decode() the response object, but it either makes no difference or throws an error.

With the command curl www.voxnow.de | hexdump -C, I can see that the web page is indeed UTF-8 encoded (i.e. it contains 0xc3 0xb6 for the ö character):

20 74 69 74 6c 65 3d 22 48 69 65 72 20 6b c3 b6  | title="Hier k..|
6e 6e 65 6e 20 53 69 65 20 73 69 63 68 20 6b 6f  |nnen Sie sich ko|
73 74 65 6e 6c 6f 73 20 72 65 67 69 73 74 72 69  |stenlos registri|

I'm beyond the limit of my Python abilities, so I'm at a loss as to how to debug this further. Any advice?

Recommended answer

As justhalf points out above, my question here is essentially a duplicate of this question.

The HTML content reported itself as UTF-8 encoded and, for the most part, it was, except for one or two rogue invalid UTF-8 characters. This apparently confuses BeautifulSoup about which encoding is in use. When trying to first decode the content as UTF-8 before passing it to BeautifulSoup, like this:

soup = BeautifulSoup(response.read().decode('utf-8'))

I would get the error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813: invalid continuation byte

Looking more closely at the output, there was an instance of the character Ü which was wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.

As the currently highest-rated answer on that question suggests, the invalid UTF-8 characters can be removed while decoding, so that only valid data is passed to BeautifulSoup:

soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
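To see why the 'ignore' error handler fixes this, here is a minimal, self-contained sketch (runs under Python 2 or 3) that reproduces the failure with a byte string containing the same kind of rogue sequence (0xe3 0x9c) described above, and contrasts the 'ignore' and 'replace' error handlers. The byte string is an illustrative stand-in, not the actual page content:

# Bytes containing valid UTF-8 (0xc3 0xb6 = ö) plus the rogue,
# invalid sequence 0xe3 0x9c described in the answer above.
data = b'Hier k\xc3\xb6nnen \xe3\x9c einloggen'

try:
    data.decode('utf-8')  # strict decoding raises on the bad bytes
except UnicodeDecodeError as e:
    print(e)

# 'ignore' silently drops the invalid bytes ...
print(repr(data.decode('utf-8', 'ignore')))
# ... while 'replace' substitutes U+FFFD for them
print(repr(data.decode('utf-8', 'replace')))

During debugging, 'replace' can be the more useful handler, since the U+FFFD markers show where the corrupt bytes were instead of hiding them.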
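For completeness: urllib2 and the BeautifulSoup 3 module used in the question are Python 2 only. Below is a hedged Python 3 translation of the same fix, using urllib.request and the bs4 package; the URL and the navbutton_account div id are taken from the question above and may no longer exist on the live site:

import urllib.request
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Fetch the page and decode it, dropping any rogue invalid UTF-8 bytes
response = urllib.request.urlopen('http://www.voxnow.de/')
html = response.read().decode('utf-8', 'ignore')

# Parse the cleaned text and look up the element from the question
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', id='navbutton_account')
if div is not None:
    print(repr(div['title']))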