Problem description
I am trying to load an HTML page and output its text. Even though I am fetching the page correctly, BeautifulSoup somehow destroys the encoding.
Source:
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulSoup
url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)
encodedText = r.text.encode("utf-8")
soup = BeautifulSoup(encodedText)
text = str(soup.findAll(text=True))
print text.decode("utf-8")
Excerpt of the output:
...Odenw\xc3\xa4lderisch...
This should be Odenwälderisch
Recommended answer
You are making two mistakes: you are mishandling the encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.
First of all, don't use response.text! BeautifulSoup is not at fault here; you are re-encoding a Mojibake. The requests library defaults to Latin-1 encoding for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that this is the default.
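The effect described above can be reproduced offline. The following is a minimal Python 3 sketch (the original post uses Python 2) that uses the word from the question as sample data; the variable names are illustrative and no real HTTP request is made.

```python
# Sample data taken from the question; everything else is illustrative.
original = "Odenwälderisch"

raw = original.encode("utf-8")            # the bytes a UTF-8 page actually contains
wrong_text = raw.decode("latin-1")        # what a Latin-1 fallback turns them into
re_encoded = wrong_text.encode("utf-8")   # re-encoding bakes the damage in
right_text = raw.decode("utf-8")          # decoding with the real encoding is lossless

print(wrong_text)   # the Mojibake form
print(right_text)   # the intended text
```

Working from the raw bytes avoids the wrong intermediate decode entirely, which is why the answer switches to response.content.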
Pass in the raw data from response.content instead:
soup = BeautifulSoup(r.content)
I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the beautifulsoup4 project, and use from bs4 import BeautifulSoup.
BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or from statistical analysis of the bytes provided. If the server does provide a character set, you can still pass it into BeautifulSoup from the response, but do test first whether requests used a default:
encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
soup = BeautifulSoup(r.content, from_encoding=encoding)
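As a variation on the substring check above, the charset parameter can be read out of the Content-Type value with the standard library. This is only a sketch written for Python 3: charset_from_content_type is a hypothetical helper name, not part of requests or BeautifulSoup, and the header values in the usage lines are made up.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param() reads parameters of the Content-Type header by default
    # and returns None when the requested parameter is absent.
    return msg.get_param("charset")


# Illustrative header values, not taken from a real response.
declared = charset_from_content_type("text/html; charset=UTF-8")
undeclared = charset_from_content_type("text/html")
```

The result could be passed as from_encoding in place of the conditional expression above; a value of None leaves the detection to BeautifulSoup.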
Last but not least, with BeautifulSoup 4 you can extract all text from a page using soup.get_text():
text = soup.get_text()
print(text)
You are instead converting a result list (the return value of soup.findAll()) to a string. This can never work, because containers in Python use repr() on each element to produce a debugging string, and for strings that means you get escape sequences for anything that is not a printable ASCII character.
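The container behaviour can be seen without BeautifulSoup. The sketch below assumes Python 3, where the closest analogue of a Python 2 str is a bytes object, so the list holds UTF-8 bytes rather than parser output; the names are illustrative.

```python
# A stand-in for the result list; Python 2 str objects were byte strings.
items = ["Odenwälderisch".encode("utf-8")]

# str() on a list calls repr() on each element, producing a debugging
# string with escape sequences instead of the text itself.
as_debug_string = str(items)

# Joining the elements and decoding once keeps the text intact.
joined = b"".join(items).decode("utf-8")

print(as_debug_string)
print(joined)
```

The first print shows the escaped form seen in the question's output, while the second shows the readable word.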