Scraping with the correct character encoding (Python requests + BeautifulSoup)

I have an issue parsing this website: http://fm4-archiv.at/files.php?cat=106

It contains special characters such as umlauts. My Chrome browser displays the umlauts properly on that page.
However, on other pages (e.g. http://fm4-archiv.at/files.php?cat=105) the umlauts are not displayed properly.

The meta HTML tag defines the following charset on the pages:

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>

I use the Python requests package to get the HTML and then use BeautifulSoup to scrape the desired data. My code is as follows:

    r = requests.get(URL)
    soup = BeautifulSoup(r.content, "lxml")

If I print the encoding (print(r.encoding)), the result is UTF-8. If I manually change the encoding to ISO-8859-1 or cp1252 by setting r.encoding = 'ISO-8859-1', nothing changes when I output the data on the console. This is also my main issue.

    r = requests.get(URL)
    r.encoding = 'ISO-8859-1'
    soup = BeautifulSoup(r.content, "lxml")

still results in the following string shown on the console output in my Python IDE:

    Der WildlÃ¶wenpfleger

instead it should be

    Der Wildlöwenpfleger

How can I change my code to parse the umlauts properly?

Solution

In general, instead of using r.content, which is the byte string received, use r.text, which is the content decoded with the encoding determined by requests. Setting r.encoding only affects r.text; r.content always stays the raw bytes, which is why changing r.encoding made no difference in the code above.

In this case requests will use UTF-8 to decode the incoming byte string because this is the encoding reported by the server in the Content-Type header:

    >>> import requests
    >>> from bs4 import BeautifulSoup
    >>> r = requests.get('http://fm4-archiv.at/files.php?cat=106')
    >>> type(r.content)    # raw content
    <class 'bytes'>
    >>> type(r.text)       # decoded to unicode
    <class 'str'>
    >>> r.headers['Content-Type']
    'text/html; charset=UTF-8'
    >>> r.encoding
    'UTF-8'
    >>> soup = BeautifulSoup(r.text, 'lxml')

That will fix the "Wildlöwenpfleger" problem; however, other parts of the page then begin to break, for example:

    >>> soup = BeautifulSoup(r.text, 'lxml')    # using decoded string... should work
    >>> soup.find_all('a')[39]
    <a href="details.php?file=1882">Der Wildlöwenpfleger</a>
    >>> soup.find_all('a')[10]
    <a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)
    </a>

This shows that "Wildlöwenpfleger" is fixed, but now "übergeben" and other words in the second link are broken.

It appears that multiple encodings are used in the one HTML document. The first link uses UTF-8 encoding:

    >>> r.content[8013:8070].decode('iso-8859-1')
    '<a href="details.php?file=1882">Der WildlÃ¶wenpfleger</a>'
    >>> r.content[8013:8070].decode('utf8')
    '<a href="details.php?file=1882">Der Wildlöwenpfleger</a>'

but the second link uses ISO-8859-1 encoding:

    >>> r.content[2868:3132].decode('iso-8859-1')
    '<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon übergeben. Auf Streifzügen durch die Popliteratur stößt Hermes auf deren große Themen und hört mit euch quer. In der heutige">Salon Hermes (6 files)\r\n</a>'
    >>> r.content[2868:3132].decode('utf8', 'replace')
    '<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)\r\n</a>'

Obviously it is incorrect to use multiple encodings in the same HTML document. Other than contacting the document's author and asking for a correction, there is not much that you can easily do to handle the mixed encoding. Perhaps you can run chardet.detect() over the data as you process it, but it's not going to be pleasant.
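If the page cannot be fixed at the source, one possible workaround is to decode the raw bytes piece by piece. The sketch below is not the chardet approach mentioned above; it is a simpler try-then-fall-back technique. It assumes that each line of the document uses a single encoding, and that any line which is not valid UTF-8 is ISO-8859-1. The helper name decode_mixed and the sample bytes are made up for illustration.

```python
def decode_mixed(raw: bytes, primary: str = "utf-8",
                 fallback: str = "iso-8859-1") -> str:
    """Decode bytes line by line, trying `primary` first.

    Hypothetical helper: assumes every line uses exactly one of the
    two encodings, which may not hold for every page.
    """
    pieces = []
    # keepends=True preserves the original line endings (\n, \r\n, \r).
    for line in raw.splitlines(keepends=True):
        try:
            pieces.append(line.decode(primary))
        except UnicodeDecodeError:
            # ISO-8859-1 maps every byte to a character, so this cannot fail.
            pieces.append(line.decode(fallback))
    return "".join(pieces)


# Made-up sample imitating the page: one UTF-8 line, one ISO-8859-1 line.
sample = (
    '<a href="details.php?file=1882">Der Wildlöwenpfleger</a>\r\n'.encode("utf-8")
    + '<a title="Salon übergeben">Salon Hermes</a>\r\n'.encode("iso-8859-1")
)
text = decode_mixed(sample)
print(text)
```

With a real response, the decoded string could then be handed to the parser, e.g. BeautifulSoup(decode_mixed(r.content), 'lxml'). Note the limits of this heuristic: an ISO-8859-1 line whose bytes happen to form valid UTF-8 would be decoded incorrectly, and a single line mixing both encodings would fall back to ISO-8859-1 as a whole.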