Problem description
I want to build a search engine, and I am following a tutorial I found on the web. I want to test parsing HTML:
from bs4 import BeautifulSoup

def parse_html(filename):
    """Extract the Author, Title and Text from a HTML file
    which was produced by pdftotext with the option -htmlmeta."""
    with open(filename) as infile:
        html = BeautifulSoup(infile, "html.parser", from_encoding='utf-8')
        d = {'text': html.pre.text}
        if html.title is not None:
            d['title'] = html.title.text
        for meta in html.findAll('meta'):
            try:
                if meta['name'] in ('Author', 'Title'):
                    d[meta['name'].lower()] = meta['content']
            except KeyError:
                continue
    return d

parse_html("C:\\pdf\\pydf\\data\\muellner2011.html")
It raises this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 867: character maps to <undefined>
I saw some solutions on the web that use encode(), but I don't know how to insert the encode() function into my code. Can anyone help me?
Recommended answer
In Python 3, files opened in text mode are decoded to Unicode for you as they are read; you don't need to tell BeautifulSoup what codec to decode from.
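As a minimal sketch of what "opened as text" means, the snippet below reads the same file once in text mode and once in binary mode. The temporary file and its contents are made up for this illustration and are not from the question.

```python
import os
import tempfile

# Hypothetical sample file, written as UTF-8 bytes for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(path, "wb") as f:
    f.write("<pre>caf\u00e9</pre>".encode("utf-8"))

# Text mode: open() decodes the bytes to str using the given codec.
with open(path, encoding="utf-8") as infile:
    text = infile.read()

# Binary mode: no decoding happens, you get the raw bytes back.
with open(path, "rb") as infile:
    raw = infile.read()

print(type(text).__name__, type(raw).__name__)
```

In text mode the decoding has already happened by the time BeautifulSoup sees the data, so a from_encoding argument has nothing left to act on; it is only relevant when BeautifulSoup is handed bytes.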
If decoding the data fails, that's because you didn't tell the open() call what codec to use when reading the file; add the correct codec with an encoding argument:
with open(filename, encoding='utf8') as infile:
    html = BeautifulSoup(infile, "html.parser")
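Otherwise the file is opened with your system's default codec, which depends on the operating system and its locale settings. On Windows this is often a code page such as cp1252, which would explain the 'charmap' error.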
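The failure can be reproduced without BeautifulSoup. The sketch below assumes the file is UTF-8 and contains a right double quotation mark (U+201D), whose UTF-8 encoding ends in byte 0x9d. It uses cp1252 as a stand-in for a typical Windows default codec; that is an assumption about the asker's system, and the file name is hypothetical.

```python
import locale
import os
import tempfile

# Hypothetical sample file: U+201D encodes to the bytes E2 80 9D in UTF-8.
path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(path, "wb") as f:
    f.write("<pre>\u201cquoted\u201d</pre>".encode("utf-8"))

# The locale's preferred codec, which open() normally uses
# when no encoding argument is given.
print("default codec:", locale.getpreferredencoding(False))

# Reading with cp1252 (assumed here as the Windows default) fails on 0x9d,
# because that byte has no character assigned in cp1252.
try:
    with open(path, encoding="cp1252") as infile:
        infile.read()
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("cp1252 failed:", exc)

# Reading with the file's actual codec succeeds.
with open(path, encoding="utf-8") as infile:
    text = infile.read()
print("utf-8 ok:", ascii(text))
```

If the real file is not UTF-8, the same pattern applies with whatever codec the file was actually written in; passing the wrong codec either raises the same kind of error or silently produces garbled characters.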