What I'm trying to do: I fetch a list of URIs from a database, download each page, remove the stopwords, count how often each word appears in the page, and then save the result in MongoDB.
The problem: when I try to save the result in the database, I get the error bson.errors.InvalidDocument: the document must be a valid utf-8
It appears to be related to byte sequences such as '\xc3someotherstrangewords' and '\xe2something'. When I process the webpages I remove the punctuation, but I can't remove the accents, because that would give me the wrong words.
What I already tried:
- identifying the character encoding from the webpage's headers;
- using chardet;
- using re.compile(r"[^a-zA-Z]") and/or unicode(variable, 'ascii', 'ignore'); that isn't good for non-English languages because it removes the accented characters.
What I want to know is:
Does anyone know how to identify the characters and convert them to the right word/encoding?
E.g. get '\xe2' from a webpage and convert it to 'â'.
(English isn't my first language, so forgive me.) If anyone wants to see the source code
It is not easy to find out the correct character encoding of a website because the information in the header might be wrong. BeautifulSoup does a pretty good job at guessing the character encoding and automatically decodes it to Unicode.
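from urllib.request import urlopen  # Python 3; on Python 2 use urllib.urlopen

from bs4 import BeautifulSoup

url = 'http://www.google.de'
# Read the raw bytes and let BeautifulSoup guess the encoding
html = urlopen(url).read()
# Naming the parser explicitly avoids a warning in recent bs4 versions
soup = BeautifulSoup(html, 'html.parser')
# text is a Unicode string
text = soup.body.get_text()
# encoded_text holds the same text as UTF-8 bytes. On Python 2 this is a
# utf-8 string that you can store in mongo; on Python 3, store text itself.
encoded_text = text.encode('utf-8')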
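
Once the page is a Unicode string, the word counting from the question can keep the accents. Below is a minimal Python 3 sketch, not the asker's actual code: the sample bytes, the assumption that they are Latin-1, the word_frequencies helper, and the stopword set are all made up for illustration. It shows the byte '\xe2' becoming 'â' once it is decoded with the right codec, and a Unicode-aware regex that drops punctuation and digits without touching accented letters.

```python
import re
from collections import Counter

# Hypothetical sample: bytes as they might arrive from a Latin-1 encoded page
raw = b"p\xe2te, p\xe2te de caf\xe9!"

# Decoding with the matching codec turns the byte 0xE2 into the character 'â'
text = raw.decode("latin-1")

# For str patterns in Python 3, \w is Unicode-aware. [^\W\d_]+ matches runs of
# letters only, so accented letters survive while punctuation is dropped.
WORD_RE = re.compile(r"[^\W\d_]+")


def word_frequencies(text, stopwords=frozenset()):
    """Count lowercase words in text, skipping any word in stopwords."""
    words = (w.lower() for w in WORD_RE.findall(text))
    return Counter(w for w in words if w not in stopwords)


# "de" stands in for a real stopword list
freqs = word_frequencies(text, stopwords={"de"})
```

The result has Unicode string keys and integer values, so dict(freqs) should be storable with pymongo without any manual encoding step.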