Question
I need to get all the text from a page using BeautifulSoup. The BeautifulSoup documentation says you can call soup.get_text() to do this. When I tried it on reddit.com, I got this error:
UnicodeEncodeError in soup.py:16
'cp932' codec can't encode character u'\xa0' in position 2262: illegal multibyte sequence
I get similar errors on most of the sites I checked.
I also got a similar error from soup.prettify(), but I fixed it by changing the call to soup.prettify('UTF-8'). Is there a way to fix this for get_text() as well? Thanks in advance!
Update, June 24
I found some code that seems to work for other people, but I still need the output in UTF-8 rather than the default encoding. Code:
    texts = soup.findAll(text=True)

    def visible(element):
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            return False
        elif re.match('<!--.*-->', str(element)):
            return False
        elif re.match('\n', str(element)):
            return False
        return True

    visible_texts = filter(visible, texts)
    print visible_texts
The error is different now, though. Is that progress?
UnicodeEncodeError in soup.py:29
'ascii' codec can't encode character u'\xbb' in position 1: ordinal not in range(128)
Recommended answer
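soup.get_text() returns a Unicode string, which is why you're getting the error: printing it makes Python encode the text with your console's codec (cp932 here), and that codec cannot represent characters such as u'\xa0'.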
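The failure can be reproduced without BeautifulSoup or a network request. The following is a minimal sketch, assuming Python 3; the helper name try_encode and the sample string are made up for illustration and are not part of the original question.

```python
def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

# Hypothetical sample containing a no-break space (\xa0) and a guillemet (\xbb),
# the two characters named in the error messages above.
sample = "reddit\xa0\xbb front page"

for codec in ("ascii", "cp932", "utf-8"):
    result = try_encode(sample, codec)
    status = "fails" if result is None else "works"
    print(codec, status)
```

Only a codec that covers the characters in the text can encode it, which is why switching the output encoding to UTF-8 makes the error go away.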
You can solve this in a number of ways. One is to set the encoding at the shell level:
export PYTHONIOENCODING=UTF-8
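One way to check that the variable takes effect is to start a child interpreter with it set and ask which encoding it chose for standard output. This is a rough sketch, assuming Python 3; the function name child_stdout_encoding is hypothetical.

```python
import os
import subprocess
import sys

def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and return
    the encoding it reports for sys.stdout."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii").strip()

print(child_stdout_encoding("UTF-8"))
```

The exact spelling of the reported name may differ between Python versions, so compare it case-insensitively if you test for it.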
Alternatively, you can reload sys and set the default encoding by including this in your script (Python 2 only):
    import sys

    if __name__ == "__main__":
        reload(sys)
        sys.setdefaultencoding("utf-8")
Or you can encode the string as UTF-8 in code. For your reddit problem, something like the following would work:
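    import urllib
    from bs4 import BeautifulSoup

    url = "https://www.reddit.com/r/python"
    html = urllib.urlopen(url).read()  # Python 2; on Python 3 use urllib.request.urlopen
    soup = BeautifulSoup(html, "html.parser")

    # get_text() returns unicode; encode it explicitly before printing
    text = soup.get_text()
    print(text.encode('utf-8'))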
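On Python 3, printing the result of encode() shows the bytes literal (b'...') rather than the text, so the same idea is usually expressed by writing the encoded bytes to a binary stream. The sketch below assumes Python 3; write_utf8 is a hypothetical helper, and an in-memory buffer stands in for the console.

```python
import io

def write_utf8(text, binary_stream):
    """Encode text as UTF-8 and write it, plus a newline, to a binary stream."""
    binary_stream.write(text.encode("utf-8"))
    binary_stream.write(b"\n")

# In-memory stand-in for the console; a real script could pass
# sys.stdout.buffer instead.
buffer = io.BytesIO()
write_utf8("r/python \xbb hot\xa0posts", buffer)
print(len(buffer.getvalue()), "bytes written")
```

In a script, calling write_utf8(soup.get_text(), sys.stdout.buffer) would bypass the console codec entirely, at the cost of the terminal having to interpret the bytes as UTF-8.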