Problem description
I'm trying to scrape the contents of a website. However, the output contains unwanted spaces, so I'm not able to interpret it. I'm using this simple code:
import urllib2
from bs4 import BeautifulSoup
html= 'http://idlebrain.com/movie/archive/index.html'
soup = BeautifulSoup(urllib2.urlopen(html).read())
print(soup.prettify(formatter=None))
OUTPUT (the output is very large, so here is a small part of it to show the problem I'm facing):
<html><head><title>Telugu cinema reviews by Jeevi - idlebrain.com</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
</head><body bgcolor="#FFFFFF" leftmargin="0" marginheight="0" marginwidth="0" topmargin="0"><table border="0" cellpadding="0" cellspacing="0" width="96%">
<tr>
<td align="left"> <img alt="Idlebrain.Com" height="63" src="../../image/vox_r01_c2.gif" width="264"/></td>
<td><div align="right"><script type="text/javascript"><!--
g o o g l e _ a d _ c l i e n t = " c a - p u b - 8 8 6 3 7 1 8 7 5 2 0 4 9 7 3 9 " ;
/ * r e v i e w s - h o r * /
g o o g l e _ a d _ s l o t = " 1 6 4 8 6 2 0 2 7 3 " ;
g o o g l e _ a d _ w i d t h = 7 2 8 ;
g o o g l e _ a d _ h e i g h t = 9 0 ;
/ / - - >
< / s c r i p t >
< s c r i p t t y p e = " t e x t / j a v a s c r i p t "
s r c = " h t t p : / / p a g e a d 2 . g o o g l e s y n d i c a t i o n . c o m / p a g e a d / s h o w _ a d s . j s " >
< / s c r i p t >
< / d i v >
< / t d >
< / t r >
< / t a b l e >
< t a b l e w i d t h = " 9 6 % " b o r d e r = " 0 " c e l l s p a c i n g = " 0 " c e l l p a d d i n g = " 0 " >
< t r >
< t d w i d t h = " 1 2 8 " v a l i g n = " t o p " a l i g n = " l e f t " >
< t a b l e b o r d e r = " 0 " c e l l p a d d i n g = " 0 " c e l l s p a c i n g = " 0 " w i d t h = " 1 1 9 " >
< / t r >
< / t a b l e >
< / b o d y >
< / h t m l >
</script></div></td></tr></table></body></html>
Thanks!!!!
Recommended answer
You can specify the html.parser parser:
soup = BeautifulSoup(urllib2.urlopen(html).read(), 'html.parser')
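The snippets in this thread use Python 2's urllib2. As a minimal sketch, assuming Python 3 and a current bs4 release, the same idea of naming the parser explicitly could look like the following. The helper names parse_html and fetch_soup are hypothetical and only serve the illustration; the URL is the one from the question.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

URL = 'http://idlebrain.com/movie/archive/index.html'  # URL from the question


def parse_html(markup, parser='html.parser'):
    """Parse markup with an explicitly named parser instead of bs4's default pick."""
    return BeautifulSoup(markup, parser)


def fetch_soup(url=URL, parser='html.parser'):
    """Download the page and parse it (needs network access)."""
    with urlopen(url) as response:
        return parse_html(response.read(), parser)


if __name__ == '__main__':
    # Offline check on a small inline document, so no network is needed.
    soup = parse_html('<html><head><title>demo</title></head><body><p>hi</p></body></html>')
    print(soup.title.string)
```

Calling fetch_soup() with no arguments would download the page from the question and parse it with html.parser. Passing a different parser name as the second argument lets you compare how each parser handles the same page.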
Or you can specify the html5 parser:
soup = BeautifulSoup(urllib2.urlopen(html).read(), 'html5')
Haven't installed the html5 parser yet? Install it from the command line:
sudo apt-get install python-html5lib
You can also use the xml parser, but you may see some differences in multi-valued attributes like class="foo bar":
soup = BeautifulSoup(urllib2.urlopen(html).read(), 'xml')
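To make the multi-valued attribute difference concrete, here is a small sketch, assuming Python 3 with bs4 installed. The helper class_attribute and the sample markup are hypothetical. The xml parser depends on lxml, so the script handles the case where lxml is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

MARKUP = '<p class="foo bar">text</p>'  # made-up sample markup


def class_attribute(markup, parser):
    """Return the class attribute of the first <p> tag under the given parser."""
    return BeautifulSoup(markup, parser).p['class']


if __name__ == '__main__':
    # HTML parsers split class into a list of separate values.
    print(class_attribute(MARKUP, 'html.parser'))
    try:
        # The xml parser (requires lxml) keeps the attribute as one string.
        print(class_attribute(MARKUP, 'xml'))
    except FeatureNotFound:
        print('lxml is not installed, so the xml parser is unavailable')
```

If your code looks tags up by class, check which form of the value your chosen parser gives you before comparing against it.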