I have parsed an HTML page using BeautifulSoup:
user_page = urllib2.urlopen(user_url)
souping_page = bs(user_page)
badges = souping_page.body.find('div', attrs={'class': 'badges'})
After that, my badges object looks like this:
<span><span title="9 gold badges"><span class="badge1"></span><span class="badgecount">9</span></span><span title="38 silver badges"><span class="badge2"></span><span class="badgecount">38</span></span><span title="56 bronze badges"><span class="badge3"></span><span class="badgecount">56</span></span></span>
Now I want to extract the title strings from it, for example 9 gold badges and 38 silver badges. I tried using badges.span.span, but that does not work.

Best answer
Get the parent span from badges, then use find_all() with recursive=False to find all the top-level spans inside it:
from bs4 import BeautifulSoup
page = """<div class="badges">
<span>
<span title="9 gold badges"><span class="badge1"></span><span class="badgecount">9</span></span>
<span title="38 silver badges"><span class="badge2"></span><span class="badgecount">38</span></span>
<span title="56 bronze badges"><span class="badge3"></span><span class="badgecount">56</span></span>
</span>
</div>"""
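# Name the parser explicitly. html.parser ships with Python and does not
# add <html>/<body> wrappers, so search from the soup itself.
soup = BeautifulSoup(page, 'html.parser')
badges = soup.find('div', attrs={'class': 'badges'})
# recursive=False restricts the search to direct children of the outer span.
for span in badges.span.find_all('span', recursive=False):
    print(span.attrs['title'])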
Prints:
9 gold badges
38 silver badges
56 bronze badges
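To see why recursive=False matters here, the following minimal sketch compares it with the default recursive search on the same markup. It assumes Python 3 with bs4 installed and uses the built-in html.parser; the variable names (outer, all_spans, top_spans, titles) are only illustrative.

```python
from bs4 import BeautifulSoup

page = (
    '<div class="badges"><span>'
    '<span title="9 gold badges"><span class="badge1"></span><span class="badgecount">9</span></span>'
    '<span title="38 silver badges"><span class="badge2"></span><span class="badgecount">38</span></span>'
    '<span title="56 bronze badges"><span class="badge3"></span><span class="badgecount">56</span></span>'
    '</span></div>'
)

soup = BeautifulSoup(page, "html.parser")
# The outer span that wraps the three badge spans.
outer = soup.find("div", attrs={"class": "badges"}).span

# Default (recursive=True): every descendant span, including the nested
# badge1/badgecount spans, which carry no title attribute.
all_spans = outer.find_all("span")

# recursive=False: only the direct children, i.e. the spans with a title.
top_spans = outer.find_all("span", recursive=False)

titles = [span["title"] for span in top_spans]
print(len(all_spans), len(top_spans))
print(titles)
```

With the default recursive search, indexing every result with ['title'] would raise a KeyError on the nested spans, which is why the loop is limited to direct children.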
Hope that helps.