Extracting links from only certain sections of a Blogspot page with BeautifulSoup
Question
I am trying to extract links from only a certain section of a Blogspot page. However, the output shows that my code extracts every link on the page.
Here is the code:
import urlparse
import urllib
from bs4 import BeautifulSoup

url = "http://ellywonderland.blogspot.com/"

urls = [url]
visited = [url]

while len(urls) > 0:
    try:
        htmltext = urllib.urlopen(urls[0]).read()
    except:
        print urls[0]

    soup = BeautifulSoup(htmltext)

    urls.pop(0)
    print len(urls)

    for tags in soup.find_all(attrs={'class': "post-title entry-title"}):
        for tag in soup.findAll('a', href=True):
            tag['href'] = urlparse.urljoin(url, tag['href'])
            if url in tag['href'] and tag['href'] not in visited:
                urls.append(tag['href'])
                visited.append(tag['href'])

print visited
Here is the HTML for the section that I want to extract:
<h3 class="post-title entry-title" itemprop="name">
<a href="http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html">Pre-wedding * Vintage*</a>
Thank you.
Recommended answer
If you don't necessarily need to use BeautifulSoup, I think it would be easier to do something like this:
import feedparser

# Parse the blog's RSS feed; each entry corresponds to one post.
feed = feedparser.parse('http://ellywonderland.blogspot.com/feeds/posts/default?alt=rss')
for entry in feed.entries:
    print(entry.link)  # URL of the post
Output:
http://ellywonderland.blogspot.com/2011/03/my-vintage-pre-wedding.html
http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html
http://ellywonderland.blogspot.com/2010/12/tissue-paper-flower-crepe-paper.html
http://ellywonderland.blogspot.com/2010/12/menguap-menurut-islam.html
http://ellywonderland.blogspot.com/2010/12/weddings-idea.html
http://ellywonderland.blogspot.com/2010/12/kawin.html
http://ellywonderland.blogspot.com/2010/11/vitamin-c-collagen.html
http://ellywonderland.blogspot.com/2010/11/port-dickson.html
http://ellywonderland.blogspot.com/2010/11/ellys-world.html
feedparser can parse the RSS feed of the Blogspot page and return the data you want, in this case the href for each post title.
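If BeautifulSoup is still preferred, the likely cause of the behaviour in the question is that the inner loop searches the whole document (`soup.findAll`) instead of the matched heading (`tags`). The following is a minimal sketch of the scoped search, not a definitive fix. It runs on an inline HTML string rather than the live blog. The function name `extract_post_links`, the `sample` markup (including its closing `</h3>` and the sidebar link), and the choice of `html.parser` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def extract_post_links(html):
    """Return the href of every <a> nested inside a post-title heading."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # Match elements whose class list contains "post-title".
    for heading in soup.find_all(class_="post-title"):
        # Search inside the heading only, not the whole document.
        for a in heading.find_all("a", href=True):
            links.append(a["href"])
    return links


# Hypothetical sample: the heading from the question plus an unrelated sidebar link.
sample = """
<h3 class="post-title entry-title" itemprop="name">
<a href="http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html">Pre-wedding * Vintage*</a>
</h3>
<div class="sidebar"><a href="http://ellywonderland.blogspot.com/search/label/other">Other</a></div>
"""

print(extract_post_links(sample))
```

Only the link inside the heading is returned; the sidebar link is skipped because the inner `find_all` is called on `heading` rather than on `soup`.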