Problem description
Given something like
<a href="urltxt" class="someclass" close="true">texttxt</a>
how can I isolate the url and the text?
Update
I'm using Beautiful Soup, and am unable to figure out how to do that.
I do:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
links = soup.findAll('a')
for link in links:
    print "link content:", link.content, " and attr:", link.attrs
I get:
*link content: None and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root /support.asp')]* ...
...
Why am I missing the content?
edit: elaborated on 'stuck' as advised :)
Recommended answer
Use Beautiful Soup. Doing it yourself is harder than it looks; you'll be better off using a tried and tested module.
Edit:
I think you want:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
By the way, it's a bad idea to open the URL inline like that, because if the request goes wrong the failure is awkward to handle.
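One way to act on that advice is to fetch the page in its own step and deal with failures there, before any parsing happens. The following is a minimal sketch, assuming Python 3's standard `urllib.request` rather than the Python 2 `urllib` used in this thread; `fetch_html` is a hypothetical helper name, not part of any library.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper written for this example.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (urllib.error.URLError, ValueError) as exc:
        # URLError covers network and protocol failures (HTTPError is a
        # subclass); ValueError is raised for strings that are not URLs.
        print("could not fetch %s: %s" % (url, exc))
        return None
```

The caller can then check for `None` and skip parsing, instead of having the error surface inside the parser call.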
Edit 2:
This should show you all the links in a page:
import urlparse, urllib
from BeautifulSoup import BeautifulSoup

url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()
soup = BeautifulSoup(source)

# findAll returns every <a> tag in the document
for item in soup.findAll('a'):
    try:
        link = urlparse.urlparse(item['href'].lower())
    except KeyError:
        # No href attribute: not a valid link
        pass
    else:
        print link
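In Beautiful Soup itself, the text of a tag lives in `link.string` or `link.contents` and the target in `link['href']`. As far as I can tell, `link.content` is treated as a lookup for a child `<content>` tag, which would explain the `None` in the question. For comparison, here is a sketch of pulling out both the URL and the text using only Python 3's standard `html.parser` module instead of Beautiful Soup; `LinkCollector` and `extract_links` are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._href = None      # href of the <a> currently open
        self._text = []        # text fragments seen inside that <a>
        self._in_link = False  # True while we are between <a> and </a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")  # None if href is missing
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_link = False


def extract_links(html):
    """Return a list of (href, text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<a href="urltxt" class="someclass" close="true">texttxt</a>'
print(extract_links(sample))  # [('urltxt', 'texttxt')]
```

Text inside nested tags such as `<b>` is still collected, and an `<a>` without an `href` yields `None` as its target, so the caller can decide whether to skip it.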