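Here are some links I copied from a website I am scraping. The problem is that some of the main categories in the sitemap appear more than once, for example "Fashion", "Audio Visual" and "Computers Servers", but I only need each of these links once. How can I achieve this? I tried using a variable named "counter" to detect the second occurrence, but that did not help either.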
<a href="http://www.example.com/networking-storage">Networking Storage</a>
<a href="http://www.example.com/mobiles-tablets">Mobiles Tablets</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/printers-scanners">Printers Scanners</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/cameras">Cameras</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>
Here is my Python code to get these links:
import requests
from lxml import html

mainPage = requests.get("http://www.example.com/catalog/seo_sitemap/category/?p=1")
mainTree = html.fromstring(mainPage.text)
for mainCat in mainTree.cssselect('a'):
    print(mainCat.get('href'))
It prints:
http://www.example.com/mobiles-tablets
http://www.example.com/fashion
http://www.example.com/fashion
http://www.example.com/printers-scanners
http://www.example.com/audio-visual
http://www.example.com/audio-visual
http://www.example.com/cameras
http://www.example.com/computers-servers
http://www.example.com/computers-servers
whereas I need this:
http://www.example.com/mobiles-tablets
http://www.example.com/fashion
http://www.example.com/printers-scanners
http://www.example.com/audio-visual
http://www.example.com/cameras
http://www.example.com/computers-servers
Best answer
The code below worked for me:
import requests
from lxml.cssselect import CSSSelector
from lxml import html
s='''<a href="http://www.example.com/mobiles-tablets">Mobiles Tablets</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/printers-scanners">Printers Scanners</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/cameras">Cameras</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>'''
# mainPage = requests.get("http://www.example.com/catalog/seo_sitemap/category/?p=1")
mainTree = html.fromstring(s)
# A set stores each href only once; its iteration order is not guaranteed
lnks = set(i.get('href') for i in mainTree.cssselect('a'))
for i in lnks:
    print(i)
It prints (the order may differ between runs, because a set is unordered):
http://www.example.com/mobiles-tablets
http://www.example.com/printers-scanners
http://www.example.com/fashion
http://www.example.com/audio-visual
http://www.example.com/computers-servers
http://www.example.com/cameras
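The set removes the duplicates but loses the page order, while the output asked for in the question keeps the links in the order they appear. A minimal sketch of one way to keep that order, assuming Python 3.7+ (where dict keys preserve insertion order); the helper name unique_in_order is hypothetical, and skipping None values is an added assumption to cover anchors that have no href attribute:

```python
def unique_in_order(hrefs):
    """Return hrefs with duplicates removed, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+);
    # None entries (anchors without an href) are skipped: an assumption.
    return list(dict.fromkeys(h for h in hrefs if h is not None))


# Sample data taken from the question's printed output
hrefs = [
    "http://www.example.com/mobiles-tablets",
    "http://www.example.com/fashion",
    "http://www.example.com/fashion",
    "http://www.example.com/printers-scanners",
    "http://www.example.com/audio-visual",
    "http://www.example.com/audio-visual",
    "http://www.example.com/cameras",
    "http://www.example.com/computers-servers",
    "http://www.example.com/computers-servers",
]

for link in unique_in_order(hrefs):
    print(link)
```

With the lxml tree from the answer, the same helper could be called as unique_in_order(a.get('href') for a in mainTree.cssselect('a')).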