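These are some links I copied from the site I am scraping. The problem is that some of the main categories in the sitemap appear more than once, for example "Fashion", "Audio Visual" and "Computers Servers", but I only need each of these links once. How can I do that? I tried using a variable named "counter" to check for the second occurrence, but that did not help either.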

<a href="http://www.example.com/networking-storage">Networking Storage</a>
<a href="http://www.example.com/mobiles-tablets">Mobiles Tablets</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/printers-scanners">Printers Scanners</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/cameras">Cameras</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>


This is my Python code to get these links:

import requests
from lxml import html

mainPage = requests.get("http://www.example.com/catalog/seo_sitemap/category/?p=1")
mainTree = html.fromstring(mainPage.text)

# Print the href of every anchor on the page
for mainCat in mainTree.cssselect('a'):
    print(mainCat.get('href'))


It prints:

http://www.example.com/mobiles-tablets
http://www.example.com/fashion
http://www.example.com/fashion
http://www.example.com/printers-scanners
http://www.example.com/audio-visual
http://www.example.com/audio-visual
http://www.example.com/cameras
http://www.example.com/computers-servers
http://www.example.com/computers-servers


Whereas I need this:

http://www.example.com/mobiles-tablets
http://www.example.com/fashion
http://www.example.com/printers-scanners
http://www.example.com/audio-visual
http://www.example.com/cameras
http://www.example.com/computers-servers

Best answer

The code below works for me:

import requests  # used only by the commented-out live request below
from lxml import html


s='''<a href="http://www.example.com/mobiles-tablets">Mobiles Tablets</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/printers-scanners">Printers Scanners</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/cameras">Cameras</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>'''


# For the live page, fetch it and parse mainPage.text instead of s:
# mainPage = requests.get("http://www.example.com/catalog/seo_sitemap/category/?p=1")
mainTree = html.fromstring(s)

# A set keeps each href only once (but does not preserve order)
lnks = set(i.get('href') for i in mainTree.cssselect('a'))
for i in lnks:
    print(i)


It prints (a set is unordered, so the order may differ from the input):

http://www.example.com/mobiles-tablets
http://www.example.com/printers-scanners
http://www.example.com/fashion
http://www.example.com/audio-visual
http://www.example.com/computers-servers
http://www.example.com/cameras
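
Since the output requested in the question keeps the links in their original order, an order-preserving variant may be closer to what is needed. The following is only a minimal sketch, assuming Python 3.7+ (where dicts keep insertion order); the helper name unique_in_order is hypothetical, and the hard-coded hrefs list stands in for the list that would be built from mainTree.cssselect('a').

```python
def unique_in_order(hrefs):
    """Return hrefs with duplicates removed, keeping first-seen order."""
    # dict keys are unique and, since Python 3.7, keep insertion order
    return list(dict.fromkeys(hrefs))


# Stand-in for: [a.get('href') for a in mainTree.cssselect('a')]
hrefs = [
    "http://www.example.com/mobiles-tablets",
    "http://www.example.com/fashion",
    "http://www.example.com/fashion",
    "http://www.example.com/printers-scanners",
    "http://www.example.com/audio-visual",
    "http://www.example.com/audio-visual",
    "http://www.example.com/cameras",
    "http://www.example.com/computers-servers",
    "http://www.example.com/computers-servers",
]

for link in unique_in_order(hrefs):
    print(link)
```

Unlike the set version, this removes duplicates wherever they occur while leaving the first occurrence of each link in place.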
