This article looks at how to deal with duplicated results in Scrapy. It should be a useful reference for anyone hitting the same problem.

Problem description


I am trying to get the names of the songs from this site https://pagalworld.me/category/11598/Latest%20Bollywood%20Hindi%20Mp3%20Songs%20-%202017.html using a link extractor, but the results are repeating.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class RedditSpider(CrawlSpider):
    name = 'pagalworld'
    allowed_domains = ["pagalworld.me"]
    start_urls = ['https://pagalworld.me/category/11598/Latest%20Bollywood%20Hindi%20Mp3%20Songs%20-%202017.html']
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//div/ul'),
            follow=True,
            callback='parse_start_url'),
    )

    def parse_start_url(self, response):
        songName = response.xpath('//li/b/a/text()').extract()

        for item in songName:
            yield {"songName": item,
                   "URL": response.url}
Solution

Everything seems to be correct with your spider. However, if you look at the song page, it offers two versions of each song:

$ scrapy shell "https://pagalworld.me/files/12450/Babumoshai%20Bandookbaaz%20(2017)%20Movie%20Mp3%20Songs.html"
>[1]: response.xpath('//li/b/a/text()').extract()
<[1]:
['03 Aye Saiyan - Babumoshai Bandookbaaz 190Kbps.mp3',
 '03 Aye Saiyan - Babumoshai Bandookbaaz 320Kbps.mp3',
 '01 Barfani - Male (Armaan Malik) 190Kbps.mp3',
 '01 Barfani - Male (Armaan Malik) 320Kbps.mp3',
 '02 Barfani - Female (Orunima Bhattacharya) 190Kbps.mp3',
 '02 Barfani - Female (Orunima Bhattacharya) 320Kbps.mp3']

One version is the lower 190kbps quality and the other is the higher 320kbps quality.
Here you probably want to keep just one of them:

>[2]: response.xpath('//li/b/a/text()[contains(.,"320Kb")]').extract()
<[2]:
['03 Aye Saiyan - Babumoshai Bandookbaaz 320Kbps.mp3',
 '01 Barfani - Male (Armaan Malik) 320Kbps.mp3',
 '02 Barfani - Female (Orunima Bhattacharya) 320Kbps.mp3']
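If you prefer to filter after extraction rather than inside the XPath, the same result can be had with a plain comprehension. A minimal sketch, using a hard-coded sample of the names extracted above:

```python
# Sample of the extracted names; each song appears in two bitrates.
song_names = [
    '03 Aye Saiyan - Babumoshai Bandookbaaz 190Kbps.mp3',
    '03 Aye Saiyan - Babumoshai Bandookbaaz 320Kbps.mp3',
    '01 Barfani - Male (Armaan Malik) 190Kbps.mp3',
    '01 Barfani - Male (Armaan Malik) 320Kbps.mp3',
]

# Keep only the 320Kbps variant, mirroring the XPath contains() filter.
high_quality = [name for name in song_names if '320Kb' in name]
print(high_quality)
```

Either way works; the XPath filter just avoids pulling the unwanted nodes in the first place.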

Edit: Seems like there are also duplication issues. Try disabling follow=True on your link extractor, since in this case you don't want to follow links from the song pages.

That concludes this article on duplicated Scrapy results; hopefully the answer above is helpful.
