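I am trying to scrape ASIN numbers on Amazon. Note that this is not about the product detail page (as in, for example, https://www.youtube.com/watch?v=qRVRIh3GZgI), but about the results page you get when you search for a keyword (in this example "trimmer"; try it here:
https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2). The result is a list of many products, and I am able to scrape all the titles.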
What is not visible is the ASIN (a unique Amazon identifier). When inspecting the HTML, I see a link (href) that contains the ASIN. In the example below, ASIN = B01MSHQ5IQ
<a class="a-link-normal a-text-normal" href="/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer&qid=1554462204&s=gateway&sr=8-3">
So my question is: how can I get all the product titles and ASINs on the page? For example:
Number Title ASIN
1 Braun, Beardtrimmer B07JH1LLYR
2 TNT Pro Series Waist B00R84J2PK
... ... ...
So far I am using Scrapy (but I am open to other Python solutions), and I am able to scrape the titles.
My code so far:
First, run on the command line:
scrapy startproject tutorial
Then adjust the spider file (see Example 1) and items.py (see Example 2).
Example 1
import scrapy
# assumes the project created by "scrapy startproject tutorial" above
from tutorial.items import AmazonItem

class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]
    # Use working product URL below
    start_urls = [
        "https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2"
    ]

    ## scrapy crawl AmazonDeals -o Asin_Titles.json
    def parse(self, response):
        items = AmazonItem()
        # collects the text of every element with class a-text-normal into one list
        Title = response.css('.a-text-normal').css('::text').extract()
        items['title_Products'] = Title
        yield items
As requested by @glhr, adding the items.py code:
Example 2
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title_Products = scrapy.Field()
Best answer
You can get each product's link by extracting the href attribute of <a class="a-link-normal a-text-normal" href="...">:
Link = response.css('.a-text-normal').css('a::attr(href)').extract()
From the link, you can extract the ASIN with a regular expression:
(?<=dp/)[A-Z0-9]{10}
The regular expression above matches 10 characters (uppercase letters or digits) that are preceded by dp/. See the demo here: https://regex101.com/r/mLMv3k/1
Here is a working implementation of the parse() method:
# requires "import re" at the top of the spider module
def parse(self, response):
    Link = response.css('.a-text-normal').css('a::attr(href)').extract()
    Title = response.css('span.a-text-normal').css('::text').extract()
    # for each product, create an AmazonItem, populate the fields and yield the item
    for result in zip(Link, Title):
        item = AmazonItem()
        item['title_Product'] = result[1]
        item['link_Product'] = result[0]
        # extract ASIN from link
        ASIN = re.findall(r"(?<=dp/)[A-Z0-9]{10}", result[0])[0]
        item['ASIN_Product'] = ASIN
        yield item
This requires extending AmazonItem with the new fields:
class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title_Product = scrapy.Field()
    link_Product = scrapy.Field()
    ASIN_Product = scrapy.Field()
Sample output:
{'ASIN_Product': 'B01MSHQ5IQ',
 'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
 'title_Product': 'Philips Norelco Multigroom Series 3000, 13 attachments, '
                  'FFP, MG3750'}
{'ASIN_Product': 'B01MSHQ5IQ',
 'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
 'title_Product': 'Philips Norelco Multi Groomer MG7750/49-23 piece, beard, '
                  'body, face, nose, and ear hair trimmer, shaver, and clipper'}
Demo: https://repl.it/@glhr/55534679-AmazonSpider
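The lookbehind pattern can also be checked on its own, without running Scrapy. The sketch below is only an illustration: the helper names extract_asin and number_rows are hypothetical, and it assumes that some links on a results page may not contain a dp/ segment, in which case indexing the result of re.findall with [0] would raise an IndexError. It returns None for such links and numbers the remaining rows in the Number / Title / ASIN layout asked for in the question.

```python
import re

# Same pattern as in the answer: 10 uppercase letters or digits preceded by "dp/".
ASIN_PATTERN = re.compile(r"(?<=dp/)[A-Z0-9]{10}")


def extract_asin(href):
    """Return the ASIN found in href, or None if the link has no dp/ segment."""
    match = ASIN_PATTERN.search(href)
    return match.group(0) if match else None


def number_rows(pairs):
    """Turn (title, href) pairs into (number, title, asin) rows.

    Links without an ASIN are skipped, so the numbering stays contiguous.
    """
    rows = []
    for title, href in pairs:
        asin = extract_asin(href)
        if asin is not None:
            rows.append((len(rows) + 1, title, asin))
    return rows


if __name__ == "__main__":
    # The first href is the one quoted in the question; the second is a
    # made-up link without a dp/ segment, used only to show the skipped case.
    sample = [
        ("Philips Norelco Multigroom",
         "/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer"),
        ("Hypothetical non-product link", "/gp/help/customer/display.html"),
    ]
    for number, title, asin in number_rows(sample):
        print(number, title, asin)
```

Inside the spider, the same guard could replace the [0] indexing so that one unexpected link does not abort the whole parse() call.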
To write the output to a JSON file, just specify the feed export settings in the spider:
class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]
    start_urls = ["https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2"]
    custom_settings = {
        'FEED_URI': 'Asin_Titles.json',
        'FEED_FORMAT': 'json'
    }
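On newer Scrapy releases (2.1 and later, as far as I know), FEED_URI and FEED_FORMAT are deprecated in favour of a single FEEDS setting that maps each output path to its options. A sketch of the equivalent custom_settings value, assuming such a Scrapy version is installed; it would replace the dictionary inside the spider class:

```python
# Assumed equivalent of the FEED_URI / FEED_FORMAT pair for Scrapy 2.1+:
# FEEDS maps each output URI to a dict of per-feed options.
custom_settings = {
    "FEEDS": {
        "Asin_Titles.json": {"format": "json"},
    },
}
```

On older versions the FEED_URI / FEED_FORMAT pair from the answer is the one to use.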
Source: the Stack Overflow question on scraping ASINs from Amazon's "search" page with Python: https://stackoverflow.com/questions/55534679/