Problem description
I have a huge list of URLs linking to Amazon products, and each URL contains a piece of information I need, called the ASIN number.
I understand that one of the best ways to extract that information is with regular expressions, and I found a pattern in the URLs that could help:
3- https://www.amazon.com/adidas-Game-Mode-Polo-Multi-Sport/gp/B07R23QGH6/ref=sr_1_fkmr2_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr
The respective ASIN numbers are:
1- B07P4LVZNL, located in: dp/B07P4LVZNL/ref=sr_1_f
2- B07DXPN7TK, located in: dp/B07DXPN7TK/ref=sr_1_fkmr2_
3- B07R23QGH6, located in: gp/B07R23QGH6/ref=sr_1_fkmr2_
I tried this code, where href is the variable in which I have stored the URLs:
asin = re.match("http[s]?://www.amazon.com(\w+)(.*)/(dp|gp/product)/(?P<asin>\w+).*", href, flags=re.IGNORECASE)
But it doesn't work quite well; this is the type of result I get:
<re.Match object; span=(0, 175), match='https://www.amazon.com/adidas-Originals-Solid-Mel>
<re.Match object; span=(0, 171), match='https://www.amazon.com/adidas-Game-Mode-Polo-Mult>
<re.Match object; span=(0, 167), match='https://www.amazon.com/adidas-Tech-Tee-Black-X-La>
Thank you for your help.
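I suggest using

    /[dg]p/([^/]+)

It matches /dp/ or /gp/ and then captures into Group 1 any one or more characters other than /. In Python:

    import re

    # href holds a single product URL, as in the question
    asin = re.search(r'/[dg]p/([^/]+)', href, flags=re.IGNORECASE)
    if asin:
        print(asin.group(1))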
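
Since the question mentions a large list of URLs, the same pattern can be wrapped in a small helper and applied to every entry. This is only a sketch: the function name extract_asin, the constant ASIN_RE and the sample list are made up for illustration, and the first two URLs are hypothetical placeholders built from the path fragments quoted in the question.

```python
import re

# Same pattern as in the answer, compiled once so it can be reused over a long list.
ASIN_RE = re.compile(r'/[dg]p/([^/]+)', flags=re.IGNORECASE)

def extract_asin(url):
    """Return the path segment that follows /dp/ or /gp/, or None if there is no match."""
    match = ASIN_RE.search(url)
    return match.group(1) if match else None

# Hypothetical sample list: the first two entries are placeholder URLs,
# the last one has no /dp/ or /gp/ segment at all.
urls = [
    "https://www.amazon.com/example-product/dp/B07P4LVZNL/ref=sr_1_f",
    "https://www.amazon.com/example-product/dp/B07DXPN7TK/ref=sr_1_fkmr2_",
    "https://www.amazon.com/adidas-Game-Mode-Polo-Multi-Sport/gp/B07R23QGH6/ref=sr_1_fkmr2_2?dchild=1",
    "https://www.amazon.com/",
]

asins = [extract_asin(u) for u in urls]
print(asins)  # ['B07P4LVZNL', 'B07DXPN7TK', 'B07R23QGH6', None]
```

Note that the pattern captures whatever follows /dp/ or /gp/ up to the next slash, so a URL of the form /gp/product/<ASIN> would yield "product" rather than the ASIN. If such links appear in the list, the pattern would need to be adjusted.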