Problem description
I have data from two websites, and I want to run some analysis on it.
Each product name has the form brand + product name, and I want to extract only the brand name.
http://www.thehut.com/jeans-clothing/men/clothing/brave-soul-men-s-cardiff-jeans-denim/10741907.html
On the page above, the product name is
Brave Soul Men's Swansea Jeans - Denim
The brand name is
Brave Soul
So I want only
Brave Soul
Amazon link
http://www.amazon.in/gp/product/B00L8WT2UI
Similarly, on the page above, the product name is
Apple iPhone 5c (White, 8GB)
The brand name is
Apple
So I want output like
Brave Soul
Apple
Recommended answer
The information you're trying to get isn't actually there.
If you take two strings, each of which may contain any number of spaces, and join them with a space, it's no longer possible to tell unambiguously which space joined the two strings and which spaces were part of the strings.
So, you have a few options:
First, there aren't that many spaces in each product name, so you can just try all the possibilities: brand "Brave" and product "Soul Men's Swansea Jeans - Denim", then brand "Brave Soul" and product "Men's Swansea Jeans - Denim", then brand "Brave Soul Men's" and product "Swansea Jeans - Denim", and so on for the other 3 possibilities.
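The enumeration described above can be sketched in a few lines of Python. This is only an illustration: the function name `candidate_splits` is hypothetical, and it assumes the brand and product are separated at a whitespace boundary and that both parts are non-empty.

```python
def candidate_splits(name):
    """Return every (brand, product) pair obtained by cutting at one space."""
    words = name.split()
    # Cut after word 1, after word 2, ... so both halves are non-empty.
    return [
        (" ".join(words[:i]), " ".join(words[i:]))
        for i in range(1, len(words))
    ]


if __name__ == "__main__":
    for brand, product in candidate_splits("Brave Soul Men's Swansea Jeans - Denim"):
        print(f"brand={brand!r}  product={product!r}")
```

For the seven-token name in the question this yields six candidates, matching the three listed above plus "the other 3".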
Second, if you can scrape a list of all brand names from somewhere else and stash them in a set (or a database table or whatever), you can pre-filter the possibilities before trying them all in comparatively slow web requests to Amazon. For example, if you have a list of all the brands, just check which of "Brave", "Brave Soul", "Brave Soul Men's", "Brave Soul Men's Swansea", etc. are actual brands, and only test those.
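A minimal sketch of that pre-filtering step follows. The brand set here is a made-up two-entry placeholder (in practice it would be scraped or loaded from a database, as suggested above), and `brand_candidates` is a hypothetical helper name.

```python
# Placeholder data: a real list would come from a scrape or a database.
KNOWN_BRANDS = {"Brave Soul", "Apple"}


def brand_candidates(name, known_brands):
    """Return the leading word sequences of `name` that are known brands."""
    words = name.split()
    # Only proper prefixes are considered, so some product text always remains.
    prefixes = (" ".join(words[:i]) for i in range(1, len(words)))
    # Set membership is a cheap local check, done before any web request.
    return [p for p in prefixes if p in known_brands]


if __name__ == "__main__":
    print(brand_candidates("Brave Soul Men's Swansea Jeans - Denim", KNOWN_BRANDS))
    print(brand_candidates("Apple iPhone 5c (White, 8GB)", KNOWN_BRANDS))
```

Only the prefixes that survive this filter would then need to be checked with the slower web requests.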
Meanwhile, this still isn't going to be perfect, because there are almost certainly cases that are ambiguous. For example, there's a brand "Apple" and also a brand "Apple Records", so what happens when you try to split up "Apple Records Master Collection"? You've got two valid possibilities, not just one. All you can do is design your code to deal with that in some way (and unit test that you did so correctly).
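One way to make the ambiguity explicit is to return every matching brand and apply a tie-breaking policy in a separate step. The sketch below picks the longest match; that policy is an assumption for illustration, not something the answer prescribes, and it would be wrong for an Apple product that happened to be called "Records Master Collection". The function names and the brand set are hypothetical.

```python
def matching_brands(name, known_brands):
    """Return all known brands that are a leading word sequence of `name`."""
    words = name.split()
    prefixes = [" ".join(words[:i]) for i in range(1, len(words))]
    return [p for p in prefixes if p in known_brands]


def pick_brand(name, known_brands):
    """Resolve ambiguity with an assumed policy: the longest match wins."""
    matches = matching_brands(name, known_brands)
    # None signals that no known brand matched, so the caller must decide.
    return max(matches, key=len) if matches else None


if __name__ == "__main__":
    brands = {"Apple", "Apple Records"}  # placeholder data
    name = "Apple Records Master Collection"
    print(matching_brands(name, brands))  # both candidates are reported
    print(pick_brand(name, brands))       # the policy chooses one of them
```

Keeping `matching_brands` separate from `pick_brand` lets unit tests cover the ambiguous case directly and lets the policy be swapped later.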