Problem description
I am trying to scrape specific text from specific table elements on an Amazon product page.
URL_1 has all elements - https://www.amazon.com/dp/B008Q5LXIE/
URL_2 has only 'Sales Rank' - https://www.amazon.com/dp/B001V9X26S
URL_1: The "Product Details" table has 9 items, and I am only interested in 'Product Dimensions', 'Shipping Weight', 'Item Model Number', and all of the 'Seller's Rank' entries.
I am not able to parse out the text of these items because some of them sit together in one block of markup while others do not.
I am using BeautifulSoup. Calling text.strip() on the whole table returns everything, but the result is very messy. I have tried soup.find('li') with text.strip() to pick out individual elements, but for the seller rank it returns all 3 ranks jumbled together in a single string. I have also tried cleaning the text with regex, but that does not work across the 4 different seller ranks. I have had success with the try/except/pass approach for scraping, and would like each of these fields handled in that format.
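A bad example of the code I used; I was trying to get the sales rank text that follows the closing </b> tag in the HTML:
#Sales Rank
sales_rank = 'NOT'
try:
    sr = soup.find('li', attrs={'id': 'SalesRank'})
    # '/b' is not a tag name, so find() returns None here and .text raises
    # AttributeError, which the bare except swallows; sales_rank stays 'NOT'.
    sales_rank = sr.find('/b').text.strip()
except:
    pass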
I expect to be able to scrape the listed elements into a dictionary. I would like to see the results as
dimensions = 6x4x4
weight = 4.8 ounces
Item_No = IT-DER0-IQDU
R1_NO = 2,036
R1_CAT = Health & Household
R2_NO = 5
R2_CAT = Joint & Muscle Pain Relief Medications
R3_NO = 3
R3_CAT = Naproxen Sodium
R4_NO = 6
R4_CAT = Migraine Relief
my_dict = {'dimensions': dimensions, 'weight': weight, 'Item_No': Item_No, 'R1_NO': R1_NO, 'R1_CAT': R1_CAT, 'R2_NO': R2_NO, 'R2_CAT': R2_CAT, 'R3_NO': R3_NO, 'R3_CAT': R3_CAT, 'R4_NO': R4_NO, 'R4_CAT': R4_CAT}
URL_2: The only element of interest on this page is 'Sales Rank'; 'Product Dimensions', 'Shipping Weight', and 'Item Model Number' are not present. I would still like a result shaped like the one for URL_1, with a value of 'NA' for each missing element. I have managed this before by setting a default value before the try/except statement. Ex: Shipping Weight = 'NA' ... then run try/except: pass, so I get 'NA' and my dictionary is not empty.
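The default-then-overwrite pattern described above can be sketched without any scraping at all. This is only an illustration: the FIELDS list and the build_record helper are hypothetical names, and the sample values are borrowed from the expected output listed earlier.

```python
# Hypothetical key names, taken from the desired output in the question.
FIELDS = ['dimensions', 'weight', 'Item_No', 'R1_NO', 'R1_CAT']


def build_record(scraped):
    """Return a dict holding every expected key, with 'NA' for missing values."""
    record = dict.fromkeys(FIELDS, 'NA')  # every key starts out as 'NA'
    # Overwrite only the keys that were actually found and are non-empty.
    record.update({k: v for k, v in scraped.items() if k in record and v})
    return record


# Simulates a page where only the sales rank could be scraped.
partial_record = build_record({'R1_NO': '2,036', 'R1_CAT': 'Health & Household'})
print(partial_record)
```

Because the defaults are filled in first, the dictionary always has the same keys regardless of which elements the page happens to contain.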
Recommended answer
You could use stripped_strings and the :contains pseudo-class with bs4 4.7.1. Getting the desired output format takes a fair amount of fiddling, and someone with more Python experience could surely shorten this and make it more efficient. The dict-merging syntax is taken from @aaronhall.
import re

import requests
from bs4 import BeautifulSoup as bs

links = ['https://www.amazon.com/Professional-Dental-Guard-Remoldable-Customizable/dp/B07L4YHBQ4', 'https://www.amazon.com/dp/B0040ODFK4/?tag=stackoverfl08-20']

# Maps the labels shown on the page to the keys wanted in the output.
map_dict = {
    'Product Dimensions': 'dimensions',
    'Shipping Weight': 'weight',
    'Item model number': 'Item_No',
    'Amazon Best Sellers Rank': ['R1_NO', 'R1_CAT']
}

for link in links:
    r = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
    soup = bs(r.content, 'lxml')
    fields = ['Product Dimensions', 'Shipping Weight', 'Item model number', 'Amazon Best Sellers Rank']
    temp_dict = {}

    for field in fields:
        # :contains matches bs4 4.7.1; soupsieve 2.1+ spells it :-soup-contains.
        element = soup.select_one('li:contains("' + field + '")')
        if element is None:
            temp_dict[field] = 'N/A'
        else:
            if field == 'Amazon Best Sellers Rank':
                # Second string is e.g. '#2,036 in Health & Household ('; split it into [number, category].
                item = [re.sub(r'#|\(', '', string).strip() for string in element.stripped_strings][1].split(' in ')
                temp_dict[field] = item
            else:
                # First string is the bold label, second is the value.
                item = [string for string in element.stripped_strings][1]
                temp_dict[field] = item.replace('(', '').strip()

    # Sub-category ranks live in their own elements.
    ranks = soup.select('.zg_hrsr_rank')
    ladders = soup.select('.zg_hrsr_ladder')

    if ranks:
        cat_nos = [item.text.split('#')[1] for item in ranks]
    else:
        cat_nos = ['N/A']

    if ladders:
        cats = [item.text.split('\xa0')[1] for item in ladders]
    else:
        cats = ['N/A']

    # A list of pairs rather than a dict, so two categories that share the same rank number are both kept.
    rankings = list(zip(cat_nos, cats))

    final_dict = {}

    for k, v in temp_dict.items():
        if k == 'Amazon Best Sellers Rank' and v != 'N/A':
            item = dict(zip(map_dict[k], v))
            final_dict = {**final_dict, **item}
        elif k == 'Amazon Best Sellers Rank' and v == 'N/A':
            item = dict(zip(map_dict[k], [v, v]))
            final_dict = {**final_dict, **item}
        else:
            final_dict[map_dict[k]] = v

    # R1 is the overall rank, so sub-category ranks are numbered from R2.
    for k, (cat_no, cat) in enumerate(rankings):
        prefix = 'R' + str(k + 2) + '_'
        final_dict[prefix + 'NO'] = cat_no
        final_dict[prefix + 'CAT'] = cat

    print(final_dict)
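Since the live page cannot be relied on to stay the same, the two core steps (reading the string after the bold label via stripped_strings, then splitting a rank string into number and category) can be tried offline. The following is a minimal sketch under stated assumptions: SAMPLE_HTML is hypothetical, simplified markup rather than Amazon's actual page, field_text and split_rank are illustrative helper names, and a plain loop over li elements stands in for the :contains selector so the sketch does not depend on the soupsieve version.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page is larger and changes often.
SAMPLE_HTML = """
<ul>
  <li><b>Shipping Weight:</b> 4.8 ounces (<a href="#">View shipping rates</a>)</li>
  <li id="SalesRank"><b>Amazon Best Sellers Rank:</b>
    #2,036 in Health &amp; Household (<a href="#">See Top 100</a>)</li>
</ul>
"""

# '#', then the rank digits, then ' in ', then the category up to any '('.
RANK_RE = re.compile(r'#([\d,]+) in ([^(]+)')


def field_text(soup, label):
    """Return the first string after the bold label, or 'NA' if the label is absent."""
    for li in soup.find_all('li'):
        strings = list(li.stripped_strings)
        if len(strings) > 1 and strings[0].startswith(label):
            return strings[1]
    return 'NA'


def split_rank(text):
    """Split a string such as '#2,036 in Health & Household (' into (number, category)."""
    match = RANK_RE.search(text)
    if match is None:
        return 'NA', 'NA'
    return match.group(1), match.group(2).strip()


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
weight = field_text(soup, 'Shipping Weight').replace('(', '').strip()
r1_no, r1_cat = split_rank(field_text(soup, 'Amazon Best Sellers Rank'))
model = field_text(soup, 'Item model number')  # not in the sample, so 'NA'
print(weight, r1_no, r1_cat, model)
```

The text that follows the closing </b> tag is simply the second item yielded by stripped_strings, which is why no lookup on the tag itself is needed.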