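I'm new to bs4 and I'm trying to scrape some information about Amazon products for a university assignment. Specifically, I'm trying to extract the product categories from the HTML page. I tried to extract them as shown in the code further down, but I get an empty array.
I need to extract: Grocery & Gourmet Food, Candy & Chocolate, Jelly Beans & Gummy Candy, Licorice
This is the part of the page I want to scrape, but I don't know how to access it: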
<div id="wayfinding-breadcrumbs_container" class="a-section a-spacing-none a-padding-medium breadcrumb-fst-exp-1 fst-breadcrumb-feature">
<ul class="a-unordered-list a-horizontal a-size-small">
<li><span class="a-list-item">
<a class="a-link-normal" href="/grocery-breakfast-foods-snacks-organic/b/ref=dp_bc_aui_T1_1?ie=UTF8&node=16310101">
Grocery & Gourmet Food
</a>
</span></li>
<li><span class="a-list-item">
<a class="a-link-normal" href="/Candy-Chocolate/b/ref=dp_bc_aui_T1_2?ie=UTF8&node=16322461">
Candy & Chocolate
</a>
</span></li>
<li><span class="a-list-item">
<a class="a-link-normal" href="/b/ref=dp_bc_aui_T1_3?ie=UTF8&node=17369013011">
Jelly Beans & Gummy Candy
</a>
</span></li>
<li><span class="a-list-item">
<a class="a-link-normal" href="/Licorice-Candy/b/ref=dp_bc_aui_T1_4?ie=UTF8&node=16322521">
Licorice
</a>
</span></li>
</ul>
</div>
import requests
from bs4 import BeautifulSoup
url = "https://www.amazon.com/dp/" + 'B001GVISJM'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
for divtag in soup.find_all("div", attr={"id" : "wayfinding-breadcrumbs_container"}):
    print(divtag)
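Best answer
You can do it as shown below. To find an element by id, you can pass id directly as a keyword argument to find_all instead of passing it inside attrs.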
from bs4 import BeautifulSoup
t = '''
<div id="wayfinding-breadcrumbs_container" class="a-section a-spacing-none a-padding-medium breadcrumb-fst-exp-1 fst-breadcrumb-feature">
<ul class="a-unordered-list a-horizontal a-size-small">
<li><span class="a-list-item">
<a class="a-link-normal" href="/grocery-breakfast-foods-snacks-organic/b/ref=dp_bc_aui_T1_1?ie=UTF8&node=16310101">
Grocery & Gourmet Food
</a>
</span></li>
<li><span class="a-list-item">
<a class="a-link-normal" href="/Candy-Chocolate/b/ref=dp_bc_aui_T1_2?ie=UTF8&node=16322461">
Candy & Chocolate
</a>
</span></li>
<li><span class="a-list-item">
<a class="a-link-normal" href="/b/ref=dp_bc_aui_T1_3?ie=UTF8&node=17369013011">
Jelly Beans & Gummy Candy
</a>
</span></li>
<li><span class="a-list-item">
<a class="a-link-normal" href="/Licorice-Candy/b/ref=dp_bc_aui_T1_4?ie=UTF8&node=16322521">
Licorice
</a>
</span></li>
</ul>
</div>
'''
soup = BeautifulSoup(t, 'html.parser')
for divtag in soup.find_all(id="wayfinding-breadcrumbs_container"):
    for d in divtag.find_all(attrs={'class': 'a-link-normal'}):
        print(d.get_text().strip())
Output:
Grocery & Gourmet Food
Candy & Chocolate
Jelly Beans & Gummy Candy
Licorice
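For comparison, here is a minimal sketch of why the call in the question likely comes back empty, and of a CSS-selector alternative. It assumes a trimmed-down copy of the breadcrumb markup (the `html` string and the variable names are made up for illustration), and it does not address whether Amazon actually serves this markup to `requests`. As far as I can tell, the question's keyword is spelled `attr` rather than `attrs`, so bs4 treats it as a filter on an HTML attribute literally named `attr`, which the div does not have.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened copy of the breadcrumb markup from the question.
html = '''
<div id="wayfinding-breadcrumbs_container">
<ul>
<li><span class="a-list-item"><a class="a-link-normal" href="#"> Grocery &amp; Gourmet Food </a></span></li>
<li><span class="a-list-item"><a class="a-link-normal" href="#"> Licorice </a></span></li>
</ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# Misspelled keyword: `attr` is taken as the name of an HTML attribute to
# filter on, and the div has no attribute called "attr", so nothing matches.
typo = soup.find_all("div", attr={"id": "wayfinding-breadcrumbs_container"})

# Two equivalent correct spellings: the `attrs` dict, or `id` as a keyword.
by_attrs = soup.find_all("div", attrs={"id": "wayfinding-breadcrumbs_container"})
by_kwarg = soup.find_all("div", id="wayfinding-breadcrumbs_container")

# Alternative: one CSS selector for the links inside the container.
links = soup.select("#wayfinding-breadcrumbs_container a.a-link-normal")
categories = [a.get_text(strip=True) for a in links]

print(len(typo), len(by_attrs), len(by_kwarg))
print(categories)
```

The `select` form collapses the two nested loops into a single query; whether that reads better than nested `find_all` calls is mostly a matter of taste.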
Regarding "python - Beautifulsoup4 error, selecting multiple attributes", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56449078/