Problem description
I am trying to parse this HTML to get the item title (e.g. Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW):
<div style="" class="">
<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about </span>Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW</h1>
<h2 id="subTitle" class="it-sttl">
Brand New + Free Shipping, Satisfaction Guaranteed! </h2>
<!-- DO NOT change linkToTagId="rwid" as the catalog response has this ID set -->
<div class="vi-hdops-three-clmn-fix">
<div style="" class="vi-notify-new-bg-wrapper">
<div class="vi-notify-new-bg-dTop" style=""> </div>
<div id="vi_notification_new" class="vi-notify-new-bg-dBtm" style="top: -28px;">
<img src="https://ir.ebaystatic.com/rs/v/tnj4p1myre1mpff12w4j1llndmc.png" width="11" height="12" class="vi-notify-new-img" alt="Popular">
<span style="font-weight:bold;">5 sold in last 24 hours</span>
</div>
</div>
</div>
</div>
I am using the following code to parse the page:
import requests
from bs4 import BeautifulSoup

url1 = "https://www.ebay.com/itm/Big-Boss-Air-Fryer-Healthy-1300-Watt-Super-Sized-16-Quart-Fryer-5-Colors-NEW/122454150244?epid=2254405949&hash=item1c82d60c64:m:mqfT2XbgveSevmN5MV1iysg"

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(plain_text, "html.parser")
    for item in soup.findAll('h1', {'class': 'it-ttl'}):
        print(item.string)  # prints None; item.text works but includes "Details about "

get_single_item_data(url1)
When I do this, BeautifulSoup returns None.
One solution I found is to use print(item.text) instead, but now I get 'Details about Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW' (I do not want 'Details about ').
Is there an efficient way to get the item title without having to get the full text and then strip off the 'Details about ' prefix?
Recommended answer
This is because of a caveat of the .string attribute: since the h1 element contains multiple children (the hidden span and a text node), .string cannot be defined and defaults to None.
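A minimal sketch of that behaviour, assuming Beautiful Soup 4 is installed and using two made-up headings rather than the real eBay page: with a single text child .string returns the text, while with a span plus a text node it returns None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration: one heading with a single child,
# one with a <span> followed by a bare text node (like the eBay title).
single = BeautifulSoup("<h1>Only text</h1>", "html.parser").h1
mixed = BeautifulSoup("<h1><span>Details about </span>Title</h1>", "html.parser").h1

print(single.string)     # the lone text child
print(mixed.string)      # None, because there are two children
print(mixed.get_text())  # all descendant text, prefix included
```

get_text() (or .text) concatenates every descendant string, which is why the 'Details about ' prefix shows up in the question.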
To avoid having to cut off the "Details about " part yourself, you can get the first text node in non-recursive mode:
soup.find('h1', {'class':'it-ttl'}).find(text=True, recursive=False)
Demo:
In [3]: soup = BeautifulSoup(data, "html.parser")
In [4]: print(soup.find('h1', {'class':'it-ttl'}).find(text=True, recursive=False))
Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW
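For a version that can be run without hitting the network, here is a small sketch under a few assumptions: the helper name extract_title is made up for this example, the HTML constant is a trimmed copy of the h1 from the question, and the string= keyword is used, which is the spelling in Beautiful Soup 4.4.0 and later (text= is the older name for the same argument). It joins all direct text children of the heading and returns None when the heading is missing.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the heading from the question, used as test input.
HTML = (
    '<h1 class="it-ttl" itemprop="name" id="itemTitle">'
    '<span class="g-hdn">Details about  </span>'
    'Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, '
    'Fryer 5 Colors -NEW</h1>'
)

def extract_title(html):
    """Return the heading's own text (hypothetical helper), or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1", {"class": "it-ttl"})
    if heading is None:
        return None
    # recursive=False limits the search to direct children, so the text
    # inside the hidden <span> is skipped.
    own_text = heading.find_all(string=True, recursive=False)
    return "".join(own_text).strip() or None

title = extract_title(HTML)
print(title)
```

Checking for None before reading the text keeps the helper from raising AttributeError if the page layout changes and the h1 is no longer there.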