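I want to extract the content of the What's new section from this page, starting with "In the coming weeks" and ending with "general enhancements".
Inspecting the markup, I see that the <span> is nested under an <li>, which in turn is nested under <ul id="GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B">. For the past few days I have been trying to extract it with Python 3 and BeautifulSoup, without success. The code I tried is pasted below.
Can someone point me in the right direction?
Attempt #1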

from urllib.request import urlopen # open URLs
from bs4 import BeautifulSoup # BS

import sys # sys.exit()

page_url = 'https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS'

try:
    page = urlopen(page_url)
except:
    sys.exit("No internet connection. Program exiting...")

soup = BeautifulSoup(page, 'html.parser')

try:
    for ultag in soup.find_all('ul', {'id': 'GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B'}):
        print(ultag.text)
        for spantag in ultag.find_all('span'):
            print(spantag)
except:
    print("Couldn't get What's new :(")

Attempt #2
from urllib.request import urlopen # open URLs
from bs4 import BeautifulSoup # BS

import sys # sys.exit()

page_url = 'https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS'

try:
    page = urlopen(page_url)
except:
    sys.exit("No internet connection. Program exiting...")

soup = BeautifulSoup(page, 'html.parser')

uls = []
for ul in uls:
    for ul in soup.findAll('ul', {'id': 'GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B'}):
        if soup.find('ul'):
            break
        uls.append(ul)
    print(uls)
    for li in uls:
        print(li.text)

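Ideally, the code should return:
In the coming weeks, you will be able to read items that you own with a single click from the “Before You Go” dialog.
Performance improvements, bug fixes, and other general enhancements.
But neither attempt gives me anything. It seems the ul with that ID cannot be found, yet if you print(soup), everything looks fine: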
<ul id="GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B">
<li>
<span class="a-list-item"><span><strong>Read Now</strong></span>: In the coming weeks, you will be able to read items that you own with a single click from the “Before You Go” dialog.</span></li>

<li>
<span class="a-list-item">Performance improvements, bug fixes, and other general enhancements.<br></li>


</ul>

Best answer

With bs4 4.7.1+, you can use :contains and :has to isolate the elements:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS')
soup = bs(r.content, 'lxml')
text = [i.text.strip() for i in soup.select('p:has(strong:contains("Here’s what’s new:")), p:has(strong:contains("Here’s what’s new:")) + p + ul li')]
print(text)

For now, you could also drop :contains:
text = [i.text.strip() for i in soup.select('p:has(strong), p:has(strong) + p + ul li')]
print(text)
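
The selector can also be tried offline. The snippet below is only a sketch: SAMPLE_HTML is a hypothetical stand-in for the relevant part of the page (the real markup may differ), and it uses the :-soup-contains spelling, which newer Soup Sieve releases prefer over the deprecated :contains. It assumes a reasonably recent bs4/soupsieve install.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; the real markup may differ.
SAMPLE_HTML = """
<div>
  <p><strong>Here's what's new:</strong></p>
  <p>Version placeholder</p>
  <ul>
    <li><span class="a-list-item"><span><strong>Read Now</strong></span>: In the coming weeks, you will be able to read items that you own with a single click.</span></li>
    <li><span class="a-list-item">Performance improvements, bug fixes, and other general enhancements.</span></li>
  </ul>
  <p>Unrelated paragraph.</p>
  <ul><li>Unrelated item</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Find the <p> whose <strong> holds the heading text, step over the next <p>,
# then take every <li> inside the <ul> that immediately follows it.
selector = 'p:has(strong:-soup-contains("what\'s new")) + p + ul li'
items = [li.get_text(strip=True) for li in soup.select(selector)]

for item in items:
    print(item)
```

The second, unrelated list is skipped because its preceding <p> is not itself preceded by the heading paragraph.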

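+ is the CSS adjacent sibling combinator. Read more here. Quote:
Adjacent sibling combinator
The + combinator selects adjacent siblings. This means that the second element directly follows
the first, and both share the same parent.
Syntax: A + B
Example: h2 + p will match all <p> elements that directly follow an <h2>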
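
As a small illustration of the quoted rule, the sketch below runs h2 + p and the chained h2 + p + p against a made-up fragment (the markup and variable names are hypothetical, chosen only to show which paragraphs count as adjacent):

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only.
html = """
<div>
  <h2>Title</h2>
  <p>first</p>
  <p>second</p>
  <div><p>nested</p></div>
  <h2>Another</h2>
  <span>gap</span>
  <p>not adjacent</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Only a <p> whose immediately preceding sibling element is an <h2> matches.
adjacent = [p.get_text() for p in soup.select("h2 + p")]

# Chaining the combinator walks one sibling further: <h2>, then <p>, then <p>.
chained = [p.get_text() for p in soup.select("h2 + p + p")]

print(adjacent)
print(chained)
```

The paragraph after the <span> is not selected, because the <span> sits between it and the <h2>; whitespace between tags does not count as a sibling for this purpose.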

Regarding "python - How to extract text from a <span> nested in a <ul> using BeautifulSoup?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57725818/

10-12 15:22