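I'm trying to do some web scraping with BeautifulSoup, and I need to extract the headlines from this webpage, specifically from the "More" headlines section. This is the code I've tried so far.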

import requests
from bs4 import BeautifulSoup
from csv import writer

response = requests.get('https://www.cnbc.com/finance/?page=1')

soup = BeautifulSoup(response.text,'html.parser')

posts = soup.find_all(id='pipeline')

for post in posts:
    data = post.find_all('li')
    for entry in data:
        title = entry.find(class_='headline')
        print(title)

Running this code gives me all the headlines on the page, in the following output format:
<div class="headline">
<a class=" " data-nodeid="105372063" href="/2018/08/02/after-apple-rallies-to-1-trillion-even-the-uber-bullish-crowd-on-wal.html">
           {{{*HEADLINE TEXT HERE*}}}
</a> </div>

However, if I call the get_text() method when fetching the title in the code above, I only get the first two headlines.
title = entry.find(class_='headline').get_text()

Followed by this error:
Traceback (most recent call last):
  File "C:\Users\Tanay Roman\Documents\python projects\scrapper.py", line 16, in <module>
    title = entry.find(class_='headline').get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

Why does adding the get_text() method return only partial results, and how can I fix it?

Best answer

You are misreading the error message. It is not that the .get_text() call returns a NoneType object; rather, objects of type NoneType don't have that method.

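There is only ever one object of type NoneType, the value None. Here it was returned by entry.find(class_='headline'), because it could not find an element inside entry that matches the search criteria. In other words, for that entry element there is no descendant element with the class headline.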
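
To see this in isolation, here is a minimal sketch on made-up markup. The HTML string, the ad-slot id and the sponsored class are assumptions for illustration only, not the real CNBC page; the point is just that find() gives back None for an <li> that has no headline inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page:
# the second <li> has no element with class "headline".
html = """
<ul id="pipeline">
  <li><div class="headline"><a href="/a.html">First story</a></div></li>
  <li id="ad-slot"><div class="sponsored">Ad</div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns a Tag when something matches, and None otherwise.
results = [li.find(class_="headline") for li in soup.find_all("li")]

print(results[0] is None)         # False
print(results[1] is None)         # True
print(type(results[1]).__name__)  # NoneType
```

Calling results[1].get_text() at this point would raise the same AttributeError as in the traceback above.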

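There are two such <li> elements, one with the id nativedvriver3 and the other with nativedvriver9, and both trigger that error. You need to check first whether there is a matching element: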

for entry in data:
    headline = entry.find(class_='headline')
    if headline is not None:
        title = headline.get_text()
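
For a version you can run without hitting the network, the same check might be wrapped up roughly like this. The extract_titles helper and the sample markup are hypothetical names made up for this sketch; only the find_all / find / get_text calls come from the code above.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Collect headline text, skipping <li> entries without a headline."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for post in soup.find_all(id="pipeline"):
        for entry in post.find_all("li"):
            headline = entry.find(class_="headline")
            if headline is None:
                # No matching element in this <li> (e.g. an ad slot): skip it.
                continue
            titles.append(headline.get_text(strip=True))
    return titles


# Hypothetical sample markup, not the real page.
sample_html = """
<ul id="pipeline">
  <li><div class="headline"><a href="/a.html">
      First story
  </a> </div></li>
  <li id="ad-slot"><div class="sponsored">Ad</div></li>
  <li><div class="headline"><a href="/b.html">Second story</a></div></li>
</ul>
"""

titles = extract_titles(sample_html)
print(titles)  # ['First story', 'Second story']
```

Passing strip=True to get_text() also removes the whitespace and newlines that surround the headline text in the markup.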

You would have a much easier time if you used a CSS selector:
headlines = soup.select('#pipeline li .headline')
for headline in headlines:
    headline_text = headline.get_text(strip=True)
    print(headline_text)

This produces:
>>> headlines = soup.select('#pipeline li .headline')
>>> for headline in headlines:
...     headline_text = headline.get_text(strip=True)
...     print(headline_text)
...
Hedge funds fight back against tech in the war for talent
Goldman Sachs sees more price pain ahead for bitcoin
Dish Network shares rise 15% after subscriber losses are less than expected
Bitcoin whale makes ‘enormous’ losing bet, so now other traders have to foot the bill
The 'Netflix of fitness' looks to become a publicly traded stock as soon as next year
Amazon slammed for ‘insult’ tax bill in the UK despite record profits
Nasdaq could plunge 15 percent or more as ‘rolling bear market’ grips stocks: Morgan Stanley
Take-Two shares surge 9% after gamemaker beats expectations due to 'Grand Theft Auto Online'
UK bank RBS announces first dividend in 10 years
Michael Cohen reportedly secured a $10 million deal with Trump donor to advance a nuclear project
After-hours buzz: GPRO, AIG & more
Bitcoin is still too 'unstable' to become mainstream money, UBS says
Apple just hit a trillion but its stock performance has been dwarfed by the other tech giants
The first company to ever reach $1 trillion in market value was in China and got crushed
Apple at a trillion-dollar valuation isn’t crazy like the dot-com bubble
After Apple rallies to $1 trillion, even the uber bullish crowd on Wall Street believes it may need to cool off
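
The reason no None check is needed with select() is that it only returns elements that actually match, so <li> entries without a headline never show up in the result. A small sketch on made-up markup (the HTML string and the ad-slot id are assumptions, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the middle <li> has no .headline element.
html = """
<ul id="pipeline">
  <li><div class="headline"><a href="/a.html">First story</a></div></li>
  <li id="ad-slot"><div class="sponsored">Ad</div></li>
  <li><div class="headline"><a href="/b.html">Second story</a></div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of matching Tag objects and never contains None.
headlines = soup.select("#pipeline li .headline")
headline_texts = [h.get_text(strip=True) for h in headlines]

print(len(headlines))   # 2
print(headline_texts)   # ['First story', 'Second story']
```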

Regarding python-3.x - BeautifulSoup get_text returns NoneType object, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51687872/
