Problem description
I am trying to get a list of articles using a combination of the googlesearch and newspaper3k Python packages. When I call article.parse(), I get this error: newspaper.article.ArticleException: Article `download()` failed with 403 Client Error: Forbidden for url: https://www.newsweek.com/donald-trump-hillary-clinton-2020-rally-orlando-1444697 on URL https://www.newsweek.com/donald-trump-hillary-clinton-2020-rally-orlando-1444697
I have tried running the script as an administrator, and the link works when I open it directly in a browser.
Here is my code:
import googlesearch
from newspaper import Article

query = "trump"

# Collect news result URLs for the query.
urlList = []
for j in googlesearch.search_news(query, tld="com", num=500, stop=200, pause=.01):
    urlList.append(j)
print(urlList)

# Download and parse each article, keeping its text.
articleList = []
for i in urlList:
    article = Article(i)
    article.download()
    article.html
    article.parse()  # raises ArticleException if the download failed
    articleList.append(article.text)
    print(article.text)
Here is my full error output:
Traceback (most recent call last):
  File "C:/Users/andre/PycharmProjects/StockBot/WebCrawlerTest.py", line 31, in <module>
    article.parse()
  File "C:\Users\andre\AppData\Local\Programs\Python\Python37\lib\site-packages\newspaper\article.py", line 191, in parse
    self.throw_if_not_downloaded_verbose()
  File "C:\Users\andre\AppData\Local\Programs\Python\Python37\lib\site-packages\newspaper\article.py", line 532, in throw_if_not_downloaded_verbose
    (self.download_exception_msg, self.url))
newspaper.article.ArticleException: Article `download()` failed with 403 Client Error: Forbidden for url: https://www.newsweek.com/donald-trump-hillary-clinton-2020-rally-orlando-1444697 on URL https://www.newsweek.com/donald-trump-hillary-clinton-2020-rally-orlando-1444697
I expected it to just output the text of the article. Any help you can give would be great. Thanks!
Recommended answer
I got it to work by changing the user agent, which suggests the site rejects requests that carry the library's default one:
from newspaper import Article
from newspaper import Config
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent
page = Article("https://www.newsweek.com/donald-trump-hillary-clinton-2020-rally-orlando-1444697", config=config)
page.download()
page.parse()
print(page.text)
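To apply the same fix to the loop from the question, something like the following might work. It is only a sketch: the names USER_AGENT, make_newspaper_article and collect_texts are made up for illustration, and the helper assumes each article object exposes download(), parse() and text as newspaper's Article does. It also skips any URL that still fails instead of stopping the whole run.

```python
# Assumed browser-like user agent string (the one used above); other
# browser strings may work as well.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
)


def make_newspaper_article(url):
    """Hypothetical factory: build an Article that sends USER_AGENT."""
    # Imported here so collect_texts can also be used with other factories.
    from newspaper import Article, Config

    config = Config()
    config.browser_user_agent = USER_AGENT
    return Article(url, config=config)


def collect_texts(urls, make_article=make_newspaper_article):
    """Return the text of every article that downloads and parses cleanly."""
    texts = []
    for url in urls:
        article = make_article(url)
        try:
            article.download()
            article.parse()
        except Exception as exc:
            # Deliberately broad: with newspaper this is an ArticleException
            # (for example the 403 above), and one bad URL should not end
            # the loop.
            print(f"Skipping {url}: {exc}")
            continue
        texts.append(article.text)
    return texts
```

With the urlList from the question, the call would be articleList = collect_texts(urlList). Sites that block on more than the user agent may still return 403, in which case those URLs are simply reported and skipped.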