Question
I am trying to download some content using Python's urllib.request. The following command yields an exception:
import urllib.request
print(urllib.request.urlopen("https://fpgroup.foreignpolicy.com/foreign-policy-releases-mayjune-spy-issue/").code)
Result:
...
HTTPError: HTTP Error 403: Forbidden
If I use Firefox or links (a command-line browser), I get the content and a status code of 200. If I use lynx, strangely enough, I also get 403.
I expect all methods to work
- the same way
- successfully

Why is that not the case?
Answer
Most likely the site is blocking people from scraping it. You can get around this at a basic level by including header information along with the request. See here for more info.
Quoted from: https://docs.python.org/3/howto/urllib2.html#headers
import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
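Applied to the URL from the question, a minimal sketch might look like this (a GET request needs no `data`, only the headers; the exact User-Agent string is arbitrary, and any common browser string should work):

```python
import urllib.request

url = "https://fpgroup.foreignpolicy.com/foreign-policy-releases-mayjune-spy-issue/"

# The default urllib User-Agent ("Python-urllib/3.x") is what many sites
# reject with a 403; a browser-like string usually gets past the check.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"}
req = urllib.request.Request(url, headers=headers)

# urllib normalizes header names, so the key is queried as "User-agent".
print(req.get_header("User-agent"))

# Requires network access; with the header set, the site that returned
# 403 to the bare urlopen call should now respond with 200.
# with urllib.request.urlopen(req) as response:
#     print(response.code)
```

Note that `Request` only builds the request object; nothing is sent until it is passed to `urlopen`.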
There are many reasons why people don't want scripts to scrape their websites. For one, it consumes their bandwidth. They don't want people to profit by building a scrape bot. Maybe they don't want you to copy their site's information. You can also think of it as a book: authors want people to read their books, but some of them wouldn't want a robot scanning their books to create an offline copy, or to summarize them.
The second part of your question, in the comment, is too vague and broad to answer here, as there are too many opinionated answers.