Problem description
In my subclass of RequestHandler, I am trying to fetch a range of URLs:
import urllib2

import webapp2


class GetStats(webapp2.RequestHandler):
    def post(self):
        lastpage = 50
        heap = []
        for page in range(1, lastpage):
            tmpurl = url + str(page)  # url is the base URL, defined elsewhere
            response = urllib2.urlopen(tmpurl, timeout=5)
            html = response.read()
            # ... parse html with BeautifulSoup, producing result_of_parsing ...
            heap.append(result_of_parsing)
        self.response.write(heap)
But it only works with ~30 URLs (the page takes a long time to load, but it works). With more than 30, I get an error:
Error: Server Error
The server encountered an error and could not complete your request.
Please try again in 30 seconds.
Is there any way to fetch a lot of URLs, perhaps more optimally? Up to several hundred pages?
UPDATE:
I am using BeautifulSoup to parse every single page. I found this traceback in the GAE logs:
Traceback (most recent call last):
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 267, in Handle
    result = handler(dict(self._environ), self._StartResponse)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
    rv = self.router.dispatch(request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
    return route.handler_adapter(request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
    return handler.dispatch()
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
    return method(*args, **kwargs)
  File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 68, in post
    heap = get_times(tmp_url, 160)
  File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 106, in get_times
    soup = BeautifulSoup(html)
  File "libs/bs4/__init__.py", line 168, in __init__
    self._feed()
  File "libs/bs4/__init__.py", line 181, in _feed
    self.builder.feed(self.markup)
  File "libs/bs4/builder/_htmlparser.py", line 56, in feed
    super(HTMLParserTreeBuilder, self).feed(markup)
  File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 114, in feed
    self.goahead(0)
  File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 155, in goahead
    startswith = rawdata.startswith
DeadlineExceededError
Recommended answer
It's failing because you only have 60 seconds to return a response to the user, and I'm going to guess it's taking longer than that.
You will want to use this: https://cloud.google.com/appengine/articles/deferred
to create a task that has a 10-minute timeout. Then you can return instantly to the user, and they can "pick up" the results at a later time via another handler (that you create), as sketched below. If collecting all the URLs takes longer than 10 minutes, you'll have to split them up into further tasks.
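A minimal sketch of that pattern, assuming the first-generation Python 2.7 runtime. FetchResult, fetch_stats, PickUpStats, and BASE_URL are hypothetical names invented for illustration; the question's url and BeautifulSoup parsing go where the comments indicate:

import urllib2

import webapp2
from google.appengine.ext import deferred
from google.appengine.ext import ndb

BASE_URL = 'http://example.com/stats/'  # placeholder for the question's url


class FetchResult(ndb.Model):
    """Holds the parsed results so a later request can pick them up."""
    payload = ndb.TextProperty()
    done = ndb.BooleanProperty(default=False)


def fetch_stats(result_key, base_url, lastpage):
    """Runs on the task queue, where the deadline is 10 minutes, not 60 s."""
    heap = []
    for page in range(1, lastpage):
        response = urllib2.urlopen(base_url + str(page), timeout=5)
        heap.append(response.read())  # parse with BeautifulSoup here instead
    result = result_key.get()
    result.payload = unicode(heap)
    result.done = True
    result.put()


class GetStats(webapp2.RequestHandler):
    def post(self):
        result_key = FetchResult().put()
        # Enqueue the work and return immediately, well inside the 60 s limit.
        deferred.defer(fetch_stats, result_key, BASE_URL, 50)
        self.response.write('job: %s' % result_key.urlsafe())


class PickUpStats(webapp2.RequestHandler):
    def get(self):
        result = ndb.Key(urlsafe=self.request.get('job')).get()
        if result and result.done:
            self.response.write(result.payload)
        else:
            self.response.write('still working, try again later')

Note that the deferred library has to be enabled in app.yaml (builtins: deferred: on), and PickUpStats needs its own route in the WSGIApplication.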
Check this out: https://cloud.google.com/appengine/articles/deadlineexceedederrors
to understand why you cannot go longer than 60 seconds.
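If the whole crawl cannot finish within one task's 10-minute deadline, the usual pattern is to let each task fetch one slice of pages and then defer a follow-up task for the next slice. A hedged sketch reusing the hypothetical FetchResult model from above (PAGES_PER_TASK and fetch_batch are likewise invented names):

import urllib2

from google.appengine.ext import deferred

PAGES_PER_TASK = 25  # tune so that one slice fits comfortably in 10 minutes


def fetch_batch(result_key, base_url, start, lastpage):
    """Fetches one slice of pages, appends to the stored result,
    and chains the next slice as a fresh task."""
    end = min(start + PAGES_PER_TASK, lastpage)
    partial = []
    for page in range(start, end):
        response = urllib2.urlopen(base_url + str(page), timeout=5)
        partial.append(response.read())  # parse here instead of keeping raw HTML
    result = result_key.get()
    result.payload = (result.payload or u'') + unicode(partial)
    if end < lastpage:
        # More pages left: chain another 10-minute task for the next slice.
        deferred.defer(fetch_batch, result_key, base_url, end, lastpage)
    else:
        result.done = True
    result.put()

Appending to the datastore entity after each slice, rather than passing the accumulated results from task to task, keeps each task's payload small.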