Problem Description
I am writing a scraper using dryscrape in python3. I am trying to visit hundreds of different urls during a scraping session and click through about 10 ajax pages on each url (without visiting a different url per ajax page). I need something like dryscrape because I need to be able to interact with javascript components. The classes I wrote for my needs work, but I run out of memory once I have visited about 50 or 100 pages (all 4 GB of memory are used and the 4 GB of swap space is virtually 100% full). I looked at what is using up the memory and it appears that the webkit_server process is responsible for all of it. Why is this happening and how can I avoid it?
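To confirm that it really is the webkit_server process holding the memory, one option is to watch its resident set size with psutil. This is only a diagnostic sketch, not part of the original post, and it assumes the psutil package is installed; the helper name is made up.

# Hedged diagnostic sketch (not from the original post): report the resident memory
# of any running webkit_server processes. Assumes psutil is installed.
import psutil

def webkit_server_rss_mb():
    total = 0
    for proc in psutil.process_iter(['name', 'memory_info']):
        name = proc.info['name'] or ''
        if 'webkit_server' in name:
            total += proc.info['memory_info'].rss
    return total / (1024.0 * 1024.0)

# e.g. print this every few urls inside the scraping loop to watch the growth
print('webkit_server resident memory: %.1f MB' % webkit_server_rss_mb())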
Below are the relevant snippets of my class and my main method.
Here is the class which uses dryscrape, and you can see exactly what settings I am using.
import dryscrape
from lxml import html
from time import sleep
from webkit_server import InvalidResponseError
import re
from utils import unugly, my_strip, cleanhtml, stringify_children
from Profile import Profile, Question

class ExampleSession():

    def __init__(self, settings):
        self.settings = settings
        # dryscrape.start_xvfb()
        self.br = self.getBrowser()

    def getBrowser(self):
        session = dryscrape.Session()
        session.set_attribute('auto_load_images', False)
        session.set_header('User-agent', 'Google Chrome')
        return session

    def login(self):
        try:
            print('Trying to log in... ')
            self.br.visit('https://www.example.com/login')
            self.br.at_xpath('//*[@id="login_username"]').set(self.settings['myUsername'])
            self.br.at_xpath('//*[@id="login_password"]').set(self.settings['myPassword'])
            q = self.br.at_xpath('//*[@id="loginbox_form"]')
            q.submit()
        except Exception as e:
            print(str(e))
            print('\tException and couldn\'t log in!')
            return
        print('Logged in as %s' % (str(self.settings['myUsername'])))

    def getProfileQuestionsByUrl(self, url, thread_id=0):
        self.br.visit(str(url.rstrip()) + '/questions')
        tree = html.fromstring(self.br.body())
        questions = []
        num_pages = int(my_strip(tree.xpath('//*[@id="questions_pages"]//*[@class="last"]')[0].text))
        page = 0
        while (page < num_pages):
            sleep(0.5)
            # Do something with each ajax page
            # Next try-except tries to click the 'next' button
            try:
                next_button = self.br.at_xpath('//*[@id="questions_pages"]//*[@class="next"]')
                next_button.click()
            except Exception as e:
                pass
            page = page + 1
        return questions

    def getProfileByUrl(self, url, thread_id=0):
        missing = 'NA'
        try:
            try:
                # Visit a unique url
                self.br.visit(url.rstrip())
            except Exception as e:
                print(str(e))
                return None
            tree = html.fromstring(self.br.body())
            map = {}
            # Fill up the dictionary with some things I find on the page
            profile = Profile(map)
            return profile
        except Exception as e:
            print(str(e))
            return None
Here is the main method (snippet):
from socket import error as SocketError  # needed for the except clause below

def getProfiles(settings, urls, thread_id):
    exampleSess = ExampleSession(settings)
    exampleSess.login()
    profiles = []
    '''
    I want to visit at most a thousand unique urls (but I don't care if it
    will take 2 hours or 2 days as long as the session doesn't fatally break
    and my laptop doesn't run out of memory)
    '''
    for url in urls:
        try:
            profile = exampleSess.getProfileByUrl(url, thread_id)
            if (profile is not None):
                profiles.append(profile)
                try:
                    if (settings['scrapeQuestions'] == 'yes'):
                        profile_questions = exampleSess.getProfileQuestionsByUrl(url, thread_id)
                        if (profile_questions is not None):
                            profile.add_questions(profile_questions)
                except SocketError as e:
                    print(str(e))
                    print('\t[Thread %d] SocketError in getProfileQuestionsByUrl of profile...' % (thread_id))
        except Exception as e:
            print(str(e))
            print('\t[Thread %d] Exception while getting profile %s' % (thread_id, str(url.rstrip())))
            exampleSess.br.reset()
    exampleSess = None  # Does this kill my dryscrape session and prevent webkit_server from running?
    return profiles
Do I have dryscrape set up correctly? How does dryscrape's webkit_server end up using upwards of 4 GB the more urls I visit with getProfileByUrl and with getProfileQuestionsByUrl? Are there any settings that I am missing that might be compounding memory use?
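One note on the exampleSess = None line above: rebinding the name only drops the Python reference; it does not by itself terminate the webkit_server process that dryscrape spawned. A workaround that is sometimes tried, sketched below, is to scrape the urls in batches, run each batch in a short-lived worker process, and kill that worker's webkit_server children before it exits. None of this is from the original post, it assumes psutil is installed and that Profile objects are picklable, the batch size is arbitrary, and it was not established here that this fully caps the growth. Note that it also logs in once per batch, because getProfiles builds a fresh ExampleSession on each call.

# Hedged workaround sketch (not from the original post): one batch per worker
# process, with an explicit kill of the webkit_server children.
from multiprocessing import Process, Queue
import psutil

def scrapeBatch(settings, batch, thread_id, results):
    try:
        profiles = getProfiles(settings, batch, thread_id)
    except Exception as e:
        print(str(e))
        profiles = []
    finally:
        # webkit_server is spawned inside this worker (no session is created in the
        # parent), so it shows up among the worker's children; kill it explicitly
        for child in psutil.Process().children(recursive=True):
            child.kill()
    results.put(profiles)

def getProfilesInBatches(settings, urls, thread_id, batch_size=20):
    profiles = []
    for i in range(0, len(urls), batch_size):
        results = Queue()
        worker = Process(target=scrapeBatch,
                         args=(settings, urls[i:i + batch_size], thread_id, results))
        worker.start()
        profiles.extend(results.get())  # read the result before join() so the queue never blocks the worker
        worker.join()
    return profiles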
Recommended Answer
I couldn't resolve the memory issue (and I could reproduce this issue on a separate laptop). I ended up switching from dryscrape to selenium (and then to phantomjs). PhantomJS has been superior in my opinion, and it does not take up a lot of memory either.
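For reference, a minimal sketch of what the replacement session might look like with selenium driving PhantomJS. The original post does not include this code: the class name is made up, the login xpaths simply mirror the dryscrape class above, the PhantomJS binary is assumed to be on PATH, and it uses the selenium 2/3-era API that still shipped a PhantomJS driver (deprecated in newer releases).

# Hedged sketch of a selenium + PhantomJS session (not from the original post).
from selenium import webdriver

class PhantomSession():

    def __init__(self, settings):
        self.settings = settings
        self.br = webdriver.PhantomJS()  # deprecated in newer selenium releases
        self.br.set_window_size(1280, 1024)

    def login(self):
        self.br.get('https://www.example.com/login')
        self.br.find_element_by_xpath('//*[@id="login_username"]').send_keys(self.settings['myUsername'])
        self.br.find_element_by_xpath('//*[@id="login_password"]').send_keys(self.settings['myPassword'])
        self.br.find_element_by_xpath('//*[@id="loginbox_form"]').submit()

    def getBody(self, url):
        self.br.get(url.rstrip())
        return self.br.page_source

    def close(self):
        self.br.quit()  # shuts down the browser process explicitly

The explicit quit() call is the practical difference: it terminates the browser process deterministically instead of relying on garbage collection or interpreter exit.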