Webkit_server (called from python's dryscrape) uses more and more memory with each page visited. How do I reduce the memory used?

Problem Description

I am writing a scraper using dryscrape in python3. I am trying to visit hundreds of different urls during a scraping session and click through about 10 ajax pages on each url (without visiting a different url per ajax page). I need something like dryscrape because I need to be able to interact with javascript components. The classes I wrote for my needs work, but I am running out of memory by the time I have visited about 50 or 100 pages (all 4 GB of memory are used and the 4 GB of swap space is virtually 100% full). I looked at what is using up the memory and it appears that the webkit_server process is responsible for all of it. Why is this happening and how can I avoid it?

Below are the relevant snippets of my class and my main method.

Here is the class which uses dryscrape; you can see exactly what settings I am using.

import dryscrape
from lxml import html
from time import sleep
from webkit_server import InvalidResponseError
import re

from utils import unugly, my_strip, cleanhtml, stringify_children
from Profile import Profile, Question

class ExampleSession():

    def __init__(self, settings):
        self.settings = settings
        # dryscrape.start_xvfb()
        self.br = self.getBrowser()

    def getBrowser(self):
        session = dryscrape.Session()
        session.set_attribute('auto_load_images', False)
        session.set_header('User-agent', 'Google Chrome')
        return session

    def login(self):
        try:
            print('Trying to log in... ')
            self.br.visit('https://www.example.com/login')
            self.br.at_xpath('//*[@id="login_username"]').set(self.settings['myUsername'])
            self.br.at_xpath('//*[@id="login_password"]').set(self.settings['myPassword'])
            q = self.br.at_xpath('//*[@id="loginbox_form"]')
            q.submit()
        except Exception as e:
            print(str(e))
            print('\tException and couldn\'t log in!')
            return
        print('Logged in as %s' % (str(self.settings['myUsername'])))

    def getProfileQuestionsByUrl(self, url, thread_id=0):
        self.br.visit(str(url.rstrip()) + '/questions')

        tree = html.fromstring(self.br.body())
        questions = []

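        # Total number of ajax pages, read from the "last" element of the pagination block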
        num_pages = int(my_strip(tree.xpath('//*[@id="questions_pages"]//*[@class="last"]')[0].text))

        page = 0
        while (page < num_pages):
            sleep(0.5)
            # Do something with each ajax page
            # Next try-except tries to click the 'next' button
            try:
                next_button = self.br.at_xpath('//*[@id="questions_pages"]//*[@class="next"]')
                next_button.click()
            except Exception as e:
                pass
            page = page + 1

        return questions

    def getProfileByUrl(self, url, thread_id=0):
        missing = 'NA'

        try:
            try:
                # Visit a unique url
                self.br.visit(url.rstrip())
            except Exception as e:
                print(str(e))
                return None
            tree = html.fromstring(self.br.body())

            map = {}
            # Fill up the dictionary with some things I find on the page

            profile = Profile(map)
            return profile
        except Exception as e:
            print(str(e))
            return None

Here is the main method (snippet):

from socket import error as SocketError  # assumed import; the snippet does not show where SocketError comes from

def getProfiles(settings, urls, thread_id):
    exampleSess = ExampleSession(settings)
    exampleSess.login()

    profiles = []
    '''
    I want to visit at most a thousand unique urls (but I don't care if it
    will take 2 hours or 2 days as long as the session doesn't fatally break
    and my laptop doesn't run out of memory)
    '''
    for url in urls:
        try:
            profile = exampleSess.getProfileByUrl(url, thread_id)

            if (profile is not None):
                profiles.append(profile)

                try:
                    if (settings['scrapeQuestions'] == 'yes'):
                        profile_questions = exampleSess.getProfileQuestionsByUrl(url, thread_id)

                        if (profile_questions is not None):
                            profile.add_questions(profile_questions)
                except SocketError as e:
                    print(str(e))
                    print('\t[Thread %d] SocketError in getProfileQuestionsByUrl of profile...' % (thread_id))

        except Exception as e:
            print(str(e))
            print('\t[Thread %d] Exception while getting profile %s' % (thread_id, str(url.rstrip())))
            exampleSess.br.reset()

    exampleSess = None # Does this kill my dryscrape session and prevent webkit_server from running?

    return profiles

Do I have dryscrape set up correctly? How does dryscrape's webkit_server end up using upwards of 4 GB of memory the more urls I visit with getProfileByUrl and getProfileQuestionsByUrl? Are there any settings that I am missing that might be compounding memory use?

Recommended Answer

I couldn't resolve the memory issue (and I could reproduce it on a separate laptop). I ended up switching from dryscrape to Selenium (and then to PhantomJS). In my opinion, PhantomJS has been superior, and it does not take up much memory either.

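As a rough illustration of that switch, here is a minimal sketch of the same login flow driven through Selenium and PhantomJS. It assumes an older Selenium release (3.x, where the PhantomJS driver and the find_element_by_xpath helpers still exist; both were removed in Selenium 4) and a phantomjs binary on the PATH. The URL and XPaths are the placeholders from the question's snippet, not a real site.

from selenium import webdriver

# Start a headless PhantomJS browser (Selenium 3.x API).
driver = webdriver.PhantomJS()
try:
    driver.get('https://www.example.com/login')
    driver.find_element_by_xpath('//*[@id="login_username"]').send_keys('myUsername')
    driver.find_element_by_xpath('//*[@id="login_password"]').send_keys('myPassword')
    driver.find_element_by_xpath('//*[@id="loginbox_form"]').submit()
    html_source = driver.page_source  # rough equivalent of dryscrape's session.body()
finally:
    driver.quit()  # terminates the PhantomJS process and releases its memory

The relevant difference for the memory problem is driver.quit(): it shuts down the browser process itself, so its memory can be reclaimed between batches of urls, whereas the question observes that dryscrape's webkit_server process keeps running and growing for the life of the session.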