This article describes how to keep a persistent MySQL database connection in a Python Scrapy project. It should serve as a useful reference for anyone facing the same problem.

Problem description

I am using Scrapy for one of my projects. The data gets scraped by the spider and passed to a pipeline for insertion into a database. Here is my database class code:

import MySQLdb


class Database:

    host = 'localhost'
    user = 'root'
    password = 'test123'
    db = 'scraping_db'

    def __init__(self):
        # Open a single connection (and cursor) when the Database object is created
        self.connection = MySQLdb.connect(self.host, self.user, self.password, self.db,
                                          use_unicode=True, charset="utf8")
        self.cursor = self.connection.cursor()

    def insert(self, query, params):
        try:
            self.cursor.execute(query, params)
            self.connection.commit()
        except Exception:
            # Undo the failed statement so the connection stays usable
            self.connection.rollback()

    def __del__(self):
        # Close the connection when the object is garbage-collected
        self.connection.close()

Here is my pipeline code, which processes scraped items and saves them into the MySQL database.

from con import Database

class LinkPipeline(object):

    def __init__(self):
        self.db=Database()

    def process_item(self, item, spider):
        query="""INSERT INTO links (title, location,company_name,posted_date,status,company_id,scraped_link,content,detail_link,job_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s,%s)"""
        params=(item['title'], item['location'], item['company_name'], item['posted_date'], item['status'], item['company_id'], item['scraped_link'], item['content'], item['detail_link'],item['job_id'])
        self.db.insert(query,params)
        return item
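
As an aside, the pipeline only runs if it is registered in the project's settings.py. A minimal sketch, assuming the project package is called myproject and this class lives in myproject/pipelines.py (both names are placeholders, adjust to your layout):

# settings.py -- register the pipeline; the number (0-1000) controls its order
ITEM_PIPELINES = {
    'myproject.pipelines.LinkPipeline': 300,
}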

From the above flow I feel that, whenever an item is processed via the pipeline, a database connection is opened and then closed once process_item completes. This would open too many database connections. I want a way for my database connection to be opened only once during the whole life cycle of the spider and closed when the spider closes.

I have read that there are open_spider and close_spider methods in the Spider class. If I use them, how can I pass the reference to the database connection from the spider's start_requests method to the pipeline class?

Are there any better approaches to this?

Recommended answer

import MySQLdb
import scrapy


class MySpider(scrapy.Spider):
    name = "myspidername"

    host = 'localhost'
    user = 'root'
    password = 'test123'
    db = 'scraping_db'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # One connection for the whole life of the spider
        self.connection = MySQLdb.connect(self.host, self.user, self.password, self.db,
                                          use_unicode=True, charset="utf8")
        self.cursor = self.connection.cursor()

    def insert(self, query, params):
        try:
            self.cursor.execute(query, params)
            self.connection.commit()
        except Exception:
            # Undo the failed statement so the connection stays usable
            self.connection.rollback()

    def __del__(self):
        # Close the connection when the spider object is garbage-collected
        self.connection.close()

Then, in your pipeline, use the spider argument that Scrapy passes to process_item: call spider.insert (or use spider.cursor directly) to perform any MySQL operation.

class LinkPipeline(object):

    def process_item(self, item, spider):
        query = """INSERT INTO links (title, location, company_name, posted_date, status,
                   company_id, scraped_link, content, detail_link, job_id)
                   VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        params = (item['title'], item['location'], item['company_name'], item['posted_date'],
                  item['status'], item['company_id'], item['scraped_link'], item['content'],
                  item['detail_link'], item['job_id'])
        # Reuse the spider's single connection instead of opening a new one per item
        spider.insert(query, params)
        return item

That concludes this article on persistent MySQL database connections in Python Scrapy. Hopefully the recommended answer above is helpful.
