Problem description
I have a function that parses a file and inserts the data into MySQL using SQLAlchemy. I've been running the function sequentially on the result of os.listdir() and everything works perfectly.
Because most of the time is spent reading the file and writing to the DB, I wanted to use multiprocessing to speed things up. Here is my pseudocode, as the actual code is too long:
def parse_file(filename):
    f = open(filename, 'rb')
    data = f.read()
    f.close()

    soup = BeautifulSoup(data, features="lxml", from_encoding='utf-8')

    # parse file here

    db_record = MyDBRecord(parsed_data)

    session.add(db_record)
    session.commit()

pool = mp.Pool(processes=8)

pool.map(parse_file, ['my_dir/' + filename for filename in os.listdir("my_dir")])
The problem I'm seeing is that the script hangs and never finishes. I usually get 48 of 63 records into the database. Sometimes it's more, sometimes it's less.
I've tried using pool.close(), both on its own and in combination with pool.join(), and neither seems to help.
How do I get this script to complete? What am I doing wrong? I'm using Python 2.7.8 on a Linux box.
Recommended answer
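The problem was a combination of two things:
- My pool code was being called multiple times (thanks @Peter Wood)
- My DB code was opening too many sessions and/or sharing sessions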
I made the following changes and everything works now:
Original file
def parse_file(filename):
    f = open(filename, 'rb')
    data = f.read()
    f.close()

    soup = BeautifulSoup(data, features="lxml", from_encoding='utf-8')

    # parse file here

    db_record = MyDBRecord(parsed_data)

    session = get_session()  # see below
    session.add(db_record)
    session.commit()

pool = mp.Pool(processes=8)

pool.map(parse_file, ['my_dir/' + filename for filename in os.listdir("my_dir")])
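The snippet above still builds the pool at module level. One common way to keep pool setup from running more than once (the first cause listed in this answer) is to place it behind an `if __name__ == '__main__':` guard, so that importing the file never starts workers. The following is a minimal sketch under that assumption and not necessarily the change the author made: `list_paths` and `main` are hypothetical names, and `parse_file` here is a stand-in that only returns the file size instead of parsing and writing to the database.

```python
import multiprocessing as mp
import os


def parse_file(filename):
    # Stand-in for the real parser: report the file's size in bytes.
    with open(filename, 'rb') as f:
        return len(f.read())


def list_paths(directory):
    # Build the work list in a stable order.
    return [os.path.join(directory, name)
            for name in sorted(os.listdir(directory))]


def main(directory):
    pool = mp.Pool(processes=8)
    try:
        return pool.map(parse_file, list_paths(directory))
    finally:
        # Stop accepting work, then wait for the workers to exit.
        pool.close()
        pool.join()


if __name__ == '__main__':
    # Runs only when the file is executed as a script, not when imported.
    if os.path.isdir('my_dir'):
        print(main('my_dir'))
```

With this layout, another module can import `parse_file` without creating a second pool as a side effect.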
Database file
def get_session():
    engine = create_engine('mysql://root:root@localhost/my_db')

    Base.metadata.create_all(engine)
    Base.metadata.bind = engine

    db_session = sessionmaker(bind=engine)

    return db_session()