Question
I'm trying to extract huge amounts of data from a DB and write it to a csv file. I'm trying to find out what the fastest way would be to do this. I found that running writerows on the result of a fetchall was 40% slower than the code below.
with open(filename, 'a') as f:
    writer = csv.writer(f, delimiter='\t')
    cursor.execute("SELECT * FROM table")
    writer.writerow([i[0] for i in cursor.description])

    count = 0
    builder = []
    row = cursor.fetchone()
    # Interleave each row's values with tabs and end each row with a newline
    DELIMITERS = ['\t'] * (len(row) - 1) + ['\n']

    while row:
        count += 1
        # Add row with delimiters to builder
        builder += [str(item) for pair in zip(row, DELIMITERS) for item in pair]
        if count == 1000:
            # Flush the accumulated rows to disk every 1000 rows
            count = 0
            f.write(''.join(builder))
            builder[:] = []
        row = cursor.fetchone()

    # Write whatever is left in the buffer
    f.write(''.join(builder))
The database I'm using is unique to the small company that I'm working for, so unfortunately I can't provide much information on that front. I'm using jpype to connect with the database, since the only means of connecting is via a JDBC driver. I'm running CPython 2.7.5; I would love to use PyPy, but it doesn't work with Pandas.
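For context, a jpype-over-JDBC connection typically looks something like the sketch below. The jar path, driver class, and URL are placeholders (the real driver is proprietary), and a DB-API wrapper such as JayDeBeApi usually sits on top of this to provide the cursor interface used above:

import jpype

# Placeholder jar path and JDBC URL; substitute the vendor's actual values.
jpype.startJVM(jpype.getDefaultJVMPath(),
               "-Djava.class.path=/path/to/driver.jar")
DriverManager = jpype.JClass("java.sql.DriverManager")
conn = DriverManager.getConnection(
    "jdbc:vendor://host:port/dbname", "user", "password")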
Since I'm extracting such a large number of rows, I'm hesitant to use fetchall for fear that I'll run out of memory. row has comparable performance and is much easier on the eyes, so I think I'll use that. Thanks a bunch!
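For what it's worth, the standard DB-API middle ground between fetchone and fetchall is fetchmany, which keeps memory bounded while still writing in batches. A minimal sketch, with the batch size as a tunable guess rather than a recommendation:

def rows_batched(cursor, writer, batch_size=1000):
    # Stream the result set in fixed-size chunks: memory stays bounded
    # no matter how many rows the query returns, while csv's writerows
    # still handles the per-row formatting.
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        writer.writerows(batch)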
Answer
With the little you've given us to go on, it's hard to be more specific, but…
I've wrapped your code up as a function, and written three alternative versions:
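(The wrapped-up version of the original code isn't shown in the answer; a plausible reconstruction, using the manual name that the test driver below expects and the same sqlite3 setup as the alternatives, would be:)

def manual():
    # Reconstruction of the question's buffered string-building approach,
    # wrapped as a function for timing; not verbatim from the answer.
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        count = 0
        builder = []
        row = cursor.fetchone()
        DELIMITERS = ['\t'] * (len(row) - 1) + ['\n']
        while row:
            count += 1
            builder += [str(item) for pair in zip(row, DELIMITERS)
                        for item in pair]
            if count == 1000:
                count = 0
                f.write(''.join(builder))
                builder[:] = []
            row = cursor.fetchone()
        f.write(''.join(builder))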
def row():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        for row in cursor:
            writer.writerow(row)

def rows():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor)

def rowsall():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor.fetchall())
Notice that the last one is the one you say you tried.
Now, I wrote this test driver:
def randomname():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(30))

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE mytable (id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR)')
db.executemany('INSERT INTO mytable (name) VALUES (?)',
               [[randomname()] for _ in range(10000)])
filename = 'db.csv'

for f in manual, row, rows, rowsall:
    t = timeit.timeit(f, number=1)
    print('{:<10} {}'.format(f.__name__, t))
And here are the results:
manual     0.055549702141433954
row        0.03852885402739048
rows       0.03992213006131351
rowsall    0.02850699401460588
So, your code takes nearly twice as long as calling fetchall and writerows in my test!
When I repeat a similar test with other databases, however, rowsall is anywhere from 20% faster to 15% slower than manual (never 40% slower, but as much as 15%)… but row or rows is always significantly faster than manual.
I think the explanation is that your custom code is significantly slower than csv.writerows, but that in some databases, using fetchall instead of fetchone (or just iterating the cursor) slows things down significantly. The reason this isn't true with an in-memory sqlite3 database is that fetchone does all of the same work as fetchall and then feeds you the list one row at a time; with a remote database, fetchone may do anything from fetching all the rows, to fetching a buffer at a time, to fetching one row at a time, making it potentially much slower or faster than fetchall, depending on your data.
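If the driver's buffering is the bottleneck, JDBC exposes a standard hint for it: java.sql.Statement.setFetchSize, which suggests how many rows the driver should pull per network round trip. A sketch of how that might look through jpype, assuming conn is a java.sql.Connection as in the connection sketch in the question (whether a proprietary driver honors the hint is driver-dependent):

# setFetchSize is a standard java.sql.Statement hint; some drivers
# ignore it entirely. 1000 is a starting point to tune, not a magic value.
stmt = conn.createStatement()
stmt.setFetchSize(1000)
results = stmt.executeQuery("SELECT * FROM mytable")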
But for a really useful explanation, you'd have to tell us exactly which database and library you're using (and which Python version: CPython 3.3.2's csv module seems to be a lot faster than CPython 2.7.5's, and PyPy 2.1/2.7.2 seems to be faster than CPython 2.7.5 as well, though either one might also run your original code faster…) and so on.