Memory-efficient built-in SqlAlchemy iterator/generator?

Problem description

I have a ~10M record MySQL table that I interface with using SqlAlchemy. I have found that queries on large subsets of this table will consume too much memory even though I thought I was using a built-in generator that intelligently fetched bite-sized chunks of the dataset:

for thing in session.query(Things):
    analyze(thing)

To avoid this, I find I have to build my own iterator that bites off in chunks:

lastThingID = None
while True:
    query = session.query(Thing).order_by(Thing.id.desc())
    if lastThingID is not None:
        # Resume just below the last id seen in the previous batch.
        query = query.filter(Thing.id < lastThingID)
    things = query.limit(querySize).all()
    if not things:
        break
    for thing in things:
        lastThingID = thing.id
        analyze(thing)

Is this normal or is there something I'm missing regarding SA built-in generators?

The answer to this question seems to indicate that the memory consumption is not to be expected.

Recommended answer

Most DBAPI implementations fully buffer rows as they are fetched - so usually, before the SQLAlchemy ORM even gets a hold of one result, the whole result set is in memory.

But then, the way Query works is that it fully loads the given result set by default before returning your objects to you. The rationale here concerns queries that are more than simple SELECT statements: for example, joins to other tables may return the same object identity multiple times in one result set (common with eager loading), so the full set of rows needs to be in memory in order for the correct results to be returned; otherwise collections and such might be only partially populated.
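
For illustration, a hypothetical query of that kind (the Things.children relationship is assumed for the example; it is not from the question):

from sqlalchemy.orm import joinedload

# Each Things row can appear once per joined child row in the raw
# result, so the ORM must see the complete set of rows in order to
# de-duplicate identities and fully populate the .children collections.
for thing in session.query(Things).options(joinedload(Things.children)):
    analyze(thing)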

So Query offers an option to change this behavior through yield_per(). This call causes the Query to yield rows in batches, where you give it the batch size. As the docs state, this is only appropriate if you aren't doing any kind of eager loading of collections, so it's basically for when you really know what you're doing. Also, if the underlying DBAPI pre-buffers rows, that memory overhead will still be there, so the approach scales only slightly better than not using it.
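
As a minimal sketch, the call looks like this (the batch size of 1000 is an arbitrary choice):

# Yield ORM objects in batches of 1000 instead of buffering the full
# result set in the ORM; only safe without collection eager loading.
for thing in session.query(Things).yield_per(1000):
    analyze(thing)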

I hardly ever use yield_per(); instead, I use a better version of the LIMIT approach you suggest above, using window functions. LIMIT and OFFSET have a huge problem: very large OFFSET values cause the query to get slower and slower, since an OFFSET of N forces it to page through N rows - it's like doing the same query fifty times instead of once, each time reading a larger and larger number of rows. With a window-function approach, I pre-fetch a set of "window" values that refer to chunks of the table I want to select, then emit individual SELECT statements that each pull from one of those windows at a time.

The window function approach is on the wiki and I use it with great success.
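
A condensed sketch of the idea, loosely based on that recipe (the windowed_query helper and the window size are illustrative, not part of SQLAlchemy itself):

from sqlalchemy import func

def windowed_query(session, column, window_size):
    """Yield (start, end) ranges of `column` covering the table in chunks.

    Uses row_number() to pick every window_size-th value of `column`
    as a window boundary; requires a database with window functions.
    """
    rownum = func.row_number().over(order_by=column).label('rownum')
    subq = session.query(column.label('id'), rownum).subquery()
    boundaries = [
        row.id
        for row in session.query(subq.c.id)
                          .filter(subq.c.rownum % window_size == 1)
    ]
    for i, start in enumerate(boundaries):
        end = boundaries[i + 1] if i + 1 < len(boundaries) else None
        yield start, end

# Emit one SELECT per window instead of one giant result set.
for start, end in windowed_query(session, Thing.id, 1000):
    q = session.query(Thing).filter(Thing.id >= start)
    if end is not None:
        q = q.filter(Thing.id < end)
    for thing in q:
        analyze(thing)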

Also note: not all databases support window functions; you need Postgresql, Oracle, or SQL Server. IMHO using at least Postgresql is definitely worth it - if you're using a relational database, you might as well use the best.
