问题描述
如何从速度的角度改进以下功能?理想情况下,我希望使用 executemany 作为此启动过程的一部分.
How can I improve the following function from a speed point of view? Ideally, I'd like this to use executemany as part of this up process.
虽然该功能运行良好,但我确信有一种更有效的方法可以做到这一点;检查值是否存在并在适用时更新/插入.
While the function works fine, I'm sure there is a more efficient way to do this; check if a value exists and update/insert as applicable.
我需要每天处理数百万条数据.
I need to do this daily across millions of data.
def insert_or_update(self, profile_id, landing_page, keyword, position, impressions, clicks, ctr):
''' checks if a entry exists, if not it's inserted. If it is, the metrics
are updated.
'''
try:
self.cursor.execute('select id, position, impressions, clicks, ctr from temp where profile_id={} and keyword="{}" and landing_page="{}"'.format(profile_id, keyword, landing_page))
data = self.cursor.fetchone()
if data:
row_id = data[0]
sql_impressions = data[2] + impressions
sql_clicks = data[3] + clicks
sql_ctr = sum([data[4], ctr]) / len([data[4], ctr])
# if the keyword/landing_page exists
self.cursor.execute("update temp set position={}, impressions={}, clicks={}, ctr={} where id={}".format(position, sql_impressions, sql_clicks, sql_ctr, row_id))
# Commit your changes in the database
self.db.commit()
return self.cursor.lastrowid
else:
# if the keyword/landing_page doesn't exist
self.cursor.execute("insert into temp (profile_id, landing_page, keyword, position, impressions, clicks, ctr) values (%s, %s, %s, %s, %s, %s, %s)", (profile_id, landing_page, keyword, position, impressions, clicks, ctr))
# Commit your changes in the database
self.db.commit()
return self.cursor.lastrowid
except Exception as e:
return e
# Rollback in case there is any error
self.db.rollback()
finally:
self.db.close()
推荐答案
如果您需要执行数百万次此操作,这里会出现一系列性能问题.
There's a bunch of performance problems here if you need to do this millions of times.
您一遍又一遍地准备相同的 SQL 语句,数百万次.最好准备一次并执行数百万次.
You're preparing the same SQL statement over and over again, millions of times. It would work better to prepare it once and execute it millions of times.
在一次查询之后,您在每次函数调用时都与数据库断开连接.这意味着您每次都需要重新连接,任何缓存的信息都会被丢弃.不要那样做,让它保持连接.
You're disconnecting from the database on every function call after a single query. That means you need to reconnect each time and any cached information is thrown away. Don't do that, leave it connected.
您在每一行之后提交.这会减慢速度.相反,在执行批处理后提交.
You're committing after each row. This will slow things down. Instead, commit after doing a batch.
选择 + 更新或插入可能可以作为单个 upsert 完成.
The select + update or insert can probably be done as a single upsert.
您在临时表中插入如此多的内容可能是性能问题.
That you're inserting so much into a temp table is probably a performance issue.
如果表中的索引过多会减慢插入速度.有时最好删除索引,进行大批量更新,然后重新创建它们.
If the table has too many indexes that can slow inserts. Sometimes it's best to drop indexes, do a big batch update, and recreate them.
因为您将值直接放入 SQL 中,所以您的 SQL 对 SQL 注入开放攻击.
Because you're putting values directly into your SQL, your SQL is open to a SQL injection attack.
相反...
- 使用准备好的语句并绑定参数
- 保持数据库连接
- 批量更新
- 仅在更新运行结束时提交
- 在
UPDATE
中做所有的数学运算,而不是SELECT + math + UPDATE
. - 使用UPSERT"代替
SELECT
然后UPDATE
或INSERT
- Use prepared statements and bind parameters
- Leave the database connected
- Do updates in bulk
- Only commit at the end of a run of updates
- Do all the math in the
UPDATE
rather thenSELECT + math + UPDATE
. - Use an "UPSERT" instead of
SELECT
thenUPDATE
orINSERT
首先,准备好的语句.这些让 MySQL 编译语句一次,然后重用它.这个想法是你用占位符为值编写一个语句.
First off, prepared statements. These let MySQL compile the statement once and then reuse it. The idea is you write a statement with placeholders for the values.
select id, position, impressions, clicks, ctr
from temp
where profile_id=%s and
keyword=%s and
landing_page=%s
然后将值作为参数执行,而不是作为字符串的一部分.
Then you execute that with the values as arguments, not as part of the string.
self.cursor.execute(
'select id, position, impressions, clicks, ctr from temp where profile_id=%s and keyword=%s and landing_page=%s',
(profile_id, keyword, landing_page)
)
这允许数据库缓存准备好的语句,而不必每次都重新编译它.它还避免了 SQL 注入攻击,在这种攻击中,聪明的攻击者可以制作一个实际上更像 SQL 的值,例如 " MORE SQL HERE "
.这是一个非常、非常、非常常见的安全漏洞.
This allows the database to cache the prepared statement and not have to recompile it each time. It also avoids a SQL injection attack where a clever attacker can craft a value that is actually more SQL like " MORE SQL HERE "
. It is a very, very, very common security hole.
注意,您可能需要使用 MySQL 的自己的 Python 数据库库来获得真正的准备好的语句.不要太担心,使用准备好的语句并不是你最大的性能问题.
Note, you might need to use MySQL's own Python database library to get true prepared statements. Don't worry about it too much, using prepared statements is not your biggest performance problem.
接下来,您基本上要做的是添加到现有行,或者如果没有现有行,则插入一个新行.这可以在包含 UPSERT
、组合的 INSERT
和 UPDATE
的单个语句中更有效地完成.MySQL 将其作为 INSERT ... ON DUPLICATE KEY UPDATE
.
Next, what you're basically doing is adding to an existing row, or if there is no existing row, inserting a new one. This can be done more efficiently in a single statement with an UPSERT
, a combined INSERT
and UPDATE
. MySQL has it as INSERT ... ON DUPLICATE KEY UPDATE
.
要了解这是如何完成的,我们可以将您的 SELECT then UPDATE
编写为单个 UPDATE
.计算在 SQL 中完成.
To see how this is done, we can write your SELECT then UPDATE
as a single UPDATE
. The calculations are done in the SQL.
update temp
set impressions = impressions + %s,
clicks = clicks + %s,
ctr = (ctr + %s / 2)
where profile_id=%s and
keyword=%s and
landing_page=%s
您的 INSERT 保持不变...
Your INSERT remains the same...
insert into temp
(profile_id, landing_page, keyword, position, impressions, clicks, ctr)
values (%s, %s, %s, %s, %s, %s, %s)
将它们合并为一个 INSERT ON DUPLICATE KEY UPDATE.
Combine them into one INSERT ON DUPLICATE KEY UPDATE.
insert into temp
(profile_id, landing_page, keyword, position, impressions, clicks, ctr)
values (%s, %s, %s, %s, %s, %s, %s)
on duplicate key update
update temp
set impressions = impressions + %s,
clicks = clicks + %s,
ctr = (ctr + %s / 2)
这取决于表的键被定义为什么.如果你有 unique( profile_id, land_page, keyword )
那么它应该和你的代码一样工作.
This depends on what the keys of the table are defined as. If you have unique( profile_id, landing_page, keyword )
then it should work the same as your code.
即使你不能做 upsert,你也可以通过尝试 UPDATE
来消除 SELECT
,检查它是否更新了任何东西,如果它没有做一个 INSERT
.
Even if you can't do the upsert, you can eliminate the SELECT
by trying the UPDATE
, checking if it updated anything, and if it didn't doing an INSERT
.
批量更新.与其调用执行一次更新和提交的子例程,不如将一大堆要更新的内容传递给它并循环处理它们.您甚至可以利用 executemany
使用多个值运行相同的语句.然后提交.
Do the updates in bulk. Instead of calling a subroutine which does one update and commits, pass it a big list of things to be updated and work on them in a loop. You can even take advantage of executemany
to run the same statement with multiple values. Then commit.
您或许可以批量执行UPSERT
.INSERT
可以一次取多行.例如,这将插入三行.
You might be able to do the UPSERT
in bulk. INSERT
can take multiple rows at once. For example, this inserts three rows.
insert into whatever
(foo, bar, baz)
values (1, 2, 3),
(4, 5, 6),
(7, 8, 9)
您可以对 INSERT ON DUPLICATE KEY UPDATE
做同样的事情,减少与数据库对话的开销.请参阅这篇文章的示例(在 PHP 中,但你应该能够适应).
You can likely do the same with your INSERT ON DUPLICATE KEY UPDATE
reducing the amount of overhead to talk to the database. See this post for an example (in PHP, but you should be able to adapt).
这牺牲了返回最后插入行的 ID,但它们是中断.
This sacrifices returning the ID of the last inserted row, but them's the breaks.
这篇关于加速 MySQL 更新/插入语句的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!