我有6100万个状态不唯一的电子邮件。
这些电子邮件需要按状态使用逻辑进行重复数据消除。
我编写存储过程,但此过程会运行很长时间。
如何优化此过程的执行时间?

CREATE OR REPLACE FUNCTION public.load_oxy_emails() RETURNS boolean AS $$

DECLARE
        row record;
        rec record;
        new_id int;
BEGIN
        FOR row IN SELECT * FROM oxy_email ORDER BY id LOOP
                SELECT * INTO rec FROM oxy_emails_clean WHERE email = row.email;
                IF rec IS NOT NULL THEN
                        IF row.status = 3 THEN
                                UPDATE oxy_emails_clean SET status = 3 WHERE id = rec.id;
                        END IF;
                ELSE
                        INSERT INTO oxy_emails_clean(id, email, status) VALUES(nextval('oxy_emails_clean_id_seq'), row.email, row.status);
                        SELECT currval('oxy_emails_clean_id_seq') INTO new_id;
                        INSERT INTO oxy_emails_clean_websites_relation(oxy_emails_clean_id, website_id) VALUES(new_id, row.website_id);
                END IF;
        END LOOP;
        RETURN true;
END;
$$
LANGUAGE 'plpgsql';

最佳答案

如何优化此过程的执行时间?
不要用循环来做。
逐行处理(也称为“slow by slow”)几乎总是比批量更改慢得多,其中一条语句“一次性”处理了许多行。
使用一条语句可以很容易地更改状态:

update oxy_emails_clean oec
    SET status = 3
from oxy_email oe
where oe.id = oec.id
  and oe.status = 3;

可以使用CTEs的链来复制行:
with to_copy as (
  select *
  from oxy_email
  where status <> 3 --<< all those that have a different status
), clean_inserted as (
  INSERT INTO oxy_emails_clean (id, email, status)
  select nextval('oxy_emails_clean_id_seq'), email, status
  from to_copy
  returning id;
)
insert oxy_emails_clean_websites_relation (oxy_emails_clean_id, website_id)
select ci.id, tc.website_id
from clean_inserted ci
  join to_copy tc on tc.id = ci.id;

关于postgresql - 如何优化PostgreSQL程序,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/42116890/

10-11 07:10