How can I efficiently shuffle the values of a specific column in a table with more than 100,000 rows?

Table definition:

CREATE TABLE person
(
  id integer NOT NULL,
  first_name character varying,
  last_name character varying,
  CONSTRAINT person_pkey PRIMARY KEY (id)
);

To anonymize the data, I have to shuffle the values of the first_name column in place (I am not allowed to create a new table).

My attempt:
with
first_names as (
select row_number() over (order by random()),
       first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()),
       id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names, ids
where id = ref_id;

It takes several hours to complete.

Is there a more efficient way to do this?

Best answer

The problem with Postgres is that every update effectively means a DELETE + INSERT.

  • You can run the CTEs with a SELECT instead of the UPDATE and check EXPLAIN ANALYZE to see how they perform.
  • You can drop the indexes so the updates run faster.
  • But the best solution I have used when I need to update all rows is to create the table again:

    CREATE TABLE new_table AS
    SELECT * ...;

    DROP TABLE old_table;

    ALTER TABLE new_table RENAME TO old_table;

    -- recreate indexes and constraints
    

    Sorry, that is not an option in your case :(
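
    The first suggestion can be sketched as follows (a hypothetical illustration using the person table from the question; replacing the UPDATE with a SELECT lets EXPLAIN ANALYZE time the CTEs without modifying any rows):

    -- Time the CTEs alone; no rows are modified.
    EXPLAIN ANALYZE
    WITH first_names AS (
        SELECT row_number() OVER (ORDER BY random()) AS rn,
               first_name AS new_first_name
        FROM person
    ),
    ids AS (
        SELECT row_number() OVER (ORDER BY random()) AS rn,
               id AS ref_id
        FROM person
    )
    SELECT ref_id, new_first_name
    FROM first_names
    JOIN ids ON first_names.rn = ids.rn;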

    Edit: after reading a_horse_with_no_name's comment,

    it looks like what you need is:
    with
    first_names as (
        select row_number() over (order by random()) rn,
               first_name as new_first_name
        from person
    ),
    ids as (
        select row_number() over (order by random()) rn,
               id as ref_id
        from person
    )
    update person
    set first_name = new_first_name
    from first_names
    join ids
      on first_names.rn = ids.rn
    where id = ref_id;
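
    A quick sanity check for the shuffle (a hypothetical query, not part of the original answer): since the values are only reordered, the multiset of first names must stay the same, so the per-name counts taken before and after the UPDATE should match exactly.

    -- Run before and after the UPDATE; the two result sets should be identical.
    SELECT first_name, count(*) AS n
    FROM person
    GROUP BY first_name
    ORDER BY first_name;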
    

    Again, performance questions are much easier to answer if you include the EXPLAIN ANALYZE output.

    Regarding "sql - PostgreSQL shuffle column values", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33555524/
