Problem Description
update testdata.dataset1
set abcd = (select abc
            from dataset2
            order by random()
            limit 1);
Doing this only causes one random entry from table dataset2 to be populated into all the rows of the dataset1 table.
What I need is to populate each row of dataset1 with its own random entry from the dataset2 table.
Note: dataset1 can be larger than dataset2.
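For reference, a minimal schema that reproduces the setup might look like this (a hypothetical sketch; the question does not show the actual table definitions, so the id columns and the types are assumptions):
-- Hypothetical tables; several of the queries below assume integer id columns
CREATE TABLE dataset2 (
    id  INT,         -- assumed sequential key (1..N), used by the join-based queries
    abc VARCHAR(32)  -- source values to copy from
);
CREATE TABLE testdata.dataset1 (
    id   INT,        -- assumed integer key
    abcd VARCHAR(32) -- target column to fill with random values from dataset2
);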
Recommended Answer
Query 1
You should reference abcd inside your subquery to prevent "optimizing": the correlation forces the subquery to be re-evaluated for every row instead of just once.
UPDATE dataset1
SET abcd = (SELECT abc
            FROM dataset2
            WHERE abcd = abcd
            ORDER BY random()
            LIMIT 1);
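A caveat worth knowing: abcd = abcd evaluates to NULL (not true) for rows where abcd is NULL, so those rows get an empty subquery result and stay NULL. If the column starts out empty, a sketch like the following works around that, assuming abcd is a text column:
UPDATE dataset1
SET abcd = (SELECT abc
            FROM dataset2
            -- always true, but still references the outer row, keeping the subquery correlated
            WHERE COALESCE(abcd, '') = COALESCE(abcd, '')
            ORDER BY random()
            LIMIT 1);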
Query 2
The query below should be faster on plain PostgreSQL: it replaces the per-row ORDER BY random() sort with a random OFFSET (random() returns values in [0, 1), so the offset falls between 0 and COUNT(*) - 1):
UPDATE dataset1
SET abcd = (SELECT abc
            FROM dataset2
            WHERE abcd = abcd
            OFFSET floor(random() * (SELECT COUNT(*) FROM dataset2))
            LIMIT 1);
However, as you have reported, that is not the case on Redshift, which uses columnar storage.
Query 3
Fetching all the records from dataset2 in a single query should be more efficient than fetching them one by one. Let's test:
UPDATE dataset1 original
SET abcd = fake.abc
FROM (SELECT ROW_NUMBER() OVER (ORDER BY random()) AS id, abc
      FROM dataset2) AS fake
WHERE original.id % (SELECT COUNT(*) FROM dataset2) = fake.id - 1;
Note that an integer id column must exist in dataset1.
Also, for dataset1.id values greater than the number of records in dataset2, the assigned abcd values are predictable: the modulo maps ids that differ by a multiple of COUNT(dataset2) onto the same shuffled row, so those rows receive the same value.
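If dataset1 does not have such a column yet, on plain PostgreSQL one way to add it is a sketch like this (hypothetical; Redshift does not allow adding an IDENTITY column via ALTER TABLE, so a table rebuild with ROW_NUMBER() would be needed there instead):
-- Plain PostgreSQL only: adds a sequence-backed integer id to existing rows
ALTER TABLE dataset1 ADD COLUMN id SERIAL;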
Query 4
Let's create an integer fake_id column in dataset1, prefill it with random values, and perform a join on dataset1.fake_id = dataset2.id:
UPDATE dataset1
SET fake_id = floor(random() * (SELECT COUNT(*) FROM dataset2)) + 1;

UPDATE dataset1
SET abcd = abc
FROM dataset2
WHERE dataset1.fake_id = dataset2.id;
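If the fake_id column does not exist yet, it can be added first as a plain integer column (a sketch; this form works on both PostgreSQL and Redshift):
-- fake_id will hold a randomly chosen dataset2.id for each row
ALTER TABLE dataset1 ADD COLUMN fake_id INT;
Note that this approach assumes dataset2.id is a gapless sequence from 1 to COUNT(*); if there are gaps, some fake_id values will match no row in the join and the corresponding abcd values will be left unchanged.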
Query 5
If you don't want to add a fake_id column to dataset1, let's calculate the fake_ids "on the fly":
UPDATE dataset1
SET abcd = joined.abc
FROM (SELECT with_fake_id.id, dataset2.abc
      FROM (SELECT dataset1.id,
                   floor(random() * (SELECT COUNT(*) FROM dataset2) + 1) AS fake_id
            FROM dataset1) AS with_fake_id
      JOIN dataset2 ON with_fake_id.fake_id = dataset2.id) AS joined
WHERE dataset1.id = joined.id;
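To sanity-check any of the variants, a rough distribution query such as this (a sketch) shows how evenly the dataset2 values were spread over dataset1:
SELECT abcd, COUNT(*) AS occurrences
FROM dataset1
GROUP BY abcd
ORDER BY occurrences DESC;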
Performance
On plain PostgreSQL, Query 4 seems to be the most efficient.
I'll try to compare performance on a trial DC1.Large instance.
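On plain PostgreSQL, one way to time the variants without keeping their effects is to wrap each UPDATE in a transaction and roll it back (a sketch; note that EXPLAIN ANALYZE really executes the statement):
BEGIN;
EXPLAIN ANALYZE
UPDATE dataset1
SET fake_id = floor(random() * (SELECT COUNT(*) FROM dataset2)) + 1;
ROLLBACK;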