问题描述
我的产品列表(300万件)没有ID - 只有标题。但是我不知道DB中已经存在哪些标题。新产品(约290万件)必须加入DB。之后,我必须知道每个产品的ID(新的和现有的)。
I have list of products (3 million items) without IDs - only titles. But I don't know which titles already exist in DB. New products (about 2.9 million items) must be added into DB. After that I must know ID for each products (new and existing).
在PostgreSQL中有最快的方法吗?我可以根据需要更改数据库(添加默认值,添加列等)。
Is there the fastest way to do it in PostgreSQL? I can change DB as needed (add default values, add columns etc.).
推荐答案
导入数据
将所有内容设置为临时暂存表,并只在目标表中插入新的标题。
Import data
COPY
everything to a temporary staging table and insert only new titles into your target table.
CREATE TEMP TABLE tmp(title text);
COPY tmp FROM 'path/to/file.csv';
ANALYZE tmp;
INSERT INTO tbl
SELECT DISTINCT tmp.title
FROM tmp
LEFT JOIN tbl USING (title)
WHERE tbl.title IS NULL;
应该使用列 tbl_id
tbl
。
LEFT JOIN
/ IS NULL
构造取消已经存在的标题的资格。 NOT EXISTS
将是另一种可能性。
The LEFT JOIN
/ IS NULL
construct disqualifies already existing titles. NOT EXISTS
would be another possibility.
DISTINCT
在临时表中的传入数据中复制 tmp
。
DISTINCT
prevents duplicates in the incoming data in the temporary table tmp
.
有助于确保查询计划员选择一个明智的计划和临时表不通过autovacuum进行分析。
ANALYZE
is useful to make sure the query planner picks a sensible plan, and temporary tables are not analyzed by autovacuum.
由于您有300万个项目,因此可能需要为 temp_buffer
( )
Since you have 3 million items, it might pay to raise the setting for temp_buffer
(for this session only):
SET temp_buffers = 1000MB;
或者你可以负担得起,足够让RAM中的临时表保持更快。注意:必须先在会话中完成 ,然后才能创建任何临时对象。
Or however much you can afford and is enough to hold the temp table in RAM, which is much faster. Note: must be done first in the session - before any temp objects are created.
要查看导入数据的所有ID:
To see all IDs for the imported data:
SELECT tbl.tbl_id, tbl.title
FROM tbl
JOIN tmp USING (title)
在同一个会话中!临时表在会话结束时自动删除。
In the same session! A temporary table is dropped automatically at the end of the session.
这篇关于如何在PostreSQL中批量插入新行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!