Problem description
What is the fastest way to insert 237 million records into a table that has rules (for distributing data across child tables)?
I have tried or considered:
- Insert statements.
- Transactional inserts (BEGIN and COMMIT).
- The COPY FROM command.
- http://pgbulkload.projects.postgresql.org/
Inserts are too slow (four days) and COPY FROM ignores rules (and has other issues).
Sample data:
station_id,taken,amount,category_id,flag
1,'1984-07-1',0,4,
1,'1984-07-2',0,4,
1,'1984-07-3',0,4,
1,'1984-07-4',0,4,T
Table structure (with one rule included):
CREATE TABLE climate.measurement
(
id bigserial NOT NULL,
station_id integer NOT NULL,
taken date NOT NULL,
amount numeric(8,2) NOT NULL,
category_id smallint NOT NULL,
flag character varying(1) NOT NULL DEFAULT ' '::character varying
)
WITH (
OIDS=FALSE
);
ALTER TABLE climate.measurement OWNER TO postgres;
CREATE OR REPLACE RULE i_measurement_01_001 AS
ON INSERT TO climate.measurement
WHERE date_part('month'::text, new.taken)::integer = 1 AND new.category_id = 1
DO INSTEAD
  INSERT INTO climate.measurement_01_001 (id, station_id, taken, amount, category_id, flag)
  VALUES (new.id, new.station_id, new.taken, new.amount, new.category_id, new.flag);
The data was originally in MySQL, but must be switched to PostgreSQL for performance reasons (and to leverage the PL/R extension).
Thank you!
Recommended answer
Split your input into separate files outside the database and upload each one using COPY, rather than relying on the rule to distribute them. If the rule you give is any example, that's a trivial text transformation to apply. Also, splitting up front will let you load the split files in parallel if your disk system is up to it.
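For illustration, here is a minimal pre-split sketch in Python (not from the original answer). It assumes the sample layout shown in the question, a hypothetical input file named measurement.csv, and that the child tables follow the measurement_MM_CCC naming implied by the i_measurement_01_001 rule; it also strips the single quotes around the dates and turns empty flag fields into a single space so COPY's CSV mode does not read them as NULL:

# Pre-split sketch: one output file per (month, category_id) partition.
# Assumes columns station_id,taken,amount,category_id,flag with no header row.
import csv

writers = {}  # (month, category_id) -> (file handle, csv writer)

with open("measurement.csv", newline="") as src:  # hypothetical input file name
    for station_id, taken, amount, category_id, flag in csv.reader(src):
        taken = taken.strip("'")              # clean up the quoted dates for COPY
        key = (taken[5:7], int(category_id))  # month via a plain string slice
        if key not in writers:
            out = open("measurement_%s_%03d.csv" % key, "w", newline="")
            writers[key] = (out, csv.writer(out))
        # An empty flag becomes ' ' so COPY ... CSV doesn't read it as NULL.
        writers[key][1].writerow([station_id, taken, amount, category_id, flag or " "])

for out, _ in writers.values():
    out.close()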
Seriously, don't rely on the rule to do this distribution for a bulk load. It's practically always the case that bulk load and transactional load need different approaches, unless you're prepared to brute-force one or the other (and, usually, wait).
For instance, your rule uses date_part() to extract the month from the date, so in order to determine the child table, postgres needs to analyse the date string, convert it to a timestamp, and then convert the timestamp back to a calendar just to get the month field back out again. But if you're writing something to do this up front, you can just do substr($date,5,2) (or equivalent): which do you think will be faster?
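As a rough analogy (Python string slicing versus full date parsing, not PostgreSQL's date_part() itself), the gap is easy to measure:

# Illustrative micro-benchmark: slicing the month out of a 'YYYY-MM-DD' string
# versus parsing the whole date first. Numbers will vary, but the slice is far faster.
import timeit

setup = "from datetime import datetime; t = '1984-07-01'"
parsed = timeit.timeit("datetime.strptime(t, '%Y-%m-%d').month", setup=setup, number=100000)
sliced = timeit.timeit("t[5:7]", setup=setup, number=100000)
print("parse: %.3fs  slice: %.3fs" % (parsed, sliced))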
It's also an opportunity to clean up the data format so COPY will accept it. Note that you can specify the columns with the COPY command: if you weren't doing that with that schema and example file, you'd get errors due to the extra "id" column on the front. ("copy from ... with csv header" may have figured that out, but maybe not... the "header" option may just make it skip the first line.)
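Along those lines, a hedged sketch of loading one split file with an explicit column list (assuming psycopg2 and the file/table naming from the splitting sketch above; leaving id out of the list lets the bigserial default fill it in):

# Single-partition loader sketch (illustrative; the connection string and the
# file/table names are placeholders).
import psycopg2

def load_partition(path, table, dsn="dbname=climate"):
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur, open(path) as f:
            # Explicit column list: the bigserial "id" falls back to its default.
            # The table name comes from our own file naming, not user input.
            cur.copy_expert(
                "COPY %s (station_id, taken, amount, category_id, flag) "
                "FROM STDIN WITH CSV" % table, f)
    finally:
        conn.close()

load_partition("measurement_07_004.csv", "climate.measurement_07_004")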
I've just loaded about 280e6 rows into a postgresql instance myself in a few hours, so it's certainly not impossible. For this initial load, I've turned fsync=off; the plan is to load the backlog and then turn it back on again for regular daily loads. I had to set checkpoint_segments=40 to avoid getting checkpoint warnings in the logs. This is just being loaded onto my dev machine; I'm using a dedicated disk for the database, which is different from the disk used for the xlogs (i.e. I created a tablespace on the big disk and created the database inside that tablespace). The instance has shared_buffers set to 1GB and checkpoint_completion_target set to 0.5. I tried loading some of the partitions in parallel and it didn't provide much improvement, so I suspect the slow disk is the bottleneck rather than the DB itself.
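If the disks can keep up, the same loader sketch can be fanned out over the split files with a small process pool (again just an illustration; the pool size is a guess, and as noted the disk may well remain the bottleneck):

# Parallel variant (sketch): run the hypothetical load_partition() from above
# over every split file with a few worker processes.
from multiprocessing import Pool
import glob
import os

def load_file(path):
    table = "climate." + os.path.splitext(os.path.basename(path))[0]
    load_partition(path, table)

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # 4 workers is an arbitrary starting point
        pool.map(load_file, sorted(glob.glob("measurement_*.csv")))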
Just another 1.7e9 rows to go... it should be finished sometime tomorrow, I hope.