PostgreSQL: speed up a SELECT query in a table with millions of rows

This article looks at how to speed up a slow SELECT query on a PostgreSQL table with millions of rows; the question and answer below may be a useful reference for anyone facing the same problem.

Problem description

My table has more than 4.5 million rows, and my SELECT query is far too slow for my needs.

The table was created with:

CREATE TABLE all_legs (
                carrier TEXT,
                dep_hub TEXT,
                arr_hub TEXT,
                dep_dt TIMESTAMP WITH TIME ZONE,
                arr_dt TIMESTAMP WITH TIME ZONE,
                price_ct INTEGER,
                ... 5 more cols ...,
                PRIMARY KEY (carrier, dep_hub, arr_hub, dep_dt, arr_dt, ...3 other cols...)
                )

When I want to SELECT all rows for a certain date, the query is too slow; it takes between 12 and 20 seconds. My aim is for it to take at most 1 second. I expect the query to return between 0.1% and 1% of the rows in the table.

The query is very simple:

SELECT * FROM all_legs WHERE dep_dt::date = '2017-08-15' ORDER BY price_ct ASC

EXPLAIN ANALYZE returns:

Sort  (cost=197154.69..197212.14 rows=22982 width=696) (actual time=14857.300..14890.565 rows=31074 loops=1)
  Sort Key: price_ct
  Sort Method: external merge  Disk: 5256kB
  ->  Seq Scan on all_legs  (cost=0.00..188419.85 rows=22982 width=696) (actual time=196.738..14581.143 rows=31074 loops=1)
        Filter: ((dep_dt)::date = '2017-08-15'::date)
        Rows Removed by Filter: 4565249
Planning time: 0.572 ms
Execution time: 14908.274 ms

Note: I only learned about this command yesterday, so I still don't fully understand everything it returns.

I have tried using index-only scans, as suggested elsewhere, by running the command CREATE INDEX idx_all_legs ON all_legs(dep_dt); but I did not notice any difference in running time. I also tried creating an index over all columns, since I want all columns returned.

Another thought was to sort all rows by dep_dt, so that searching for all rows fulfilling the condition would be much faster because they would not be scattered across the table. Unfortunately, I don't know how to implement this.
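For what it's worth, PostgreSQL can physically reorder a table along an existing index with the CLUSTER command. A minimal sketch, assuming the idx_all_legs index on dep_dt mentioned above has already been created:

-- Rewrite the table so rows are stored in dep_dt order (takes an exclusive lock while it runs)
CLUSTER all_legs USING idx_all_legs;
-- Refresh planner statistics after the rewrite
ANALYZE all_legs;

Note that CLUSTER is a one-off operation: rows inserted later are not kept in order, so it would have to be repeated from time to time on a growing table.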

Is there a way to make it as fast as I am aiming for?

As suggested in Laurenz's answer, adding the index CREATE INDEX IF NOT EXISTS idx_dep_dt_price ON all_legs(dep_dt, price_ct); and changing the condition in the SELECT to WHERE dep_dt >= '2017-08-15 00:00:00' AND dep_dt < '2017-08-16 00:00:00' has reduced the running time to about a quarter. Even though that is a very good improvement, it still means running times between 2 and 6 seconds.
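Spelled out as statements, that intermediate setup is simply the following (the ORDER BY is the one from the original query):

CREATE INDEX IF NOT EXISTS idx_dep_dt_price ON all_legs(dep_dt, price_ct);

SELECT *
FROM all_legs
WHERE dep_dt >= '2017-08-15 00:00:00'
  AND dep_dt <  '2017-08-16 00:00:00'
ORDER BY price_ct ASC;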

Any additional ideas to reduce the running time even further would be appreciated.

Recommended answer

An index on dep_dt alone will not help, because the WHERE clause applies a cast (dep_dt::date), and PostgreSQL cannot use a plain index on the column for a condition on an expression over that column.

There are two solutions:


1. You could change the query to:

WHERE dep_dt >= '2017-08-15 00:00:00' AND dep_dt < '2017-08-16 00:00:00'

Then the index can be used.

2. Create an index on the expression:

CREATE INDEX ON all_legs(((dep_dt AT TIME ZONE 'UTC')::date));

(or a different time zone) and change the query to

WHERE (dep_dt AT TIME ZONE 'UTC')::date = '2017-08-15'

The AT TIME ZONE is necessary because otherwise the result of the cast would depend on your current TimeZone setting.
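To illustrate the point: the same timestamptz value can fall on different calendar dates depending on the session's TimeZone setting, so dep_dt::date is not a stable expression to index. A small demonstration (time zone names chosen arbitrarily):

SET TimeZone = 'UTC';
SELECT '2017-08-15 23:30:00+00'::timestamptz::date;   -- 2017-08-15
SET TimeZone = 'Europe/Vienna';
SELECT '2017-08-15 23:30:00+00'::timestamptz::date;   -- 2017-08-16 (UTC+2 in August)

This is also why the index has to be built on the AT TIME ZONE expression: the cast from timestamptz to date is not immutable, so PostgreSQL would reject a plain dep_dt::date index expression.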

The first solution is simpler, but the second has the advantage that you can add price_ct to the index, like this:

CREATE INDEX ON all_legs(((dep_dt AT TIME ZONE 'UTC')::date), price_ct);

Then you don't need a sort any more, and your query will be as fast as it theoretically can get.
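For reference, the query that lines up with this two-column expression index would be the original query rewritten against the indexed expression, e.g.:

SELECT *
FROM all_legs
WHERE (dep_dt AT TIME ZONE 'UTC')::date = '2017-08-15'
ORDER BY price_ct ASC;

With the date pinned by the leading index column, the rows for that day are already stored in price_ct order inside the index, so the planner can skip the explicit sort step.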

