Problem description
I have a table with more than 4.5 million rows, and my SELECT query is far too slow for my needs.
The table was created as follows:
CREATE TABLE all_legs (
carrier TEXT,
dep_hub TEXT,
arr_hub TEXT,
dep_dt TIMESTAMP WITH TIME ZONE,
arr_dt TIMESTAMP WITH TIME ZONE,
price_ct INTEGER,
... 5 more cols ...,
PRIMARY KEY (carrier, dep_hub, arr_hub, dep_dt, arr_dt, ...3 other cols...)
)
When I SELECT all rows for a certain date, the query is too slow: it takes between 12 and 20 seconds. My aim is for it to take at most 1 second. I expect the query to return between 0.1% and 1% of the rows in the table.
The query is very simple:
SELECT * FROM all_legs WHERE dep_dt::date = '2017-08-15' ORDER BY price_ct ASC
EXPLAIN ANALYZE returns:
Sort (cost=197154.69..197212.14 rows=22982 width=696) (actual time=14857.300..14890.565 rows=31074 loops=1)
Sort Key: price_ct
Sort Method: external merge Disk: 5256kB
-> Seq Scan on all_legs (cost=0.00..188419.85 rows=22982 width=696) (actual time=196.738..14581.143 rows=31074 loops=1)
Filter: ((dep_dt)::date = '2017-08-15'::date)
Rows Removed by Filter: 4565249
Planning time: 0.572 ms
Execution time: 14908.274 ms
Note: I only learned about this command yesterday, so I still cannot fully interpret everything it returns.
I have tried using index-only scans, as suggested here, by running the command:
CREATE INDEX idx_all_legs ON all_legs(dep_dt);
but I did not notice any difference in running time. I also tried creating an index on all columns, since I want all columns returned.
Another thought was to sort all rows by dep_dt, so that the search for rows fulfilling the condition would be much faster because they would not be scattered. Unfortunately, I don't know how to implement this.
Is there a way to make it as fast as I am aiming for?
As suggested in Laurenz's answer, adding the index
CREATE INDEX IF NOT EXISTS idx_dep_dt_price ON all_legs(dep_dt, price_ct);
and changing the condition in the SELECT to
WHERE dep_dt >= '2017-08-15 00:00:00' AND dep_dt < '2017-08-16 00:00:00'
has reduced the running time to a quarter of what it was. That is a very good improvement, but running times are still between 2 and 6 seconds.
Any additional ideas to reduce the running time even further would be appreciated.
Recommended answer
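The index on dep_dt won't help, because the query filters on the expression dep_dt::date rather than on the column itself.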
Two solutions:
You could either change the query to:
WHERE dep_dt >= '2017-08-15 00:00:00' AND dep_dt < '2017-08-16 00:00:00'
Then the index can be used.
Or create an index on an expression:
CREATE INDEX ON all_legs(((dep_dt AT TIME ZONE 'UTC')::date));
(or a different time zone) and change the query to
WHERE (dep_dt AT TIME ZONE 'UTC')::date = '2017-08-16'
The AT TIME ZONE is necessary because otherwise the result of the cast would depend on your current TimeZone setting.
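To make the dependency concrete, here is a small sketch. The timestamp literal and the 'Asia/Tokyo' zone are illustrative values, not taken from the question; they were picked only because they straddle midnight.

```sql
-- Illustrative values only: 23:30 UTC falls on the next calendar day in UTC+9.
SET TimeZone = 'UTC';
SELECT '2017-08-15 23:30:00+00'::timestamptz::date;   -- 2017-08-15

SET TimeZone = 'Asia/Tokyo';                           -- UTC+9
SELECT '2017-08-15 23:30:00+00'::timestamptz::date;   -- 2017-08-16

-- Pinning the zone makes the result independent of the session setting.
SELECT ('2017-08-15 23:30:00+00'::timestamptz AT TIME ZONE 'UTC')::date;  -- 2017-08-15
```

Because the plain cast can return different dates for the same stored value, PostgreSQL does not treat it as immutable and will refuse it in an index definition. The expression with an explicit AT TIME ZONE is immutable and can be indexed.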
The first solution is simpler, but the second has the advantage that you can add price_ct to the index like this:
CREATE INDEX ON all_legs(((dep_dt AT TIME ZONE 'UTC')::date), price_ct);
Then you don't need a sort any more, and your query will be as fast as it can theoretically get.
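Either approach can be checked by re-running EXPLAIN and looking for an index scan in place of the Seq Scan. The following is a rough sketch, not a guaranteed outcome: the index names are hypothetical, the date is the one from the question, and the plan the planner actually picks depends on table statistics and configuration.

```sql
-- Solution 1: plain index plus a range predicate on the bare column.
-- idx_all_legs_dep_dt is a hypothetical name.
CREATE INDEX IF NOT EXISTS idx_all_legs_dep_dt ON all_legs (dep_dt);

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM all_legs
WHERE dep_dt >= '2017-08-15 00:00:00'
  AND dep_dt <  '2017-08-16 00:00:00'   -- literals are read in the session TimeZone
ORDER BY price_ct;                      -- still needs a sort step

-- Solution 2: expression index that also carries price_ct.
-- idx_all_legs_dep_day_price is a hypothetical name.
CREATE INDEX IF NOT EXISTS idx_all_legs_dep_day_price
    ON all_legs (((dep_dt AT TIME ZONE 'UTC')::date), price_ct);

-- Refresh statistics so the planner knows about the indexed expression.
ANALYZE all_legs;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM all_legs
WHERE (dep_dt AT TIME ZONE 'UTC')::date = '2017-08-15'  -- must match the index expression exactly
ORDER BY price_ct;                      -- rows for one day are already ordered by price_ct in the index
```

In the second plan the Sort node should disappear if the index is used, since within a single date value the index entries are already ordered by price_ct.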