问题描述
我有下表,正在尝试查找数十万个城市的县代码。
I have the following tables and am trying to look up county codes for a list of several hundred thousand cities.
create table counties (
zip_code_from char(5) not null,
zip_code_thru char(5) not null,
county_code char(3) not null
);
create table cities (
city text not null,
zip_code char(5) not null
);
我的第一种方法是在联接中使用 between:
My first approach was using a "between" in the join:
select
ci.city, ci.zip_code, co.county_code
from
cities ci
join counties co on
co.zip_code between ci.zip_code_from and ci.zip_code_thru
我在Oracle中知道在世界上,这是令人皱眉的,确实,表演似乎很惨。处理大约16,000个城市需要8分钟以上的时间。邮政编码表约有80,000条记录。我猜这种语法是一种美化的交叉联接?
I know in the Oracle world, this was frowned upon, and indeed the performance appears to be miserable. It takes over 8 minutes to process around 16,000 cities. The zip code table has around 80,000 records. I'm guessing that this syntax is a glorified cross-join?
from和thru代码都已编入索引,并且我可以控制结构,因此可以更改
Both the from and thru codes are indexed, and I have control over the structures, so I can change the table if it helps.
我唯一的想法是继续将表格扩展到所有可能的值-类似于此:
My only other thought is to go ahead and expand the table out to all possible values -- something similar to this:
select
generate_series (
cast (zip_code_from as int),
cast (zip_code_thru as int)
) as zip_code,
*
from counties
这将扩展数据可以处理超过200,000条记录,这没什么大不了的,但是我不确定这是否是我查询不可怕的唯一手段。
This would expand the data to over 200,000 records, which isn't a big deal, but I wasn't sure if this is my only recourse to have queries that aren't horrible.
I '我猜想即使是即时执行并且没有索引也比连接中的
更可取,但是我希望有另外一种选择,无论是我的SQL和/或我可以处理的表本身的结构。
I'm guessing that even doing that on the fly and not having indexes would be preferable to the between
in my join, but I was hoping there is an alternative, either in terms of my SQL and/or something I can do with the structure of the table itself.
我已经看到这个问题是在其他DBMS平台上发布的,但是我已经能够在其他数据库中不可能(或不可行)地实现PostgreSQL的微型奇迹,所以我希望能有所作为我错过了。
I've seen this question posted for other DBMS platforms, but I've been able to pull off mini-miracles with PostgreSQL that weren't possible (or practical) in other databases, so I was hopeful there is something I've missed.
推荐答案
一个月后,这种情况再次浮出水面,我决定测试一些理论。
Months later, this has cropped its head again, and I decided to test some of my theories.
原始查询:
select
ci.city, ci.zip_code, co.fips_code
from
cities ci
join counties co on
ci.zip_code between co.from_zip_code and co.thru_zip_code
实际上执行笛卡尔坐标。该查询返回34,000行,耗时597秒。
Does in fact implement a cartesian. The query returns 34,000 rows and takes 597 seconds.
如果我预爆炸邮政编码,则分为离散记录:
If I "pre-explode" the zip code ranges into discrete records:
with exploded_zip as (
select
generate_series (
cast (from_zip_code as int),
cast (thru_zip_code as int)
)::text as zip_code,
*
from counties
)
select
ci.city, ci.zip_code, co.fips_code
from
cities ci
join exploded_zip co on
ci.zip_code = co.zip_code
查询返回的行完全相同,但在2.8秒内完成。
The query returns the exact same rows but finishes in 2.8 seconds.
因此,最重要的是使用在连接中(
之间或任何不等式之间)是一个非常糟糕的主意。
So it seems the bottom line is that using a between
in a join (or any inequality) is a really bad idea.
这篇关于PostgreSQL在两个值之间联接的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!