PostgreSQL在两个值之间联接

PostgreSQL在两个值之间联接

本文介绍了PostgreSQL在两个值之间联接的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有下表,正在尝试查找数十万个城市的县代码。

I have the following tables and am trying to look up county codes for a list of several hundred thousand cities.

create table counties (
  zip_code_from  char(5) not null,
  zip_code_thru  char(5) not null,
  county_code    char(3) not null
);

create table cities (
  city      text    not null,
  zip_code  char(5) not null
);

我的第一种方法是在联接中使用 between:

My first approach was using a "between" in the join:

select
  ci.city, ci.zip_code, co.county_code
from
  cities ci
  join counties co on
    co.zip_code between ci.zip_code_from and ci.zip_code_thru

我在Oracle中知道在世界上,这是令人皱眉的,确实,表演似乎很惨。处理大约16,000个城市需要8分钟以上的时间。邮政编码表约有80,000条记录。我猜这种语法是一种美化的交叉联接?

I know in the Oracle world, this was frowned upon, and indeed the performance appears to be miserable. It takes over 8 minutes to process around 16,000 cities. The zip code table has around 80,000 records. I'm guessing that this syntax is a glorified cross-join?

from和thru代码都已编入索引,并且我可以控制结构,因此可以更改

Both the from and thru codes are indexed, and I have control over the structures, so I can change the table if it helps.

我唯一的想法是继续将表格扩展到所有可能的值-类似于此:

My only other thought is to go ahead and expand the table out to all possible values -- something similar to this:

select
  generate_series (
    cast (zip_code_from as int),
    cast (zip_code_thru as int)
  ) as zip_code,
  *
from counties

这将扩展数据可以处理超过200,000条记录,这没什么大不了的,但是我不确定这是否是我查询不可怕的唯一手段。

This would expand the data to over 200,000 records, which isn't a big deal, but I wasn't sure if this is my only recourse to have queries that aren't horrible.

I '我猜想即使是即时执行并且没有索引也比连接中的 更可取,但是我希望有另外一种选择,无论是我的SQL和/或我可以处理的表本身的结构。

I'm guessing that even doing that on the fly and not having indexes would be preferable to the between in my join, but I was hoping there is an alternative, either in terms of my SQL and/or something I can do with the structure of the table itself.

我已经看到这个问题是在其他DBMS平台上发布的,但是我已经能够在其他数据库中不可能(或不可行)地实现PostgreSQL的微型奇迹,所以我希望能有所作为我错过了。

I've seen this question posted for other DBMS platforms, but I've been able to pull off mini-miracles with PostgreSQL that weren't possible (or practical) in other databases, so I was hopeful there is something I've missed.

推荐答案

一个月后,这种情况再次浮出水面,我决定测试一些理论。

Months later, this has cropped its head again, and I decided to test some of my theories.

原始查询:

select
  ci.city, ci.zip_code, co.fips_code
from
  cities ci
  join counties co on
    ci.zip_code between co.from_zip_code and co.thru_zip_code

实际上执行笛卡尔坐标。该查询返回34,000行,耗时597秒。

Does in fact implement a cartesian. The query returns 34,000 rows and takes 597 seconds.

如果我预爆炸邮政编码,则分为离散记录:

If I "pre-explode" the zip code ranges into discrete records:

with exploded_zip as (
  select
    generate_series (
      cast (from_zip_code as int),
      cast (thru_zip_code as int)
    )::text as zip_code,
    *
  from counties
)
select
  ci.city, ci.zip_code, co.fips_code
from
  cities ci
  join exploded_zip co on
    ci.zip_code = co.zip_code

查询返回的行完全相同,但在2.8秒内完成。

The query returns the exact same rows but finishes in 2.8 seconds.

因此,最重要的是使用在连接中(之间或任何不等式之间)是一个非常糟糕的主意。

So it seems the bottom line is that using a between in a join (or any inequality) is a really bad idea.

这篇关于PostgreSQL在两个值之间联接的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-13 16:30