问题描述
Redshift允许将多个列指定为SORTKEY
列,但是大多数最佳实践文档的编写都好像只有一个SORTKEY.
Redshift allows designating multiple columns as SORTKEY
columns, but most of the best-practices documentation is written as if there were only a single SORTKEY.
如果我用SORTKEY (COL1, COL2)
创建一个表,是否意味着所有列都按COL1,然后按COL2排序存储?又或者,由于它是一列存储,所以每一列都以不同的顺序存储? IE.
If I create a table with SORTKEY (COL1, COL2)
, does that mean that all columns are stored sorted by COL1, then COL2? Or maybe, since it is a columnar store, each column gets stored in a different order? I.e. COL1 in COL1 order, COL2 in COL2 order, and the other columns unordered?
我的情况是,我有一个带有type_id和timestamp列的表.数据大致按时间戳顺序到达.大多数查询都是针对type_id和timestamp进行限制的.通常,type_id子句更具体,这意味着通过查看type_id子句可以比通过查看timestamp子句排除更多百分比的行.因此,type_id是DISTKEY.我试图了解SORTKEY (type_id)
,SORTKEY (stamp)
,SORTKEY (type_id,stamp)
,SORTKEY (stamp,type_id)
的优缺点.
My situation is that I have a table with (among others) a type_id and a timestamp column. Data arrives roughly in timestamp order. Most queries are joined against / restricted by both type_id and timestamp. Usually the type_id clauses are more specific, meaning a much larger percentage of rows can be excluded by looking at the type_id clause than by looking at the timestamp clause. type_id is the DISTKEY for this reason. I'm trying to understand the pros and cons of SORTKEY (type_id)
, SORTKEY (stamp)
, SORTKEY (type_id,stamp)
, SORTKEY (stamp,type_id)
.
谢谢.
推荐答案
如果声明SORTKEY(COL1, COL2)
,则所有列将按COL1
排序,然后按COL2
进行排序,就像完成ORDER BY (COL1, COL2)
一样.
If you declare SORTKEY(COL1, COL2)
, all columns will be sorted by COL1
, then COL2
as if ORDER BY (COL1, COL2)
was done.
如果您使用SORTKEY
来加快JOIN和AFAIU的速度,那么只要您在将要连接的表上使用相同的SORTKEY
即可,因为这是合并联接.
If you are using SORTKEY
to speed up a JOIN, AFAIU it doesn't matter so long as you use the same SORTKEY
on the tables that will be joined because what happens is a merge join.
如果COL1
与type_id
一样具有高度选择性,则意味着只有很少的行具有相同的type_id
.因此,尽管可以将另一列添加到SORTKEY中,但是由于大多数行消除已经发生,因此它的实用程序受到限制.
If COL1
is highly selective like your type_id
, it means there are only small numbers of rows which has the same type_id
. Therefore although you can add another column to SORTKEY, its utility is limited because most of the row elimination has already happened.
如果COL1
不像您的stamp
那样具有高度选择性(顺便说一句;我希望它比type_id
更具选择性?无论如何..),这意味着按不会消除那么多行.因此,声明第二个排序键更有意义.但是,这比其他方法效率低,因为尽早消除行会更便宜.如果您有时通过stamp
进行过滤,但没有通过type_id
进行过滤,那么这样做可能是有道理的.
If COL1
is not highly selective like your stamp
(which is a bit weird btw; I would have expected it to be more selective than type_id
? Anyways..), it means that filtering by stamp
won't eliminate that much rows. So it makes more sense to declare a second sort key. However, this is less efficient than the other way around as eliminating rows earlier would be cheaper. If you sometimes filter by stamp
but not by type_id
, it may make sense to do this though.
这篇关于拥有多个sortkey列是什么意思?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!