Problem description
Redshift allows designating multiple columns as SORTKEY columns, but most of the best-practices documentation is written as if there were only a single SORTKEY.
If I create a table with SORTKEY (COL1, COL2), does that mean that all columns are stored sorted by COL1, then COL2? Or maybe, since it is a columnar store, each column gets stored in a different order? I.e. COL1 in COL1 order, COL2 in COL2 order, and the other columns unordered?
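My situation is that I have a table with (among others) a type_id column and a timestamp column. Data arrives in roughly timestamp order. Most queries are joined against or restricted by both type_id and timestamp. Usually the type_id clause is more specific, meaning that a much larger percentage of rows can be excluded by looking at the type_id clause than by looking at the timestamp clause. For this reason, type_id is the DISTKEY. I am trying to understand the ramifications of SORTKEY (type_id), SORTKEY (stamp), SORTKEY (type_id,stamp), ...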
If COL1 is not highly selective, like your stamp (which is a bit weird btw; I would have expected it to be more selective than type_id? Anyways..), it means that filtering by stamp won't eliminate that many rows. So it makes more sense to declare a second sort key. However, this is less efficient than the other way around, as eliminating rows earlier would be cheaper. If you sometimes filter by stamp but not by type_id, it may make sense to do this though.
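To make the trade-off above a bit more concrete, here is a rough toy model in Python. It is not how Redshift is implemented. It only assumes that whole rows are sorted by the compound key, that the sorted rows are cut into fixed-size blocks, and that a block can be skipped when its per-column min/max range (a zone map) does not overlap the filter. The block size, the synthetic data, and all function names (build_blocks, zone_map, blocks_scanned) are made up for illustration.

```python
from itertools import product

BLOCK_ROWS = 100  # hypothetical block size in rows, chosen only for this sketch


def build_blocks(rows, sort_cols):
    """Sort whole rows by the compound key, then cut them into fixed-size blocks."""
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in sort_cols))
    return [ordered[i:i + BLOCK_ROWS] for i in range(0, len(ordered), BLOCK_ROWS)]


def zone_map(block, col):
    """Return the (min, max) of one column within one block."""
    values = [r[col] for r in block]
    return min(values), max(values)


def blocks_scanned(blocks, predicates):
    """Count blocks whose min/max ranges overlap every predicate range."""
    count = 0
    for block in blocks:
        keep = True
        for col, (lo, hi) in predicates.items():
            bmin, bmax = zone_map(block, col)
            if bmin > hi or bmax < lo:
                keep = False  # this block cannot contain matching rows
                break
        if keep:
            count += 1
    return count


# Synthetic data: 20 type_ids x 500 timestamps = 10,000 rows (100 blocks).
rows = [{"type_id": t, "stamp": s} for t, s in product(range(20), range(500))]

by_type_then_stamp = build_blocks(rows, ("type_id", "stamp"))
by_stamp_then_type = build_blocks(rows, ("stamp", "type_id"))

type_only = {"type_id": (7, 7)}
stamp_only = {"stamp": (100, 149)}
both = {"type_id": (7, 7), "stamp": (100, 149)}

for name, blocks in [("(type_id, stamp)", by_type_then_stamp),
                     ("(stamp, type_id)", by_stamp_then_type)]:
    print(name,
          blocks_scanned(blocks, type_only),
          blocks_scanned(blocks, stamp_only),
          blocks_scanned(blocks, both))
```

On this synthetic data, the (type_id, stamp) order touches 5, 20 and 1 blocks for the type_id-only, stamp-only and combined filters, while the (stamp, type_id) order touches 100, 10 and 10. So leading with the column you always filter on helps most, and leading with stamp only pays off for queries that filter by stamp alone. Real tables will behave differently depending on data distribution and block sizes, so treat the numbers as an illustration only.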