问题描述
执行两个相同的请求,但DISTINCT关键字给出了意外的结果。如果没有关键字,则结果可以,但是使用DISTINCT时,看起来好像where子句被忽略了。为什么?
Executing two identical requests but the DISTINCT keyword gives unexpected results. Without the keyword, the result is ok but with DISTINCT, it looks like the where clause is ignored. Why ?
Cqlsh版本:
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.6 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
所考虑的表:
DESCRIBE TABLE events;
CREATE TABLE events (
userid uuid,
"timestamp" timestamp,
event_type text,
data text,
PRIMARY KEY (userid, "timestamp", event_type)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
表内容:
SELECT * FROM events;
userid | timestamp | event_type | data
--------------------------------------+--------------------------+------------+------
aaaaaaaa-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:06:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:06:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:07:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:08:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:09:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:10:17+0100 | toto | null
(6 rows)
Request1:不带DISTINCT的请求
Request1: Request without DISTINCT
SELECT userid FROM events WHERE timestamp > '1970-01-17 09:07:17+0100' ALLOW FILTERING;
userid
--------------------------------------
4271a78f-be1c-44ab-a0e8-f25cf6064b0e
4271a78f-be1c-44ab-a0e8-f25cf6064b0e
4271a78f-be1c-44ab-a0e8-f25cf6064b0e
(3 rows)
Request2:与DISTINCT相同的请求
Request2: Same request with DISTINCT
SELECT DISTINCT userid FROM events WHERE timestamp > '1970-01-17 09:07:17+0100' ALLOW FILTERING;
userid
--------------------------------------
aaaaaaaa-be1c-44ab-a0e8-f25cf6064b0e
4271a78f-be1c-44ab-a0e8-f25cf6064b0e
(2 rows)
编辑1
这里是一些上下文。
该表事件,它受到很多写入的影响,每秒接收大约1k次插入,并且我有一个批处理脚本,每5分钟检查一次这些事件。
此批处理脚本有2个需求:
1-获取最近5分钟内处于活动状态的所有用户ID(即最近5分钟内事件中存在的每个用户ID)
2-获取与这些用户ID相关的所有事件(不仅适用于最后5分钟)
EDIT 1
here is some context.
This table "events" is subject to a lot of writes, it receives around ~1k insertions per second and I have a batch script that checks those events every 5 minutes.
This batch script has 2 needs:
1- get all userids that have been active in the last 5 minutes (i.e every userid present in the events from the last 5 minutes)
2- get all events related to those userids (not only for the last 5 minutes)
我以前有两个不同的表来处理这个问题。一个表 activeusers用于第一个请求,而事件表就像我在这里为第二个请求所述。我的问题在于,它需要服务器接收事件时在两个不同的表中进行写入。因此,我仅使用事件表尝试了此操作。
I used to have two different tables to handle this. One table "activeusers" for the first request and the "events" table like I have described here for the second request. My problem with that is just that it requires from my server to write in two different tables when it receives an event. So I tried this using only the events table.
推荐答案
之所以会这样,是因为在Cassandra中,CQL DISTINCT
旨在仅返回表(列族)的分区(行)键...必须唯一。因此, WHERE
子句仅在与 DISTINCT
一起使用时才可以对分区键进行操作(在您的情况下不是)非常有用)。如果将 DISTINCT
取出,则可以使用 WHERE
来评估每个分区键中的聚类(列)键(尽管使用允许过滤
)。
It happens that way because in Cassandra CQL DISTINCT
is designed to return only the partition (row) keys of your table (column family)...which must be unique. Therefore, the WHERE
clause can only operate on partition keys when used with DISTINCT
(which in your case, isn't terribly useful). If you take the DISTINCT
out, WHERE
can then be used to evaluate the clustering (column) keys within each partition key (albeit, with ALLOW FILTERING
).
我不得不提到允许过滤
不是您应该做的所有事情……而且绝对不在生产中。如果该查询是您需要经常运行的查询(在某个时间戳记
之后查询用户ID
的事件),那么我建议改为按 event_type
分区数据:
I feel compelled to mention that ALLOW FILTERING
is not something you should be doing a whole lot of...and definitely not in production. If that query is one you need to run often (querying events for userids
after a certain timestamp
) then I would suggest partitioning your data by event_type
instead:
PRIMARY KEY (event_type, "timestamp", userid)
然后,您无需允许过滤。
SELECT userid FROM events WHERE event_type='toto' AND timestamp > '1970-01-17 09:07:17+0100'
不了解您的应用程序或用例,对您可能有用也可能没有用。但请以它为例,这表明可能有更好的方法来构建模型来满足您的查询模式。查阅有关如何为该问题建模的更多想法。
Without knowing anything about your application or use case, that may or may not be useful to you. But consider it as an example, and as an indication that there may be a better way build your model to satisfy your query pattern(s). Check out Patrick McFadin's article on timeseries data modeling for more ideas on how to model for this problem.
这篇关于SELECT DISTINCT cql忽略WHERE子句的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!