SELECT DISTINCT cql忽略WHERE子句

本文介绍了SELECT DISTINCT cql忽略WHERE子句的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

执行两个相同的请求，但DISTINCT关键字给出了意外的结果。如果没有关键字，则结果可以，但是使用DISTINCT时，看起来好像where子句被忽略了。为什么？

Executing two identical requests but the DISTINCT keyword gives unexpected results. Without the keyword, the result is ok but with DISTINCT, it looks like the where clause is ignored. Why ?

Cqlsh版本：

Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.6 | CQL spec 3.1.1 | Thrift protocol 19.39.0]

所考虑的表：

DESCRIBE TABLE events;

CREATE TABLE events (
  userid uuid,
  "timestamp" timestamp,
  event_type text,
  data text,
  PRIMARY KEY (userid, "timestamp", event_type)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

表内容：

SELECT * FROM events;

 userid                               | timestamp                | event_type | data
--------------------------------------+--------------------------+------------+------
 aaaaaaaa-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:06:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:06:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:07:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:08:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:09:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:10:17+0100 |       toto | null

(6 rows)

Request1：不带DISTINCT的请求

Request1: Request without DISTINCT

SELECT userid FROM events WHERE timestamp > '1970-01-17 09:07:17+0100' ALLOW FILTERING;

 userid
--------------------------------------
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e

(3 rows)

Request2：与DISTINCT相同的请求

Request2: Same request with DISTINCT

SELECT DISTINCT userid FROM events WHERE timestamp > '1970-01-17 09:07:17+0100' ALLOW FILTERING;

 userid
--------------------------------------
 aaaaaaaa-be1c-44ab-a0e8-f25cf6064b0e
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e

(2 rows)

编辑1

这里是一些上下文。

该表事件，它受到很多写入的影响，每秒接收大约1k次插入，并且我有一个批处理脚本，每5分钟检查一次这些事件。

此批处理脚本有2个需求：

1-获取最近5分钟内处于活动状态的所有用户ID（即最近5分钟内事件中存在的每个用户ID）

2-获取与这些用户ID相关的所有事件（不仅适用于最后5分钟）

EDIT 1
here is some context.
This table "events" is subject to a lot of writes, it receives around ~1k insertions per second and I have a batch script that checks those events every 5 minutes.
This batch script has 2 needs:
1- get all userids that have been active in the last 5 minutes (i.e every userid present in the events from the last 5 minutes)
2- get all events related to those userids (not only for the last 5 minutes)

我以前有两个不同的表来处理这个问题。一个表 activeusers用于第一个请求，而事件表就像我在这里为第二个请求所述。我的问题在于，它需要服务器接收事件时在两个不同的表中进行写入。因此，我仅使用事件表尝试了此操作。

I used to have two different tables to handle this. One table "activeusers" for the first request and the "events" table like I have described here for the second request. My problem with that is just that it requires from my server to write in two different tables when it receives an event. So I tried this using only the events table.

推荐答案

之所以会这样，是因为在Cassandra中，CQL DISTINCT 旨在仅返回表（列族）的分区（行）键...必须唯一。因此， WHERE 子句仅在与 DISTINCT 一起使用时才可以对分区键进行操作（在您的情况下不是）非常有用）。如果将 DISTINCT 取出，则可以使用 WHERE 来评估每个分区键中的聚类（列）键（尽管使用允许过滤）。

It happens that way because in Cassandra CQL DISTINCT is designed to return only the partition (row) keys of your table (column family)...which must be unique. Therefore, the WHERE clause can only operate on partition keys when used with DISTINCT (which in your case, isn't terribly useful). If you take the DISTINCT out, WHERE can then be used to evaluate the clustering (column) keys within each partition key (albeit, with ALLOW FILTERING).

我不得不提到允许过滤不是您应该做的所有事情……而且绝对不在生产中。如果该查询是您需要经常运行的查询（在某个时间戳记之后查询用户ID 的事件），那么我建议改为按 event_type 分区数据：

I feel compelled to mention that ALLOW FILTERING is not something you should be doing a whole lot of...and definitely not in production. If that query is one you need to run often (querying events for userids after a certain timestamp) then I would suggest partitioning your data by event_type instead:

PRIMARY KEY (event_type, "timestamp", userid)

然后，您无需允许过滤。

SELECT userid FROM events WHERE event_type='toto' AND timestamp > '1970-01-17 09:07:17+0100'

不了解您的应用程序或用例，对您可能有用也可能没有用。但请以它为例，这表明可能有更好的方法来构建模型来满足您的查询模式。查阅有关如何为该问题建模的更多想法。

Without knowing anything about your application or use case, that may or may not be useful to you. But consider it as an example, and as an indication that there may be a better way build your model to satisfy your query pattern(s). Check out Patrick McFadin's article on timeseries data modeling for more ideas on how to model for this problem.

这篇关于SELECT DISTINCT cql忽略WHERE子句的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！

userIds