Problem description
I'm using Cassandra 2.1.2 with the corresponding DataStax Java driver and the Object mapping provided by DataStax.
I have the following table definition:
CREATE TABLE IF NOT EXISTS ses.tim (id text PRIMARY KEY, start bigint, cid int);
the mapping:
@Table(keyspace = "ses", name = "tim")
class MyObj {
@PartitionKey
private String id;
private Long start;
private int cid;
}
the accessor:
@Accessor
interface MyAccessor {
@Query("SELECT * FROM ses.tim WHERE id = :iid")
MyObj get(@Param("iid") String id);
@Query("SELECT * FROM ses.tim WHERE start <= :sstart")
Result<MyObj> get(@Param("sstart") long start);
}
As indicated in the accessor, I want to run a query that returns every row where 'start' is less than or equal to a specific value.
With this table definition that is simply not possible, so I tried creating a secondary index:
CREATE INDEX IF NOT EXISTS myindex ON ses.tim (start);
This does not seem to work either (I read a lot of explanations of why it was decided not to support this, but I still don't understand why somebody would impose such restrictions, anyhow..)
So, as far as I understand, we have to have at least one equality in the WHERE clause:
@Query("SELECT * FROM ses.tim WHERE cid = :ccid AND start <= :sstart")
CREATE INDEX IF NOT EXISTS myindex2 ON ses.tim (cid);
If this worked, I would have to know ALL possible values for cid, query each of them separately and do the rest on the client... but the error I get is:
Cannot execute this query as it might involve data filtering and thus may have unpredictable performance
then I tried
id text, start bigint, cid int, PRIMARY KEY (id, start, cid)
with
@Table(keyspace = "ses", name = "tim")
class MyObj {
@PartitionKey
private String id;
@ClusteringColumn(0)
private Long start;
@ClusteringColumn(1)
private int cid;
}
but still without luck.
Furthermore, I tried making 'start' the partition key, but then it can again only be queried with equality...
What am I missing? How can I get results for this type of query?
EDIT: updated the version to the correct one
You could consider denormalizing your data if you have different query-ability needs for the same set of data. Based on your question, it sounds like you want the following:
- Query by id
- Query by start < X
The first query works fine with your current schema, as you indicated. The second query, however, cannot work as is without a secondary index, which will be slow for reasons you have already investigated (I always point to this blog post with respect to secondary indexes).
You indicated that you did not want to partition on cid since you would need to know all possible values for cid.
Three ideas I can think of:
Create a separate table with a dummy partition key so that all of your data is stored in the same partition. This can be problematic, though, if you have many entries: it creates a super-wide partition and hotspots on whichever nodes hold that data. How many do you plan on having?
create table if not exists tim ( dummy int, start bigint, id text, cid int, primary key (dummy, start) );
You could then make queries like:
select * from tim where dummy=0 and start <= 10;
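To see why the dummy-partition table makes the range query possible, here is a rough in-memory model in Java. It is not driver code and not how Cassandra is implemented internally; it only assumes a partition can be pictured as rows kept sorted by the clustering column start. The class DummyPartitionSketch, its Row type and the method startAtMost are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of the single partition (dummy = 0) of the table above.
class DummyPartitionSketch {

    // Hypothetical row type mirroring the table's columns.
    static final class Row {
        final String id;
        final long start;
        final int cid;

        Row(String id, long start, int cid) {
            this.id = id;
            this.start = start;
            this.cid = cid;
        }
    }

    // Rows kept sorted by the clustering column `start`.
    private final NavigableMap<Long, Row> rowsByStart = new TreeMap<>();

    // Behaves like an upsert: a row with an existing `start` replaces the
    // old one, because (dummy, start) is the whole primary key.
    void insert(Row row) {
        rowsByStart.put(row.start, row);
    }

    // Models: select * from tim where dummy=0 and start <= max;
    // headMap(max, true) is a view of the sorted rows up to and including max,
    // so rows with a larger `start` are never visited.
    List<Row> startAtMost(long max) {
        return new ArrayList<>(rowsByStart.headMap(max, true).values());
    }

    public static void main(String[] args) {
        DummyPartitionSketch partition = new DummyPartitionSketch();
        partition.insert(new Row("a", 5L, 1));
        partition.insert(new Row("b", 10L, 2));
        partition.insert(new Row("c", 20L, 1));
        for (Row row : partition.startAtMost(10L)) {
            System.out.println(row.id + " start=" + row.start);
        }
    }
}
```

One thing the model makes visible: with primary key (dummy, start), two rows with the same start share a primary key, so the later write replaces the earlier one. If start values can repeat, adding id as a second clustering column would be one way to keep both rows.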
A second option is to use ALLOW FILTERING on your original table, which will still do an expensive range query and filter through the data.
select * from tim where start <= 10 ALLOW FILTERING;
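As a rough picture of why this is expensive, the sketch below models the filtering query on the original table as a scan over every stored row followed by a filter. It is a plain Java toy, not driver code and not Cassandra's actual read path; AllowFilteringSketch, Row, startAtMost and rowsExamined are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the original table, where `id` is the only key column.
class AllowFilteringSketch {

    // Hypothetical row type mirroring the table's columns.
    static final class Row {
        final String id;
        final long start;
        final int cid;

        Row(String id, long start, int cid) {
            this.id = id;
            this.start = start;
            this.cid = cid;
        }
    }

    // Rows are reachable by id only; there is no ordering by `start`.
    private final Map<String, Row> rowsById = new HashMap<>();
    private long rowsExamined = 0;

    void insert(Row row) {
        rowsById.put(row.id, row);
    }

    // Models: select * from tim where start <= max ALLOW FILTERING;
    // every row is read and then kept or discarded.
    List<Row> startAtMost(long max) {
        List<Row> matches = new ArrayList<>();
        for (Row row : rowsById.values()) {
            rowsExamined++;
            if (row.start <= max) {
                matches.add(row);
            }
        }
        return matches;
    }

    // Total number of rows read so far, across all queries.
    long rowsExamined() {
        return rowsExamined;
    }
}
```

In this model the work grows with the total number of rows in the table, not with the number of rows that match, which is the cost the error message about unpredictable performance is warning about.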
Another option is to use something like the spark-connector to create a spark job that makes the query. The connector will break up an expensive range query into smaller tasks and map the data to RDDs, allowing you flexibility to make more complex queries with good performance.