Cassandra分组和过滤结果

Cassandra分组和过滤结果

本文介绍了Cassandra分组和过滤结果的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图模仿这样的:
给出一个表 test

CREATE TABLE myspace.test (
item_id text,
sub_id text,
quantity bigint,
status text,
PRIMARY KEY (item_id, sub_id)



在SQL中,我们可以:

In SQL, we could do:

select * from (select item_id, sum(quantity) as quan
               from test where status <> 'somevalue') sub
where sub.quan >= 10;

即分组 item_id ,然后过滤掉小于10的结果。

i.e. group by item_id and then filter out the results with less than 10.

Cassandra不是为这种东西设计的,虽然我可以模仿 group by 使用:

Cassandra is not designed for this kind of stuff though I could mimic group by using user-defined aggregate functions:

CREATE FUNCTION group_sum_state
   (state map<text, bigint>, item_id text, val bigint)
CALLED ON NULL INPUT
RETURNS map<text, bigint>
LANGUAGE java
AS $$Long current = (Long)state.get(item_id);
if(current == null) current = 0l;
state.put(item_id, current + val); return state;$$;

CREATE AGGREGATE group_sum(text, bigint)
SFUNC group_sum_state
STYPE map<text, bigint>
INITCOND {  }

使用 code>(可能这将是非常糟糕的性能,但仍然):

And use it as group by (probably this is going to have very bad performance, but still):

cqlsh:myspace> select group_sum(item_id, quantity) from test;

mysales_data.group_sum(item_id, quantity)
-------------------------------------------
     {'123': 33, '456': 14, '789': 15}

但是似乎不可能通过映射值进行过滤,既不能使用聚合的final函数,也不能使用单独的函数。我可以定义一个这样的函数:

But it seems to be impossible to do filtering by map values, neither with final function for the aggregate nor with a separate function. I could define a function like this:

CREATE FUNCTION myspace.filter_group_sum
                (group map<text, bigint>, vallimit bigint)
CALLED ON NULL INPUT
RETURNS map<text, bigint>
LANGUAGE java
AS $$
java.util.Iterator<java.util.Map.Entry<String, Long>> entries =
               group.entrySet().iterator();
while(entries.hasNext()) {
    Long val = entries.next().getValue();
    if (val < vallimit)
        entries.remove();
};
return group;$$;

但是没有办法调用它并传递一个常数:

But there is no way to call it and pass a constant:

select filter_group_sum(group_sum(item_id, quantity), 15) from test;
SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query]
message="line 1:54 no viable alternative at input '15'
(...(group_sum(item_id, quantity), [15]...)">

它抱怨常数 15

对不起的长篇文章,我需要提供所有的细节来解释我需要什么
所以我的问题是:

Sorry for the long post, I need to provide all the details to explain what I need.So my questions are:


  1. 在Cassandra中有一个常量传递给用户定义的函数
  2. 更一般的问题:什么是适当的数据设计Cassandra覆盖这样的用例一个真实的 - 时间查询服务应用程序?说我有一个Web应用程序从UI界限,并需要返回所有的项目总额数量大于给定的限制?

  1. Is there a way pass in a constant to a user-defined function in Cassandra. Or what alternatives do I have to implemented filtered group by.
  2. More general question: what is the proper data design for Cassandra to cover such a use-case for a real-time query-serving application? Say I have a web app that takes the limit from the UI and needs to return back all the items that total quantity larger than the given limit? The tables are going to quite large, like 10 billions of records.


推荐答案

Vanilla Cassandra是特别查询的一个不好的选择。 DataStax Enterprise通过与Spark和Solr的集成添加了一些此功能。 Spark集成也是开源的,但是您不想为低延迟查询执行此操作。如果你需要实时查询,你将不得不在Cassandra之外进行聚合(例如在Spark或Storm中),然后写回应用程序要消耗的聚合。您还可以查看Stratio的,这可能会帮助您解决部分查询。

Vanilla Cassandra is a poor choice for ad hoc queries. DataStax Enterprise has added some of this functionality via integrations with Spark and Solr. The Spark integration is also open source, but you wouldn't want to do this for low-latency queries. If you need real-time queries, you're going to have to aggregate outside of Cassandra (in Spark or Storm, for example), then write back the aggregates to be consumed by your app. You can also look at Stratio's Lucene integration, which might help you for some of your queries.

这篇关于Cassandra分组和过滤结果的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-05 05:44