本文介绍了Cassandra - 重叠数据范围的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在 Cassandra 中有以下任务"表.

I have the following 'Tasks' table in Cassandra.

  • Task_ID UUID - 分区键
  • Starts_On TIMESTAMP - 聚类列
  • Ends_On TIMESTAMP - 聚类列

我想运行 CQL 查询以获取给定日期范围内的重叠任务.例如,如果我将两个时间戳(T1 和 T2)作为参数传递给查询,我想获取该范围内适用的所有任务(即重叠记录).

I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass in two timestamps (T1 and T2) as parameters to the query, I want to get the all tasks that are applicable with in that range (that is, overlapping records).

在 Cassandra 中执行此操作的最佳方法是什么?我不能在这里只在 Starts_On 和 Ends_On 上使用两个范围,因为要向 Ends_On 添加范围查询,我必须对 Starts_On 进行相等性检查.

What is the best way to do this in Cassandra? I cannot just use two ranges on Starts_On and Ends_On here because to add a range query to Ends_On, I have to have a equality check for Starts_On.

推荐答案

这是另一个想法(有点非常规).您可以创建一个用户定义的函数来实现第二个范围过滤器(在 Cassandra 2.2 和更新版本中).

Here's another idea (somewhat unconventional). You could create a user defined function to implement the second range filter (in Cassandra 2.2 and newer).

假设你像这样定义你的表(用整数而不是时间戳显示以保持示例简单):

Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):

CREATE TABLE tasks (
    p int,
    task_id timeuuid,
    start int,
    end int,
    end_range int static,
    PRIMARY KEY(p, start));

现在我们创建一个用户定义的函数,根据结束时间检查返回的行,并返回匹配行的task_id,如下所示:

Now we create a user defined function to check returned rows based on the end time, and return the task_id of matching rows, like this:

CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int)
    CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS
    'if (end <= end_range) return task_id; else return null;';

现在我在第三个参数中使用了一个技巧.在明显的(主要?)疏忽中,您似乎无法将常量传递给用户定义的函数.因此,为了解决这个问题,我们传递了一个静态列 (end_range) 作为我们的常量.

Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, it appears you can't pass a constant to a user defined function. So to work around that, we pass a static column (end_range) as our constant.

所以首先我们必须设置我们想要的 end_range:

So first we have to set the end_range we want:

UPDATE tasks SET end_range=15 where p=1;

假设我们有这些数据:

SELECT * FROM tasks;

 p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
 1 |     1 |        15 |   5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
 1 |     2 |        15 |   7 | 3233a040-4a88-11e5-a180-433e07a8bafb
 1 |     4 |        15 |  22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
 1 |     8 |        15 |  15 | 37ec7840-4a88-11e5-a180-433e07a8bafb

现在让我们获取 start >= 2 和 end

Now let's get the task_id's that have start >= 2 and end <= 15:

SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks
    WHERE p=1 AND start >= 2;

 start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
     2 |   7 |       3233a040-4a88-11e5-a180-433e07a8bafb
     4 |  22 |                                       null
     8 |  15 |       37ec7840-4a88-11e5-a180-433e07a8bafb

因此,这为您提供了匹配的 task_id,而您必须忽略空行(我还没有想出使用 UDF 删除行的方法).您会注意到 start >= 2 的过滤器在将其传递给 UDF 之前删除了一行.

So that gives you the matching task_id's and you have to ignore the null rows (I haven't figured out a way to drop rows using UDF's). You'll note that the filter of start >= 2 dropped one row before passing it to the UDF.

无论如何显然不是一个完美的方法,但它可能是您可以使用的方法.:)

Anyway not a perfect method obviously, but it might be something you can work with. :)

这篇关于Cassandra - 重叠数据范围的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-02 15:45