问题描述
我正在尝试在Cassandra上发出cqlsh请求。这个想法是分析来自基因组中注释的不同基因寄存器的平均长度。输入数据后,我有以下athaliana.tab表:
I am trying to make a cqlsh request on Cassandra. The idea is analyse the average length from different 'gene' registers that had been annotated in a genome. After enter the data, I have this athaliana.tab table:
id | chr | comments | end | orf | sense | start | type
--------------------------------------+-----+-------------------+----------+-----+-------+----------+------
d2ab2520-6734-11e5-955c-234085c1edec | 1 | gene_id AT1G16340 | 5590338 | 0 | - | 5590241 | CDS
d4169c00-6734-11e5-955c-234085c1edec | 1 | gene_id AT1G16610 | 5676495 | . | - | 5676429 | exon
a8c792c0-6734-11e5-955c-234085c1edec | 1 | gene_id AT1G07485 | 2301889 | 0 | + | 2301665 | CDS
3bd5c0a0-6735-11e5-955c-234085c1edec | 1 | gene_id AT1G51980 | 19326916 | . | - | 19326733 | exon
263b5b60-6735-11e5-955c-234085c1edec | 1 | gene_id AT1G44990 | 17007808 | . | - | 17007542 | gene
67989a50-6735-11e5-955c-234085c1edec | 1 | gene_id AT1G63110 | 23405144 | . | + | 23404821 | UTR
26f7f4a0-6735-11e5-955c-234085c1edec | 1 | gene_id AT1G45180 | 17101207 | 0 | + | 17101109 | CDS
3743dc70-6735-11e5-955c-234085c1edec | 1 | gene_id AT1G50840 | 18841644 | 0 | + | 18840965 | CDS
e5099940-6734-11e5-955c-234085c1edec | 1 | gene_id AT1G20620 | 7145780 | . | + | 7145691 | exon
2ba30620-6735-11e5-955c-234085c1edec | 1 | gene_id AT1G48180 | 17793717 | . | - | 17792449 | gene
这个想法是获得最终起始列数据之间的差值并计算平均值。我已经尝试过这种方式:
The idea is obtain the subtraction between end-start columns data and calculate the average. I have tried by this way:
SELECT avg(end-start) FROM athaliana.tab WHERE chr = '1' AND type = 'gene';
,但结果为:
SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="line 1:14 no viable alternative at input '-' (SELECT avg([end]-...)">
有些想法?任何帮助将不胜感激。
Some ideas? Any help will be really appreciated.
推荐答案
Cassandra旨在实现最大的读写效率,并且不内置支持应用于多个列的函数,例如 plus
或减
也不用于聚集,即适用于多行的函数,例如 avg
或 count
。
Cassandra is designed for maximum read and write efficiency and has no built-in support for functions applying to multiple columns such as plus
or minus
nor for aggregates, i.e., functions applying to multiple rows, such as avg
or count
.
直到最近,解决方法是将函数的结果存储在
Until recently, the workaround was to store the results of your functions in a separate column and perform aggregations in your own application code.
但是,Cassandra 2.2和3.0现在提供了用户定义的函数(UDF)和用户定义的聚合(UDA)。您现在可以定义自己的f功能,然后让Cassandra为您调用它们。请参见和或。
However, Cassandra 2.2 and 3.0 now offer User-Defined Functions (UDFs) and User-Defined Aggregates (UDAs). You can now define your own functions, then have Cassandra invoke them for you. See the CQL spec chapters on UDFs and UDAs or this blog post.
首先,创建以下函数和聚合:
First, create the following functions and aggregates:
USE athaliana;
CREATE FUNCTION minus (x int, y int) RETURNS NULL ON NULL INPUT RETURNS int LANGUAGE java AS 'return x-y;';
CREATE FUNCTION avgState (state tuple<int,bigint>, val int) CALLED ON NULL INPUT RETURNS tuple<int,bigint> LANGUAGE java AS '
if (val != null) {
state.setInt(0, state.getInt(0)+1);
state.setLong(1, state.getLong(1)+val.intValue());
}
return state;
';
CREATE FUNCTION avgFinal (state tuple<int,bigint>) CALLED ON NULL INPUT RETURNS double LANGUAGE java AS '
double r = 0;
if (state.getInt(0) == 0) return null;
r = state.getLong(1);
r /= state.getInt(0);
return Double.valueOf(r);
';
CREATE AGGREGATE avg(int) SFUNC avgState STYPE tuple<int,bigint> FINALFUNC avgFinal INITCOND (0, 0);
然后可以用以下方式重写查询:
Then your query could be rewritten this way:
SELECT avg(minus(end,start)) FROM athaliana.tab WHERE chr = '1' AND type = 'gene';
请注意,只应在单个分区键上使用聚合;如果不是这种情况,则会发出警告。
Note that you should only use aggregates on a single partition key; if this is not the case, a warning will be raised.
这篇关于通过cqlsh请求对Cassandra执行减法的最佳方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!