Problem description
My Cassandra schema contains a table whose partition key is a timestamp and whose clustering key is a parameter column.
Each partition contains 10k+ rows. Data is logged at a rate of one partition per second.
On the other hand, users can define "datasets", and I have another table whose partition key is the dataset name and whose clustering column is a timestamp referring to the other table (so a "dataset" is a list of partition keys).
Of course, what I would like to do looks like an anti-pattern for Cassandra, as I'd like to join two tables.
However, using Spark SQL I can run such a query and perform the JOIN:
SELECT * from datasets JOIN data
WHERE data.timestamp = datasets.timestamp AND datasets.name = 'my_dataset'
Now the question is: is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?
Recommended answer
EDIT: fixed the answer with regard to join optimization.
Is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?
No. Since you provide the partition key for the datasets table, the Spark/Cassandra connector will perform predicate pushdown for that restriction and execute it directly in Cassandra with CQL. But there will be no predicate pushdown for the join operation itself, so the data table is not limited to the matching partitions, unless you use the RDD API with joinWithCassandraTable().
See here for all possible predicate pushdown situations: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/BasicCassandraPredicatePushDown.scala
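The joinWithCassandraTable() route mentioned above can be sketched roughly as follows. This is only an illustration under several assumptions that are not in the question: the keyspace name my_keyspace, the connection host, the DataKey case class, and the mapping of the timestamp column to java.util.Date are all placeholders, and the job is assumed to be launched with spark-submit with the Spark Cassandra connector on the classpath.

```scala
import java.util.Date

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical key class: its field name matches the partition key
// column ("timestamp") of the data table, so the connector can map it.
case class DataKey(timestamp: Date)

object DatasetJoin {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("dataset-join")
      // Assumption: a Cassandra node is reachable on localhost.
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Assumption: both tables live in a keyspace named "my_keyspace".
    val keyspace = "my_keyspace"

    // Step 1: read the single partition of datasets for 'my_dataset'.
    // The restriction on the partition key is sent to Cassandra as CQL.
    val keys = sc.cassandraTable[DataKey](keyspace, "datasets")
      .select("timestamp")
      .where("name = ?", "my_dataset")

    // Step 2: for each timestamp, query only the matching partition of
    // data instead of scanning the whole table.
    val joined = keys.joinWithCassandraTable(keyspace, "data")

    // Each element is a pair (DataKey, CassandraRow).
    joined.take(10).foreach { case (key, row) =>
      println(s"${key.timestamp} -> $row")
    }

    sc.stop()
  }
}
```

With this shape, the data table is accessed by partition key for each timestamp in the dataset, which is the behaviour the plain Spark SQL JOIN does not give you.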