问题描述
我正在尝试使用以下方法过滤巨大的 C* 表的一小部分:
I'm trying to filter on a small part of a huge C* table by using:
val snapshotsFiltered = sc.parallelize(startDate to endDate).map(TableKey(_)).joinWithCassandraTable("listener","snapshots_tspark")
println("Done Join")
//*******
//get only the snapshots and create rdd temp table
val jsons = snapshotsFiltered.map(_._2.getString("snapshot"))
val jsonSchemaRDD = sqlContext.jsonRDD(jsons)
jsonSchemaRDD.registerTempTable("snapshots_json")
与:
case class TableKey(created: Long) //(created, imei, when)--> created = partititon key | imei, when = clustering key
而 cassandra 表架构是:
And the cassandra table schema is:
CREATE TABLE listener.snapshots_tspark (
created timestamp,
imei text,
when timestamp,
snapshot text,
PRIMARY KEY (created, imei, when) ) WITH CLUSTERING ORDER BY (imei ASC, when ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
问题是在 println 完成后进程冻结,在 spark master ui 上没有错误.
The problem is that the process freezes after the println done with no errors on spark master ui.
[Stage 0:> (0 + 2) / 2]
Join 不能使用时间戳作为分区键吗?为什么会冻结?
Won`t the Join work with timestamp as the partition key? Why it freezes?
推荐答案
通过使用:
sc.parallelize(startDate to endDate)
将 startData 和 endDate 作为 Longs 从日期生成的格式:
With the startData and endDate as Longs generated from Dates by the format:
("yyyy-MM-dd HH:mm:ss")
我使用 spark 来构建一个巨大的数组(100,000 多个对象)以连接 C* 表,它根本没有卡住 - C* 努力使连接发生并返回数据.
I made spark to build a huge array (100,000+ objects) to join with C* table and it didn't stuck at all- C* worked hard to make the join happen and return the data.
最后,我将范围更改为:
Finally, I changed my range to:
case class TableKey(created_dh: String)
val data = Array("2015-10-29 12:00:00", "2015-10-29 13:00:00", "2015-10-29 14:00:00", "2015-10-29 15:00:00")
val snapshotsFiltered = sc.parallelize(data, 2).map(TableKey(_)).joinWithCassandraTable("listener","snapshots_tnew")
现在好了.
这篇关于Spark JoinWithCassandraTable on TimeStamp 分区键 STUCK的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!