Cassandra查询次要索引非常慢

Cassandra查询次要索引非常慢

本文介绍了Cassandra查询次要索引非常慢的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们有一个大约40k行的表,查询次要索引缓慢(生产30秒)。我们的cassandra是1.2.8。表模式如下:

We have a table with about 40k rows, querying on secondary index is slow(30 seconds on production). Our cassandra is 1.2.8. The table schema is as following:

CREATE TABLE usertask (
  tid uuid PRIMARY KEY,
  content text,
  ts int
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};

CREATE INDEX usertask_ts_idx ON usertask (ts);

当我开启跟踪时,我注意到有很多行如下:

When I turn on tracing, I notice there is a lot of lines like the following:

Executing single-partition query on usertask.usertask_ts_idx

只有40k行,看起来像对usertask_ts_idx有几千个查询。可能是什么问题?感谢

With only 40k rows, it looks like there are some thousands of query on usertask_ts_idx. What could be the problem? Thanks

我在我们的测试服务器上尝试同样的查询,在测试服务器上1-2秒)。在比较跟踪日志之后,差别是在寻求分割数据文件中的索引部分所花费的时间。在我们的生产中,每个寻找需要1000-3000微秒,在开发服务器上需要100微秒。我想我们的生产服务器没有足够的内存来缓存数据文件,所以在搜索数据文件时速度很慢。

I try the same query on our test server, it is much faster(30 seconds on prod, 1-2 seconds on test server). After comparing the tracing log, the difference is the time spending at seeking to partition indexed section in data file. On our production it takes 1000-3000 micro sec for each seek, on dev server it takes 100 micro seconds. I guess our production server has not enough memory to cache the data file so it is slow at seeking in data file.

推荐答案

假设 ts 是一个时间戳,在这种情况下,这不是一个很好的候选二级索引。原因是它是一个高基数值(即所有值基本上是唯一的)。这意味着,对于 usertask 中的每一行,索引中几乎只有一行 - 有效地产生了一个联接操作。连接在分布式数据库上非常慢。因为你没有显示你的查询,我不知道你在做什么,但如果你想基于时间查询你需要重新思考你的模型。

I am presuming ts is a timestamp, in which case this is not a good candidate for a secondary index. The reason is that it's a high cardinality value (i.e. all values are essentially unique). This means you'll end up with almost one row in the index for each row in usertask--effectively resulting in a join operation. Joins are terribly slow on a distributed database. Since you haven't shown your query I'm not sure exactly what you're doing, but you'll need to rethink your model if you want to query based on time.

这篇关于Cassandra查询次要索引非常慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-18 18:39