Problem description
And what are the pitfalls to avoid? Are there any deal breakers for you? E.g., I've heard that exporting/importing Cassandra data is very difficult, which makes me wonder whether that's going to hinder syncing production data to the development environment.
BTW, it's very hard to find good tutorials on Cassandra; the only one I have, http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-model, is still pretty basic.
Thanks.
Recommended answer
For me, the main thing is deciding whether to use the OrderedPartitioner or the RandomPartitioner.
If you use the RandomPartitioner, range scans are not possible. This means that you must know the exact key for any activity, INCLUDING CLEANING UP OLD DATA.
So if you've got a lot of churn, then unless you have some magic way of knowing exactly which keys you've inserted stuff for, with the random partitioner you can easily "lose" stuff, which causes a disk space leak and will eventually consume all storage.
On the other hand, you can ask the ordered partitioner "what keys do I have in Column Family X between A and B?" and it'll tell you. You can then clean them up.
However, there is a downside as well. As Cassandra doesn't do automatic load balancing, if you use the ordered partitioner, in all likelihood all your data will end up in just one or two nodes and none in the others, which means you'll waste resources.
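To make the imbalance concrete, here is a rough Python sketch. It does not use Cassandra at all: the four "nodes", the split points in `BOUNDARIES`, and the `user:NNNN` key format are all made up for illustration, and taking an MD5 hash modulo the node count is only a simplified stand-in for how a random partitioner spreads keys around the ring.

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical split points: node 0 owns keys < "g", node 1 owns "g".."n",
# node 2 owns "n".."t", node 3 owns everything from "t" upwards.
BOUNDARIES = ["g", "n", "t"]


def ordered_node(key):
    """Pick a node by where the raw key sorts (order-preserving placement)."""
    return bisect.bisect_right(BOUNDARIES, key)


def random_node(key, num_nodes=4):
    """Pick a node from a hash of the key (simplified random placement)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


# Keys that share a common prefix, as application keys often do.
keys = [f"user:{i:04d}" for i in range(1000)]

ordered_load = Counter(ordered_node(k) for k in keys)
random_load = Counter(random_node(k) for k in keys)

print("ordered:", dict(ordered_load))
print("random: ", dict(random_load))
```

With these made-up split points every `user:` key sorts after "t", so the ordered placement puts all 1000 keys on one node, while the hashed placement spreads them out.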
I don't have any easy answer for this, except that you can get the "best of both worlds" in some cases by putting a short hash value (of something you can enumerate easily from other data sources) at the beginning of your keys, for example a 16-bit hash of the user ID written as 4 hex digits, followed by whatever key you really wanted to use.
Then if you had a list of recently-deleted users, you can just hash their IDs and range scan to clean up anything related to them.
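A minimal Python sketch of that key scheme, under some assumptions of my own: MD5 is used as the hash (any stable hash would do), the "real" key is assumed to be the user ID plus a `:`-separated suffix, and user IDs are assumed not to contain `:`. The `range_scan` function is just a sorted list standing in for an ordered-partitioner range query; it is not a Cassandra client call.

```python
import bisect
import hashlib


def user_prefix(user_id):
    """First 16 bits of a hash of the user ID, as 4 hex digits."""
    return hashlib.md5(user_id.encode("utf-8")).hexdigest()[:4]


def make_key(user_id, suffix):
    """Hash prefix first, then the key you really wanted to use."""
    return user_prefix(user_id) + user_id + ":" + suffix


def user_key_range(user_id):
    """Half-open range [start, end) covering every key of one user."""
    start = user_prefix(user_id) + user_id + ":"
    end = start[:-1] + ";"  # ';' is the character right after ':'
    return start, end


def range_scan(sorted_keys, start, end):
    """In-memory stand-in for 'what keys do I have between A and B?'."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


# Clean-up for a deleted user: hash the ID, scan the range, delete the hits.
store = sorted([
    make_key("alice", "profile"),
    make_key("alice", "inbox"),
    make_key("alice2", "profile"),
    make_key("bob", "profile"),
])
start, end = user_key_range("alice")
to_delete = range_scan(store, start, end)
```

Because a 16-bit prefix will collide between users, the range here includes the user ID as well as the prefix, so the scan only returns that one user's keys.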
And application bugs may leave orphaned keys that you've forgotten about, and you'll have no way of easily detecting them, unless you write some garbage collector which periodically scans every single key in the db (this is going to take a while - but you can do it in chunks) to check for ones which aren't needed any more.
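The chunked scan could look something like the sketch below. Again this is only an illustration: `scan_chunk` is a hypothetical helper working on an in-memory sorted list of unique keys rather than a real range query, and `is_needed` is whatever application-specific check tells you a key is still referenced.

```python
import bisect


def scan_chunk(sorted_keys, start_key, limit):
    """Return up to `limit` keys >= start_key (stand-in for a range query)."""
    lo = bisect.bisect_left(sorted_keys, start_key)
    return sorted_keys[lo:lo + limit]


def find_orphans(sorted_keys, is_needed, chunk_size=1000):
    """Walk the whole key space chunk by chunk and collect unneeded keys."""
    orphans = []
    start = ""  # the empty string sorts before every other key
    while True:
        chunk = scan_chunk(sorted_keys, start, chunk_size)
        if not chunk:
            break
        orphans.extend(k for k in chunk if not is_needed(k))
        if len(chunk) < chunk_size:
            break  # a short chunk means we reached the end
        # Resume just after the last key seen: appending "\x00" gives the
        # smallest string that sorts strictly after it.
        start = chunk[-1] + "\x00"
    return orphans


# Example: pretend every key starting with "b" belongs to a deleted user.
all_keys = ["a1", "a2", "b1", "b2", "c1"]
orphans = find_orphans(all_keys, lambda k: not k.startswith("b"), chunk_size=2)
```

Since each iteration only needs the last key seen, the scan can be paused and resumed, which is what makes doing it in chunks practical.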
None of this is based on real usage, just what I've figured out during research. We don't use Cassandra in production.
EDIT: Cassandra now does have secondary indexes in trunk.