Problem Description
NoSQL databases like Couchbase do hold a lot of documents in memory, which is the source of their enormous speed, but this also places greater demands on the memory size of the server(s) they run on.
I'm looking for the best strategy among several opposing strategies for storing documents in a NoSQL database. They are:
- Optimise for speed
Putting the whole information into one (big) document has the advantage that with a single GET the information can be retrieved from memory, or from disk if it was purged from memory before. With schema-less NoSQL databases this is almost the desired approach. But eventually the document becomes too big and eats up a lot of memory, so fewer documents can be kept in memory in total.
- Optimise for memory
Splitting everything up into several documents (e.g. using compound keys, as described in this question: Designing record keys for document-oriented database - best practice), especially when those documents only hold the information needed in a specific read/update operation, would allow more (transient) documents to be held in memory.
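As an illustration of this memory-optimised variant, here is a minimal sketch of splitting one big customer document into purpose-specific documents under compound keys. It assumes the Couchbase Python SDK 2.x (`Bucket.upsert`/`Bucket.get`); the host, bucket name and field names are made up for the example, not taken from the question.

```python
from couchbase.bucket import Bucket

# Connect to a single bucket (hypothetical host/bucket name).
bucket = Bucket('couchbase://localhost/cdr')

msisdn = '6591234567'  # hypothetical mobile number

# Speed-optimised: everything in one document, retrieved with one GET.
bucket.upsert(msisdn, {
    'profile': {'age': 31, 'gender': 'f', 'name': 'Aisyah'},
    'revenue': {'usage_count': 182, 'total_revenue': 42.50},
    'optin':   {'events': [{'in': '2014-01-03'}]},
})
whole = bucket.get(msisdn).value  # one GET fetches all of it

# Memory-optimised: split into purpose-specific documents with compound keys.
bucket.upsert(msisdn + ':profile', {'age': 31, 'gender': 'f', 'name': 'Aisyah'})
bucket.upsert(msisdn + ':revenue', {'usage_count': 182, 'total_revenue': 42.50})
bucket.upsert(msisdn + ':optin',   {'events': [{'in': '2014-01-03'}]})

# A read/update step now touches only the small document it needs,
# so more (hot) documents fit into the same amount of RAM.
revenue = bucket.get(msisdn + ':revenue').value
```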
The use case I'm looking at is Call Detail Records (CDRs) from telecommunication providers. These CDRs typically run into the hundreds of millions per day. Yet many of these customers don't generate a single record on any given day (I'm looking at the South-East Asian market, with its prepaid dominance and still relatively low data saturation). That means a large number of documents typically see a read/update maybe every other day, and only a small percentage go through several read/update cycles per day.
One solution that was suggested to me is to build two buckets, with more RAM allocated to the bucket holding the more transient documents and less RAM to the second bucket holding the bigger documents. That would allow faster access to the transient data and slower access to the bigger documents, which e.g. hold profile/user information that hardly changes at all. I do see two downsides to this proposal, though: one is that you can't build a view (Map/Reduce) across two buckets (this is specific to Couchbase; other NoSQL solutions might allow it), and the second is the added overhead of closely managing the balance between the memory allocations of the two buckets as the user base grows.
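For concreteness, a minimal sketch of the two-bucket variant, again assuming the Couchbase Python SDK 2.x. The bucket names (`hot`, `cold`) are hypothetical, and their RAM quotas would be configured on the cluster (UI or REST API), not in this code.

```python
from couchbase.bucket import Bucket

# Two buckets with different RAM quotas, set on the cluster:
# 'hot' gets most of the RAM, 'cold' only a little.
hot = Bucket('couchbase://localhost/hot')    # transient counter/revenue docs
cold = Bucket('couchbase://localhost/cold')  # rarely changing profile docs

msisdn = '6591234567'
hot.upsert(msisdn + ':revenue', {'usage_count': 1, 'total_revenue': 0.05})
cold.upsert(msisdn + ':profile', {'age': 31, 'gender': 'f'})

# Caveat from the question: a Map/Reduce view only sees one bucket,
# so no single view can combine 'hot' and 'cold' documents.
```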
Has anyone else been challenged by this, and what was your solution to the problem? What would be the best strategy from your POV, and why? Clearly it must be something in the middle of both strategies; having only one big document, or splitting it up into hundreds of documents, can't be the ideal solution, IMO.
EDIT 2014-09-14: OK, this comes close to answering my own question, but in the absence of any offered solutions so far, and following a comment, here is a bit more background on how I now plan to organise my data, trying to hit a sweet spot between speed and memory consumption:
Mobile_No:Profile
- this holds profile information from a table, not directly from a CDR. Less transient data goes in here, like age, gender and name. The key is a compound key consisting of the mobile number (MSISDN) and the word profile, separated by a ":"
Mobile_No:Revenue
- this holds transient information, like usage counters and fields accumulating the total revenue the customer has spent. The key is again a compound key consisting of the mobile number (MSISDN) and the word revenue, separated by a ":"
Mobile_No:Optin
- this holds semi-transient information about when a customer opted into the program and when he/she opted out of it again. This can happen several times and is handled via an array. The key is again a compound key consisting of the mobile number (MSISDN) and the word optin, separated by a ":"
Connection_Id
- this holds information about a specific A/B connection (sender/receiver) that was made via a voice or video call or SMS/MMS. The key consists of the two mobile_no's concatenated.
Before these changes to the document structure I was putting all the profile, revenue and optin information into one big document, always keeping the connection_id as a separate document. This new storage strategy should give me a better compromise between speed and memory consumption, as I split the main document into several documents so that each of them holds only the information that is read/updated in a single step of the app.
This also takes care of the different rates of change over time, with some data being very transient (like the counters and the accumulating revenue field that gets updated with every incoming CDR) and the profile information being mostly unchanged. I do hope this gives a better understanding of what I'm trying to achieve; comments and feedback are more than welcome.
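To make the planned layout concrete, here is a hedged sketch of a typical per-CDR update against the split documents, again assuming the Couchbase Python SDK 2.x (`get`/`replace`/`insert` with CAS). The function name, key scheme details and field names are illustrative, not from the question.

```python
from couchbase.bucket import Bucket
from couchbase.exceptions import NotFoundError

bucket = Bucket('couchbase://localhost/cdr')  # hypothetical bucket

def record_cdr(a_msisdn, b_msisdn, revenue, cdr_doc):
    """Apply one incoming CDR to the split document layout."""
    # Connection document, keyed on the concatenated A/B numbers.
    bucket.upsert(a_msisdn + b_msisdn, cdr_doc)

    # Read-modify-write of the small, hot revenue document. The CAS
    # value guards against a concurrent update for the same subscriber;
    # a production version would retry on a CAS mismatch
    # (couchbase.exceptions.KeyExistsError).
    rev_key = a_msisdn + ':revenue'
    try:
        rv = bucket.get(rev_key)
        doc = rv.value
        doc['usage_count'] += 1
        doc['total_revenue'] += revenue
        bucket.replace(rev_key, doc, cas=rv.cas)
    except NotFoundError:
        bucket.insert(rev_key, {'usage_count': 1, 'total_revenue': revenue})

    # The profile and optin documents are not touched on this hot path,
    # which is the point of the split.

record_cdr('6591234567', '6597654321', 0.05,
           {'type': 'voice', 'duration_s': 42})
```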
Recommended Answer
Thank you for updating your original question. You are correct when you talk about finding the right balance between coarse-grained and fine-grained documents.
The final architecture of the documents really falls out of your particular business domain's needs. You have to identify the "chunks" of data that your use cases need as a whole, and then base the shape of your stored documents on that. Here are some high-level steps to perform when designing your document structure:
- Identify all the document-consumption use cases of your app/service (reads, read-writes, searchable items).
- Design your documents (most likely you will end up with several smaller documents rather than one big document holding everything).
- Design your document keys so that different document types can co-exist in one bucket (e.g. use namespacing in the key values).
- Do a "test drive" of the resulting model against your use cases, to check that the reads/writes are optimised and that each transaction touches only the document data it needs.
- Run performance tests for your use cases (try to simulate at least 2x the expected load).
Note: When you design the different docs it's OK to have some redundancy (remember it's not an RDBMS in normalized form); think of it more as object-oriented design.
Note 2: If you have searchable items outside of your keys (e.g. searching customers by last name "starts with" plus some other dynamic search criteria), consider using the ElasticSearch integration with CB, or you can also try the N1QL query language that is coming with CB 3.0.
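As a hedged sketch of the kind of dynamic search Note 2 refers to: this assumes a later Couchbase Python SDK 2.x with `Bucket.n1ql_query` (N1QL was only a developer preview around CB 3.0) and a primary index on the bucket; the bucket and field names are illustrative.

```python
from couchbase.bucket import Bucket
from couchbase.n1ql import N1QLQuery

bucket = Bucket('couchbase://localhost/cdr')

# A "last name starts with" search that document keys alone can't answer.
# Requires an index, e.g. created via: CREATE PRIMARY INDEX ON `cdr`;
q = N1QLQuery(
    "SELECT META(p).id, p.name, p.age "
    "FROM `cdr` p "
    "WHERE META(p).id LIKE '%:profile' AND p.name LIKE 'Sm%'")
for row in bucket.n1ql_query(q):
    print(row)
```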
It seems that you're going in the right direction by splitting into several smaller documents all linked by an MSISDN, e.g.: MSISDN:profile, MSISDN:revenue, MSISDN:optin. I would pay special attention to your last document type, the "A/B" connection. It sounds like it might be generated in large volume and be transient in nature... so you have to find out how long these documents have to live in the Couchbase bucket. You can specify a TTL (time to live) so that old docs are automatically cleared out.
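As a hedged illustration of the TTL suggestion, assuming the Couchbase Python SDK 2.x, where `upsert` and `touch` accept an expiry via the `ttl` argument (in seconds); the 30-day lifetime is an arbitrary example value.

```python
from couchbase.bucket import Bucket

bucket = Bucket('couchbase://localhost/cdr')

THIRTY_DAYS = 30 * 24 * 60 * 60  # expiry in seconds (example value)

# Connection documents are high-volume and transient, so give each one
# a TTL; Couchbase purges the document automatically once it expires.
a, b = '6591234567', '6597654321'
bucket.upsert(a + b, {'type': 'voice', 'duration_s': 42}, ttl=THIRTY_DAYS)

# touch() can extend the lifetime of an existing document.
bucket.touch(a + b, ttl=THIRTY_DAYS)
```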