The question
There are several types of databases for different purposes, but normally MySQL is used for everything, because it is the best-known database. To give an example: a big-data application at my company started out on a MySQL database, which is hard to believe and will have serious consequences for the company. Why MySQL? Just because no one knows how (and when) another DBMS should be used.
So my question is not about vendors, but about types of databases. Can you give me a practical example of a specific situation (or application) for each type of database where it is highly recommended?
Examples:
• A social network should use type X because of Y.
• MongoDB and CouchDB don't support transactions, so a document DB is not a good fit for a banking or auction application.
And so on…
Relational: MySQL, PostgreSQL, SQLite, Firebird, MariaDB, Oracle Database, SQL Server, IBM DB2, IBM Informix, Teradata
Object: ZODB, DB4O, Eloquera, Versant, Objectivity DB, VelocityDB
Graph databases: AllegroGraph, Neo4j, OrientDB, InfiniteGraph, GraphBase, sparkledb, FlockDB, BrightstarDB
Key-value stores: Amazon DynamoDB, Redis, Riak, Voldemort, FoundationDB, LevelDB, BangDB, KAI, hamsterdb, Tarantool, Maxtable, HyperDex, Genomu, Memcachedb
Column family: Bigtable, HBase, Hypertable, Cassandra, Apache Accumulo
RDF stores: Apache Jena, Sesame
Multi-model databases: ArangoDB, Datomic, OrientDB, FatDB, AlchemyDB
Hierarchical: InterSystems Caché, GT.M (thanks to @Laurent Parenteau)
I found two impressive articles on this subject. All credit goes to highscalability.com; the information in this answer is transcribed from those articles:
If your application needs…
• complex transactions because you can't afford to lose data, or if you would like a simple transaction programming model, then look at a relational or grid database.
• Example: an inventory app that might want full ACID. I was very unhappy when I bought a product and they said later they were out of stock. I did not want a compensated transaction. I wanted my item!
• to scale, then NoSQL or SQL can work. Look for systems that support scale-out, partitioning, live addition and removal of machines, load balancing, automatic sharding and rebalancing, and fault tolerance.
• to always be able to write to a database because you need high availability, then look at Bigtable clones, which feature eventual consistency.
• to handle lots of small continuous reads and writes that may be volatile, then look at document or key-value databases offering fast in-memory access. Also consider SSDs.
• to implement social network operations, then you first may want a graph database or, second, a database like Riak that supports relationships. An in-memory relational database with simple SQL joins might suffice for small data sets. Redis' set and list operations could work too.
• to operate over a wide variety of access patterns and data types, then look at a document database; they are generally flexible and perform well.
• powerful offline reporting with large datasets, then look at Hadoop first, and second at products that support MapReduce. Supporting MapReduce isn't the same as being good at it.
• to span multiple data centers, then look at Bigtable clones and other products that offer a distributed option that can handle long latencies and is partition tolerant.
• to build CRUD apps, then look at a document database; they make it easy to access complex data without joins.
• built-in search, then look at Riak.
• to operate on data structures like lists, sets, queues, and publish-subscribe, then look at Redis. It is useful for distributed locking, capped logs, and a lot more.
• programmer friendliness in the form of programmer-friendly data types like JSON, HTTP, REST, and Javascript, then first look at document databases and then at key-value databases.
• transactions combined with materialized views for real-time data feeds, then look at VoltDB. It is great for data rollups and time windowing.
• enterprise-level support and SLAs, then look for a product that makes a point of catering to that market. Membase is an example.
• to log continuous streams of data that may have no consistency guarantees at all, then look at Bigtable clones, because they generally work on distributed file systems that can handle a lot of writes.
• to be as simple as possible to operate, then look for a hosted or PaaS solution, because they will do all the work for you.
• to be sold to enterprise customers, then consider a relational database, because they are used to relational technology.
• to dynamically build relationships between objects that have dynamic properties, then consider a graph database, because often they will not require a schema and models can be built incrementally through programming.
• to support large media, then look at storage services like S3. NoSQL systems tend not to handle large BLOBs, though MongoDB has a file service.
• to bulk upload a lot of data quickly and efficiently, then look for a product that supports that scenario. Most won't, because they don't support bulk operations.
• an easier upgrade path, then use a fluid-schema system like a document or key-value database, because it supports optional fields, adding fields, and deleting fields without having to build an entire schema-migration framework.
• to implement integrity constraints, then pick a database that supports SQL DDL, implement them in stored procedures, or implement them in application code.
• a very deep join depth, then use a graph database, because they support blisteringly fast navigation between entities.
• to move behavior close to the data so the data doesn't have to be moved over the network, then look at stored procedures of one kind or another. These can be found in relational, grid, document, and even key-value databases.
• to cache or store BLOB data, then look at a key-value store. Caching can be for bits of web pages, or for saving complex objects that were expensive to join in a relational database, for reducing latency, and so on.
• a proven track record, like not corrupting data and just generally working, then pick an established product, and when you hit scaling (or other) walls, use one of the common workarounds (scale-up, tuning, memcached, sharding, denormalization, etc.).
• fluid data types, because your data isn't tabular in nature, or requires a flexible number of columns, or has a complex structure, or varies by user (or whatever), then look at document, key-value, and Bigtable clone databases. Each has a lot of flexibility in its data types.
• other business units to run quick relational queries so you don't have to reimplement everything, then use a database that supports SQL.
• to operate in the cloud and automatically take full advantage of cloud features, then we may not be there yet.
• support for secondary indexes so you can look up data by different keys, then look at relational databases and Cassandra's new secondary index support.
• to create an ever-growing set of data (really BigData) that rarely gets accessed, then look at Bigtable clones, which will spread the data over a distributed file system.
• to integrate with other services, then check if the database provides some sort of write-behind syncing feature, so you can capture database changes and feed them into other systems to ensure consistency.
• fault tolerance, then check how durable writes are in the face of power failures, partitions, and other failure scenarios.
• to push the technological envelope in a direction it seems no one else is going, then build it yourself, because that's what it takes to be great sometimes.
• to work on a mobile platform, then look at CouchDB / Mobile Couchbase.
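The "full ACID" and "integrity constraints in SQL DDL" points above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a recommendation of SQLite for the scenarios discussed; the table and helper names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE inventory (
        item  TEXT PRIMARY KEY,
        stock INTEGER NOT NULL CHECK (stock >= 0)  -- integrity constraint in DDL
    )
""")
conn.execute("INSERT INTO inventory VALUES ('widget', 1)")
conn.commit()

def buy(item):
    """Decrement stock atomically; the whole transaction rolls back on failure."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute(
                "UPDATE inventory SET stock = stock - 1 WHERE item = ?", (item,))
    except sqlite3.IntegrityError:
        return False  # out of stock: the buyer is refused *before* paying
    return True

print(buy("widget"))  # True  - stock goes 1 -> 0
print(buy("widget"))  # False - CHECK (stock >= 0) blocks overselling
```

The `CHECK` constraint plus the transaction is the "I want my item, not a compensated transaction" scenario in miniature: an oversold purchase is rejected atomically instead of being accepted and undone later.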
General use cases (NoSQL)
• Bigness. NoSQL is seen as a key part of a new data stack supporting: big data, big numbers of users, big numbers of computers, big supply chains, big science, and so on. When something becomes so big that it must become massively distributed, NoSQL is there, though not all NoSQL systems are targeting big. Bigness can be across many different dimensions, not just using a lot of disk space.
• Massive write performance. This is probably the canonical usage based on Google's influence. High volume. Facebook needs to store 135 billion messages a month. Twitter, for example, has the problem of storing 7 TB of data per day, with the prospect of this requirement doubling multiple times per year. This is the "data too big to fit on one node" problem. At 80 MB/s it takes a day to store 7 TB, so writes need to be distributed over a cluster, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest. For faster writes, in-memory systems can be used.
• Fast key-value access. This is probably the second most cited virtue of NoSQL in the general mindset. When latency is important, it's hard to beat hashing on a key and reading the value directly from memory, or in as little as one disk seek. Not every NoSQL product is about fast access; some are more about reliability, for example. But what people have wanted for a long time was a better memcached, and many NoSQL systems offer that.
• Flexible schemas and flexible datatypes. NoSQL products support a whole range of new data types, and this is a major area of innovation in NoSQL. We have: column-oriented, graph, advanced data structures, document-oriented, and key-value. Complex objects can be easily stored without a lot of mapping. Developers love avoiding complex schemas and ORM frameworks. Lack of structure allows for much more flexibility. We also have program- and programmer-friendly compatible datatypes like JSON.
• Schema migration. Schemalessness makes it easier to deal with schema migrations without so much worrying. Schemas are in a sense dynamic, because they are imposed by the application at run time, so different parts of an application can have a different view of the schema.
• Write availability. Do your writes need to succeed no matter what? Then we can get into partitioning, CAP, eventual consistency, and all that jazz.
• Easier maintainability, administration, and operations. This is very product specific, but many NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them. They are spending a lot of effort on ease of use, minimal administration, and automated operations. This can lead to lower operations costs, as special code doesn't have to be written to scale a system that was never intended to be used that way.
• No single point of failure. Not every product delivers on this, but we are seeing a definite convergence on relatively easy-to-configure-and-manage high availability with automatic load balancing and cluster sizing. A perfect cloud partner.
• Generally available parallel computing. We are seeing MapReduce baked into products, which makes parallel computing something that will be a normal part of development in the future.
• Programmer ease of use. Accessing your data should be easy. While the relational model is intuitive for end users, like accountants, it's not very intuitive for developers. Programmers grok keys, values, JSON, Javascript stored procedures, HTTP, and so on. NoSQL is for programmers. This is a developer-led coup. The response to a database problem can't always be to hire a really knowledgeable DBA, get your schema right, denormalize a little, etc.; programmers would prefer a system that they can make work for themselves. It shouldn't be so hard to make a product perform. Money is part of the issue: if it costs a lot to scale a product, then won't you go with the cheaper product that you control, that's easier to use, and that's easier to scale?
• Use the right data model for the right problem. Different data models are used to solve different problems. Much effort has been put into, for example, wedging graph operations into a relational model, but it doesn't work. Isn't it better to solve a graph problem in a graph database? We are now seeing a general strategy of trying to find the best fit between a problem and a solution.
• Avoiding hitting the wall. Many projects hit some type of wall in their project. They've exhausted all options to make their system scale or perform properly and are wondering: what next? It's comforting to select a product and an approach that can jump over the wall by linearly scaling using incrementally added resources. At one time this wasn't possible; it took custom-built everything, but that's changed. We are now seeing usable out-of-the-box products that a project can readily adopt.
• Distributed systems support. Not everyone is worried about scale or performance over and above that which can be achieved by non-NoSQL systems. What they need is a distributed system that can span data centers while handling failure scenarios without a hiccup. NoSQL systems, because they have focused on scale, tend to exploit partitions, tend not to use heavy strict consistency protocols, and so are well positioned to operate in distributed scenarios.
• Tunable CAP tradeoffs. NoSQL systems are generally the only products with a "slider" for choosing where they want to land on the CAP spectrum. Relational databases pick strong consistency, which means they can't tolerate a partition failure. In the end, this is a business decision and should be decided on a case-by-case basis. Does your app even care about consistency? Are a few dropped writes OK? Does your app need strong or weak consistency? Is availability more important, or consistency? Will being down be more costly than being wrong? It's nice to have products that give you a choice.
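The "solve a graph problem in a graph database" point comes down to traversal: a friends-of-friends query at depth N is one walk over adjacency data, rather than N self-joins on a relational table. A minimal sketch of that traversal in Python (the people and edges are made up for illustration):

```python
from collections import deque

# Toy social graph as adjacency lists; a graph database stores and
# navigates this shape natively instead of reconstructing it via joins.
edges = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  ["frank"],
    "erin":  [],
    "frank": [],
}

def within_depth(start, max_depth):
    """Everyone reachable from `start` in at most `max_depth` hops (BFS)."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(start)
    return sorted(seen)

print(within_depth("alice", 2))  # ['bob', 'carol', 'dave', 'erin']
```

Each extra hop here is one more ring of the breadth-first search; in SQL the same question at depth 4 or 5 would mean joining the friendship table to itself that many times, which is the "very deep join depth" case above.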
More specific use cases
• Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, clickstreams, etc.
• Syncing online and offline data. This is a niche CouchDB has targeted.
• Fast response times under all loads.
• Avoiding heavy joins when the query load for complex joins becomes too large for an RDBMS.
• Soft real-time systems where low latency is critical. Games are one example.
• Applications where a wide variety of different write, read, query, and consistency patterns need to be supported. There are systems optimized for 50% reads / 50% writes, 95% writes, or 95% reads. Read-only applications needing extreme speed and resiliency, simple queries, and that can tolerate slightly stale data. Applications requiring moderate performance, read/write access, simple queries, and completely authoritative data. Read-only applications with complex query requirements.
• Load balancing to accommodate data and usage concentrations and to help keep microprocessors busy.
• Real-time inserts, updates, and queries.
• Hierarchical data like threaded discussions and parts explosions.
• Dynamic table creation.
• Two-tier applications where low-latency data is made available through a fast NoSQL interface, but the data itself can be calculated and updated by high-latency Hadoop apps or other low-priority apps.
• Sequential data reading. The right underlying data storage model needs to be selected. A B-tree may not be the best model for sequential reads.
• Slicing off part of a service that may need better performance/scalability onto its own system. For example, user logins may need to be high performance, and this feature could use a dedicated service to meet those goals.
• Caching. A high-performance caching tier for websites and other applications. An example is a cache for the Data Aggregation System used by the Large Hadron Collider.
• Voting.
• Real-time page view counters.
• User registration, profile, and session data.
• Document, catalog management, and content management systems. These are facilitated by the ability to store complex documents as a whole rather than organized as relational tables. Similar logic applies to inventory, shopping carts, and other structured data types.
• Archiving. Storing a large continual stream of data that is still accessible online. Document-oriented databases with a flexible schema can handle schema changes over time.
• Analytics. Use MapReduce, Hive, or Pig to perform analytical queries on scale-out systems that support high write loads.
• Working with heterogeneous types of data, for example, different media types at a generic level.
• Embedded systems. They don't want the overhead of SQL and servers, so they use something simpler for storage.
• A "market" game, where you own buildings in a town. You want someone's building list to pop up quickly, so you partition on the owner column of the building table so that the select is single-partition. But when someone buys a building from someone else, you update the owner column along with the price.
• JPL is using SimpleDB to store rover plan attributes. References are kept to a full plan blob in S3.
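Several items above (flexible schemas, archiving with schema changes over time, heterogeneous data at a generic level) rest on the same idea: documents in one collection need not share columns, and structure is imposed by the application at run time. A toy sketch, with an in-memory list standing in for a document store and made-up record shapes:

```python
import json

collection = []

def insert(doc):
    # Round-trip through JSON to mimic a document store's wire format.
    collection.append(json.loads(json.dumps(doc)))

insert({"type": "user", "name": "alice", "email": "a@example.com"})
insert({"type": "user", "name": "bob"})                       # no email field
insert({"type": "event", "name": "login", "ts": 1700000000})  # different shape

# "Schema migration" is just the application tolerating both shapes:
emails = [d.get("email", "<none>") for d in collection if d["type"] == "user"]
print(emails)  # ['a@example.com', '<none>']
```

Adding a field later means new documents simply carry it; old documents are read with a default, so no table-wide ALTER or migration framework is needed.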
• Federal law enforcement agencies tracking Americans in real-time using credit cards, loyalty cards and travel reservations.
• Fraud detection by comparing transactions to known patterns in real-time.
• Helping diagnose the typology of tumors by integrating the history of every patient.
• In-memory database for high update situations, like a website that displays everyone's "last active" time (for chat, maybe). If users are performing some activity once every 30 sec, then you will pretty much be at your limit with about 5000 simultaneous users.
• Handling lower-frequency multi-partition queries using materialized views while continuing to process high-frequency streaming data.
• Priority queues.
• Running calculations on cached data, using a program friendly interface, without having to go through an ORM.
• Uniq a large dataset using simple key-value columns.
• To keep querying fast, values can be rolled-up into different time slices.
• Computing the intersection of two massive sets, where a join would be too slow.
• A timeline ala Twitter.
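The "roll up values into different time slices" point above can be sketched as bucketed counters: raw events are folded into fixed-width time slots as they arrive, so a query reads one counter instead of scanning raw events. The bucket width and key names here are illustrative:

```python
from collections import defaultdict

BUCKET = 60  # seconds per slice (one-minute roll-up, chosen for illustration)
counters = defaultdict(int)

def record_view(page, ts):
    """Fold a raw page-view event into its (page, time-slice) counter."""
    counters[(page, ts // BUCKET)] += 1

for ts in (0, 10, 59, 61, 125):
    record_view("/home", ts)

# Query a slice instead of the raw event stream:
print(counters[("/home", 0)])  # 3 views in minute 0
print(counters[("/home", 1)])  # 1 view in minute 1
```

The same shape works for the real-time page-view counters mentioned earlier; coarser slices (hour, day) can be derived from finer ones to keep long-range queries fast.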
Redis use cases, VoltDB use cases, and more can be found here.