When and why are database joins expensive?

Problem description


I'm doing some research into databases and I'm looking at some limitations of relational DBs.

I gather that joins of large tables are very expensive, but I'm not completely sure why. What does the DBMS need to do to execute a join operation, and where is the bottleneck?
How can denormalization help to overcome this expense? How do other optimization techniques (indexing, for example) help?

Personal experiences are welcome! If you're going to post links to resources, please avoid Wikipedia. I know where to find that already.

In relation to this, I'm wondering about the denormalized approaches used by cloud service databases like BigTable and SimpleDB. See this question.

Solution

Denormalising to improve performance? It sounds convincing, but it doesn't hold water.

Chris Date, who in company with Dr Ted Codd was the original proponent of the relational data model, ran out of patience with misinformed arguments against normalisation and systematically demolished them using scientific method: he got large databases and tested these assertions.

I think he wrote it up in Relational Database Writings 1988-1991 but this book was later rolled into edition six of Introduction to Database Systems, which is the definitive text on database theory and design, in its eighth edition as I write and likely to remain in print for decades to come. Chris Date was an expert in this field when most of us were still running around barefoot.

He found that:

  • Some of them hold for special cases
  • All of them fail to pay off for general use
  • All of them are significantly worse for other special cases

It all comes back to mitigating the size of the working set. Joins involving properly selected keys with correctly set up indexes are cheap, not expensive, because they allow significant pruning of the result before the rows are materialised.

Materialising the result involves bulk disk reads which are the most expensive aspect of the exercise by an order of magnitude. Performing a join, by contrast, logically requires retrieval of only the keys. In practice, not even the key values are fetched: the key hash values are used for join comparisons, mitigating the cost of multi-column joins and radically reducing the cost of joins involving string comparisons. Not only will vastly more fit in cache, there's a lot less disk reading to do.

Moreover, a good optimiser will choose the most restrictive condition and apply it before it performs a join, very effectively leveraging the high selectivity of joins on indexes with high cardinality.
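
To make that concrete, here is a minimal sketch. The table, column and index names are hypothetical (none of them appear in the original answer); the point is that a selective predicate on an indexed, high-cardinality column lets the engine seek a handful of keys and probe the join index with just those, instead of materialising either table.

    -- Hypothetical schema: customers(id, email, ...) and orders(id, customer_id, order_date, total)
    CREATE INDEX ix_customers_email    ON customers (email);
    CREATE INDEX ix_orders_customer_id ON orders (customer_id);

    -- The predicate on customers.email is highly selective, so a good optimiser applies it
    -- first (one index seek), then probes ix_orders_customer_id with the few surviving keys.
    SELECT o.id, o.order_date, o.total
    FROM customers AS c
    JOIN orders    AS o ON o.customer_id = c.id
    WHERE c.email = 'someone@example.com';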

Admittedly this type of optimisation can also be applied to denormalised databases, but the sort of people who want to denormalise a schema typically don't think about cardinality when (if) they set up indexes.

It is important to understand that table scans (examination of every row in a table in the course of producing a join) are rare in practice. A query optimiser will choose a table scan only when one or more of the following holds.

  • There are fewer than 200 rows in the relation (in this case a scan will be cheaper)
  • There are no suitable indexes on the join columns (if it's meaningful to join on these columns then why aren't they indexed? fix it)
  • A type coercion is required before the columns can be compared (WTF?! fix it or go home) SEE END NOTES FOR ADO.NET ISSUE
  • One of the arguments of the comparison is an expression (no index)
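
For the second and third bullets above, the remedy is usually a one-liner. A sketch in T-SQL with hypothetical table and column names - assume order_lines.order_id is the join column and was mistakenly declared as a string:

    -- Type coercion needed before the comparison (third bullet)? Fix the column type so it
    -- matches the other side of the join (T-SQL syntax; check dependent objects first).
    ALTER TABLE order_lines ALTER COLUMN order_id INT NOT NULL;

    -- No suitable index on the join column (second bullet)? Create one.
    CREATE INDEX ix_order_lines_order_id ON order_lines (order_id);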

Performing an operation is more expensive than not performing it. However, performing the wrong operation, being forced into pointless disk I/O and then discarding the dross prior to performing the join you really need, is much more expensive. Even when the "wrong" operation is precomputed and indexes have been sensibly applied, there remains significant penalty. Denormalising to precompute a join - notwithstanding the update anomalies entailed - is a commitment to a particular join. If you need a different join, that commitment is going to cost you big.

If anyone wants to remind me that it's a changing world, I think you'll find that bigger datasets on gruntier hardware just exaggerates the spread of Date's findings.

For all of you who work on billing systems or junk mail generators (shame on you) and are indignantly setting hand to keyboard to tell me that you know for a fact that denormalisation is faster, sorry but you're living in one of the special cases - specifically, the case where you process all of the data, in-order. It's not a general case, and you are justified in your strategy.

You are not justified in falsely generalising it. See the end of the notes section for more information on appropriate use of denormalisation in data warehousing scenarios.

I'd also like to respond to

    Joins are just cartesian products with some lip gloss.

What a load of bollocks. Restrictions are applied as early as possible, most restrictive first. You've read the theory, but you haven't understood it. Joins are treated as "cartesian products to which predicates apply" only by the query optimiser. This is a symbolic representation (a normalisation, in fact) to facilitate symbolic decomposition so the optimiser can produce all the equivalent transformations and rank them by cost and selectivity so that it can select the best query plan.

The only way you will ever get the optimiser to produce a cartesian product is to fail to supply a predicate: SELECT * FROM A,B
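
By contrast, as soon as a join predicate is supplied the optimiser never needs to build A × B; it can drive the join from whichever side has a usable index. A minimal sketch, assuming a hypothetical indexed column B.a_id referencing A.id:

    SELECT *
    FROM A
    JOIN B ON B.a_id = A.id;  -- with an index on B.a_id this becomes index seeks, not a cartesian product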


Notes


David Aldridge provides some important additional information.

There is indeed a variety of other strategies besides indexes and table scans, and a modern optimiser will cost them all before producing an execution plan.

A practical piece of advice: if it can be used as a foreign key then index it, so that an index strategy is available to the optimiser.
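
Worth spelling out because several mainstream engines (SQL Server and PostgreSQL among them) do not create an index on the referencing column when you declare a foreign key. A minimal sketch with hypothetical names:

    -- Declaring the constraint documents the relationship but does not index invoices.customer_id
    ALTER TABLE invoices
        ADD CONSTRAINT fk_invoices_customer FOREIGN KEY (customer_id) REFERENCES customers (id);

    -- Add the index explicitly so an index strategy is available to the optimiser for joins on it
    CREATE INDEX ix_invoices_customer_id ON invoices (customer_id);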

I used to be smarter than the MSSQL optimiser. That changed two versions ago. Now it generally teaches me. It is, in a very real sense, an expert system, codifying all the wisdom of many very clever people in a domain sufficiently closed that a rule-based system is effective.


"Bollocks" may have been tactless. I am asked to be less haughty and reminded that math doesn't lie. This is true, but not all of the implications of mathematical models should necessarily be taken literally. Square roots of negative numbers are very handy if you carefully avoid examining their absurdity (pun there) and make damn sure you cancel them all out before you try to interpret your equation.

The reason that I responded so savagely was that the statement as worded says that

    joins are cartesian products...

This may not be what was meant but it is what was written, and it's categorically untrue. A cartesian product is a relation. A join is a function. More specifically, a join is a relation-valued function. With an empty predicate it will produce a cartesian product, and checking that it does so is one correctness check for a database query engine, but nobody writes unconstrained joins in practice because they have no practical value outside a classroom.

I called this out because I don't want readers falling into the ancient trap of confusing the model with the thing modelled. A model is an approximation, deliberately simplified for convenient manipulation.


The cut-off for selection of a table-scan join strategy may vary between database engines. It is affected by a number of implementation decisions such as tree-node fill-factor, key-value size and subtleties of algorithm, but broadly speaking high-performance indexing has an execution time of k log n + c. The c term is a fixed overhead mostly made of setup time, and the shape of the curve means you don't get a payoff (compared to a linear search) until n is in the hundreds.
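
To see where "the hundreds" comes from, here is a back-of-the-envelope comparison with purely illustrative constants (k = 10 cost units per tree level, c = 100 units of setup, a scan costing 1 unit per row - none of these numbers come from the answer itself):

    T_{\text{index}}(n) = 10\log_2 n + 100 \qquad\text{vs}\qquad T_{\text{scan}}(n) = n
    n = 100:\; 10\log_2 100 + 100 \approx 166 > 100 \quad\text{(scan wins)}
    n = 500:\; 10\log_2 500 + 100 \approx 190 < 500 \quad\text{(index wins)}

With these made-up constants the crossover lands around n ≈ 180, consistent with the "fewer than 200 rows" rule of thumb earlier in the answer.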


Sometimes denormalisation is a good idea

Denormalisation is a commitment to a particular join strategy. As mentioned earlier, this interferes with other join strategies. But if you have buckets of disk space, predictable patterns of access, and a tendency to process much or all of it, then precomputing a join can be very worthwhile.

You can also figure out the access paths your operation typically uses and precompute all the joins for those access paths. This is the premise behind data warehouses, or at least it is when they're built by people who know why they're doing what they're doing, and not just for the sake of buzzword compliance.

A properly designed data warehouse is produced periodically by a bulk transformation out of a normalised transaction processing system. This separation of the operations and reporting databases has the very desirable effect of eliminating the clash between OLTP and OLAP (online transaction processing ie data entry, and online analytical processing ie reporting).

An important point here is that apart from the periodic updates, the data warehouse is read only. This renders moot the question of update anomalies.

Don't make the mistake of denormalising your OLTP database (the database on which data entry happens). It might be faster for billing runs but if you do that you will get update anomalies. Ever tried to get Reader's Digest to stop sending you stuff?

Disk space is cheap these days, so knock yourself out. But denormalising is only part of the story for data warehouses. Much bigger performance gains are derived from precomputed rolled-up values: monthly totals, that sort of thing. It's always about reducing the working set.
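
A sketch of what such a rollup might look like, in T-SQL; the fact_sales and monthly_sales names, the columns, and the idea of refreshing the summary in the same periodic batch that loads the warehouse are all illustrative assumptions, not something prescribed above:

    -- Hypothetical summary table, rebuilt by the periodic warehouse load
    CREATE TABLE monthly_sales (
        sale_month   DATE          NOT NULL,  -- first day of the month
        product_id   INT           NOT NULL,
        total_qty    BIGINT        NOT NULL,
        total_amount DECIMAL(18,2) NOT NULL,
        PRIMARY KEY (sale_month, product_id)
    );

    INSERT INTO monthly_sales (sale_month, product_id, total_qty, total_amount)
    SELECT DATEFROMPARTS(YEAR(sale_date), MONTH(sale_date), 1),
           product_id,
           SUM(quantity),
           SUM(amount)
    FROM fact_sales
    GROUP BY DATEFROMPARTS(YEAR(sale_date), MONTH(sale_date), 1), product_id;

Reports that need monthly totals then read monthly_sales directly instead of re-aggregating fact_sales - exactly the working-set reduction described above.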


ADO.NET problem with type mismatches

Suppose you have a SQL Server table containing an indexed column of type varchar, and you use AddWithValue to pass a parameter constraining a query on this column. C# strings are Unicode, so the inferred parameter type will be NVARCHAR, which doesn't match VARCHAR.

VARCHAR to NVARCHAR is a widening conversion so it happens implicitly - but say goodbye to indexing, and good luck working out why.
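
Seen from the server side, the mismatch looks roughly like the sketch below (the table, column and parameter names are hypothetical; only the nvarchar parameter type matters). On the calling side the usual fix is to declare the parameter type explicitly - for example with a SqlParameter whose SqlDbType is VarChar - rather than letting AddWithValue infer NVARCHAR from the C# string.

    -- Roughly what the driver submits when AddWithValue infers an NVARCHAR parameter
    -- (account_code is an indexed VARCHAR column in this hypothetical table):
    EXEC sp_executesql
        N'SELECT * FROM customers WHERE account_code = @code',
        N'@code nvarchar(10)',
        @code = N'AB1234';
    -- account_code is implicitly widened to NVARCHAR for the comparison, so - as described
    -- above - the index on it is typically no longer usable for a straightforward seek.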


"Count the disk hits" (Rick James)

If everything is cached in RAM, JOINs are rather cheap. That is, normalization does not have much performance penalty.

If a "normalized" schema causes JOINs to hit the disk a lot, but the equivalent "denormalized" schema would not have to hit the disk, then denormalization wins a performance competition.
