PROBLEM DESCRIPTION
I recently became involved with a new software project which uses SQL Server 2000 for its data storage.
In reviewing the project, I discovered that one of the main tables uses a clustered index on its primary key which consists of four columns:
Sequence numeric(18, 0)
Date datetime
Client varchar(9)
Hash tinyint
This table experiences a lot of inserts in the course of normal operation.
Now, I'm a C++ developer, not a DB Admin, but my first impression of this table design was that having these fields as a clustered index would be very detrimental to insert performance, since the data would have to be physically reordered on each insert.
In addition, I can't really see any benefit to this since one would have to be querying all of these fields frequently to justify the clustered index, right?
So basically I need some ammunition for when I go to the powers that be to convince them that the table design should be changed.
ANSWER

The clustered index should contain the column(s) queried on most often, to give the greatest chance of seeks, or of making a nonclustered index cover all the columns in the query.
The primary key and the clustered index do not have to be the same. They are both candidate keys, and tables often have more than one such key.
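To make that concrete, here is a minimal sketch in SQL Server 2000 syntax; the table name MainTable and the index/constraint names are my assumptions, while the column list comes from the question:

```sql
-- The four-column key remains the PRIMARY KEY, but declared NONCLUSTERED,
-- which frees the clustered index to live on whatever serves queries best.
CREATE TABLE MainTable (
    Sequence numeric(18, 0) NOT NULL,
    [Date]   datetime       NOT NULL,
    Client   varchar(9)     NOT NULL,
    Hash     tinyint        NOT NULL,
    CONSTRAINT PK_MainTable
        PRIMARY KEY NONCLUSTERED (Sequence, [Date], Client, Hash)
)
GO

-- The clustered index can then use only the most-queried column(s).
CREATE CLUSTERED INDEX CIX_MainTable_Sequence ON MainTable (Sequence)
GO
```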
You said:

"one would have to be querying all of these fields frequently to justify the clustered index"
That's not true. A seek can be had just by using the first column or two of the clustered index. It may be a range seek, but it's still a seek. You don't have to specify all the columns of it in order to get that benefit. But the order of the columns does matter a lot. If you're predominantly querying on Client, then the Sequence column is a bad choice as the first in the clustered index. The choice of the second column should be the item that is most queried in conjunction with the first (not by itself). If you find that a second column is queried by itself almost as often as the first column, then a nonclustered index will help.
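A hypothetical illustration of this point (independent of the sketch above, and assuming Sequence is the most-queried column with [Date] the one most often paired with it):

```sql
-- Clustered index leading with the most-queried column.
CREATE CLUSTERED INDEX CIX_MainTable ON MainTable (Sequence, [Date])
GO

-- Both of these can seek, even though neither names every CI column:
SELECT * FROM MainTable WHERE Sequence = 123456789
SELECT * FROM MainTable WHERE Sequence = 123456789 AND [Date] >= '20090101'

-- If [Date] were queried by itself nearly as often as Sequence,
-- a separate nonclustered index on it would help:
CREATE NONCLUSTERED INDEX IX_MainTable_Date ON MainTable ([Date])
```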
As others have said, reducing the number of columns/bytes in the clustered index as much as possible is important.
It's too bad that the Sequence is a random value instead of an incrementing one, but that may not be something you can change. The answer isn't to throw in an identity column unless your application can start using it as the primary query condition on this table (unlikely). Now, since you're stuck with this random Sequence column (presuming it IS the most often queried), let's look at another of your statements:

"the data would have to be physically reordered on each insert"
That's not entirely true.
The physical location on the disk is not really what we're talking about here, but it does come into play in terms of fragmentation, which has real performance implications.
The rows inside each 8k page are not physically sorted. It's just that every row in a page sorts before all the rows in the next page and after all the rows in the previous one. The problem occurs when you insert a row and the page is full: you get a page split. The engine has to copy all the rows after the inserted row to a new page, and this can be expensive. With a random key you're going to get a lot of page splits. You can ameliorate the problem by using a lower fillfactor when rebuilding the index. You'd have to experiment to find the right number, but 70% or 60% might serve you better than 90%.
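In SQL Server 2000 terms that rebuild might look like the following (index name carried over from the earlier sketch; 70 is only a starting guess to tune):

```sql
-- Rebuild the clustered index leaving ~30% free space on each leaf page,
-- so randomly placed inserts find room before a page split is forced.
DBCC DBREINDEX ('MainTable', 'CIX_MainTable_Sequence', 70)
```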
I believe that having datetime as the second CI column could be beneficial. You'd still be dealing with pages that need to be split between two different Sequence values, but it's not nearly as bad as if the second column in the CI were also random: with a random second column you'd be guaranteed a page split on nearly every insert, whereas with an ascending value you can get lucky, and the row can simply be added to a page because the next Sequence number starts on the next page.
Shortening the data types of all the columns in a table, and reducing how many there are, along with doing the same for its nonclustered indexes, can boost performance too, since more rows per page = fewer page reads per request. Especially if the engine is forced to do a table scan. Moving a bunch of rarely-queried columns to a separate 1-1 table could do wonders for some of your queries.
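As a rough, assumption-heavy illustration of that 1-1 split (the wide column names below are entirely invented):

```sql
-- Rarely-queried, wide columns move to a side table sharing the same key,
-- so the hot table fits more rows into each 8k page.
CREATE TABLE MainTableExtra (
    Sequence numeric(18, 0) NOT NULL,
    [Date]   datetime       NOT NULL,
    Client   varchar(9)     NOT NULL,
    Hash     tinyint        NOT NULL,
    Comments varchar(2000)  NULL,   -- stand-ins for the wide columns
    RawData  text           NULL,
    CONSTRAINT PK_MainTableExtra
        PRIMARY KEY (Sequence, [Date], Client, Hash)
)
```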
Last, there are some design tweaks that could help as well (in my opinion); a sketch follows the list:
- Change the Sequence column to a bigint to save a byte for every row (8 bytes instead of 9 for the numeric).
- Use a lookup table for Client with a 4-byte int identity column instead of a varchar(9). This saves 5 bytes per row. If possible, use a smallint (-32768 to 32767) which is 2 bytes, an even greater savings of 7 bytes per row.
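Both tweaks together might look like this sketch (the names are assumptions, and smallint is only safe if there will never be more than about 32k clients):

```sql
-- Lookup table: each varchar(9) client code is stored once,
-- keyed by a 2-byte surrogate.
CREATE TABLE Client (
    ClientId   smallint IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    ClientCode varchar(9) NOT NULL UNIQUE
)
GO

-- Main table: bigint Sequence (8 bytes vs. 9 for numeric(18,0))
-- and a 2-byte ClientId (vs. 9 bytes for varchar(9)).
CREATE TABLE MainTable (
    Sequence bigint   NOT NULL,
    [Date]   datetime NOT NULL,
    ClientId smallint NOT NULL REFERENCES Client (ClientId),
    Hash     tinyint  NOT NULL
)
```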
Summary: The CI should start with the column most queried on. Remove any columns from the CI that you can. Shorten columns (bytes) as much as you can. Use a lower fillfactor to mitigate the page splits caused by the random Sequence column (if it has to stay first because of being queried the most).
Oh, and get your online defragging going. If the table can't be changed, at least it can be reorganized frequently to keep it in best possible shape. Don't neglect statistics, either, so the engine can pick appropriate execution plans.
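On SQL Server 2000, that routine maintenance could be something like this (object names carried over from the earlier sketches):

```sql
-- Online defrag: compacts and reorders leaf-level pages in place without
-- the long blocking of a full rebuild (0 = current database).
DBCC INDEXDEFRAG (0, 'MainTable', 'CIX_MainTable_Sequence')

-- Refresh statistics so the optimizer keeps picking sensible plans.
UPDATE STATISTICS MainTable
```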
UPDATE
Another strategy to consider is if the composite key used in the table can be converted to an int, and a lookup table of the values is created. Let's say some combination of less than all 4 columns is repeated in over 100 rows, for example, Sequence + Client + Hash but only with varying Date values. Then an insert to a separate SequenceClientHash table with an identity column could make sense, because then you could look up the artificial key once and use it over and over again. This would also get your CI to add new rows only on the last page (yay) and substantially reduce the size of the CI as repeated in all nonclustered indexes (yippee). But this would only make sense in certain narrow usage patterns.
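A hedged sketch of that strategy; SequenceClientHash comes from the text above, but the column names, the surrogate key, and the choice to cluster the main table on (SchId, Date) are all my assumptions:

```sql
-- Each repeated (Sequence, Client, Hash) combination is stored once,
-- keyed by a 4-byte artificial key.
CREATE TABLE SequenceClientHash (
    SchId    int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    Sequence numeric(18, 0) NOT NULL,
    Client   varchar(9)     NOT NULL,
    Hash     tinyint        NOT NULL,
    CONSTRAINT UQ_SequenceClientHash UNIQUE (Sequence, Client, Hash)
)
GO

-- The main table carries only the artificial key plus the varying Date.
CREATE TABLE MainTable (
    SchId  int      NOT NULL REFERENCES SequenceClientHash (SchId),
    [Date] datetime NOT NULL
)
GO
CREATE CLUSTERED INDEX CIX_MainTable ON MainTable (SchId, [Date])
```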
Now, marc_s suggested just adding an additional int identity column as the clustered index. It is possible that this could help by making all the nonclustered indexes get more rows per page, but it all depends on exactly where you want the performance to be, because this would guarantee that every single query on the table would have to use a bookmark lookup and you could never get a table seek.
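For comparison, marc_s's suggestion might be sketched as follows (this assumes the existing clustered primary key has first been dropped or re-declared nonclustered):

```sql
-- Surrogate identity becomes the clustered index: inserts append to the
-- last page, and every nonclustered index carries a compact 4-byte locator.
ALTER TABLE MainTable ADD RowId int IDENTITY(1, 1) NOT NULL
GO
CREATE CLUSTERED INDEX CIX_MainTable_RowId ON MainTable (RowId)
-- Trade-off: any query filtering on Sequence, Client, etc. must now use a
-- nonclustered index plus a bookmark lookup for each matching row.
```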
About "tons of page splits and bad index fragmentation": as I already said this can be ameliorated somewhat with a lower fill factor. Also, frequent online index reorganization (not the same as rebuilding) can help reduce the effect of this.
Ultimately, it all comes down to the exact system and its unique pattern of data access combined with decisions about which parts you want optimized. For some systems, having a slower insert isn't bad as long as selects are always fast. For others, having consistent but slightly slower select times is more important than having slightly faster but inconsistent select times. For others, the data isn't really read until it's pushed to a data warehouse anyway so the inserts need to be as fast as possible. And adding into the mix is the fact that performance isn't just about user wait time or even query response time but also about server resources especially in the case of massive parallelism, so that total throughput (say, in client responses per time unit) matters more than any other factor.