Trying to use indexes more efficiently on massive data.
I have an open source application that logs millions of records to a MySQL database. I've used MySQL databases in web development for years, and I understand enough about choosing efficient field types, the basics of why/how indexes are useful, etc. But the sheer volume of data our application logs, combined with the fact that it's hard to predict exactly which columns will be queried, has me a bit under water.
The application logs events by players. We have a very advanced purge system but some servers are so busy, they have 50 million records after just eight weeks.
At that size, even with our existing indexes, queries may still take 30-90 seconds.
The primary table schema (minus existing indexes):
CREATE TABLE IF NOT EXISTS `prism_data` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`epoch` int(10) unsigned NOT NULL,
`action_id` int(10) unsigned NOT NULL,
`player_id` int(10) unsigned NOT NULL,
`world_id` int(10) unsigned NOT NULL,
`x` int(11) NOT NULL,
`y` int(11) NOT NULL,
`z` int(11) NOT NULL,
`block_id` mediumint(5) DEFAULT NULL,
`block_subid` mediumint(5) DEFAULT NULL,
`old_block_id` mediumint(5) DEFAULT NULL,
`old_block_subid` mediumint(5) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
WHERE conditions most often include:

- world_id/x/y/z coordinates (queries all default to a radius around the user, so coordinates are almost always used)
- epoch (all queries default to the last three days; users need to override this for longer timeframes)
- action_id and/or player_id (half the time, users are looking for who did specific actions, or for what actions a specific player caused)
- Remaining queries could be any combination, more or less at random: block_id values combined with player or action, etc.
GROUP BY: By default the application groups by certain fields so that the user doesn't see 100 duplicate events for the same player/action/block; they just see a single record with a count. The grouped fields are action_id, player_id, block_id, and DATE(FROM_UNIXTIME(epoch)).
ORDER BY is always prism_data.epoch DESC, x ASC, z ASC, y ASC, id DESC. The epoch ordering is so that the user sees the most recent events first. The rest are so that a "rollback" engine gets things in the right order.
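Combining those defaults, a typical grouped query would be shaped roughly like this (a sketch, not the application's actual SQL: the world_id value, radius, and aliases are made up, and the ORDER BY is simplified to the epoch component via MAX() so the statement stays valid under ONLY_FULL_GROUP_BY; the full ordering described above also includes x, z, y, and id):

```sql
SELECT action_id, player_id, block_id,
       DATE(FROM_UNIXTIME(epoch)) AS event_day,
       COUNT(*)                   AS event_count
FROM prism_data
WHERE world_id = 1                            -- made-up world id
  AND (x BETWEEN 668 AND 868)
  AND (y BETWEEN -33 AND 167)
  AND (z BETWEEN 358 AND 558)
  AND epoch >= UNIX_TIMESTAMP() - 3 * 86400   -- default: last three days
GROUP BY action_id, player_id, block_id, DATE(FROM_UNIXTIME(epoch))
ORDER BY MAX(epoch) DESC                      -- most recent groups first
LIMIT 1000;
```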
Here is an example query without order/group:
SELECT *
FROM prism_data
INNER JOIN prism_players p ON p.player_id = prism_data.player_id
INNER JOIN prism_actions a ON a.action_id = prism_data.action_id
INNER JOIN prism_worlds w ON w.world_id = prism_data.world_id
LEFT JOIN prism_data_extra ex ON ex.data_id = prism_data.id
WHERE w.world = 'DeuxTiersMondes'
AND (prism_data.x BETWEEN 668 AND 868)
AND (prism_data.y BETWEEN -33 AND 167)
AND (prism_data.z BETWEEN 358 AND 558)
LIMIT 1000;
Using the index INDEX location (world_id, x, z, y), it still takes 15 seconds to find 1000 rows (or 50 seconds to find all 64735).
The explain for that query:
+----+-------------+------------+--------+---------------+----------+---------+--------------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+----------+---------+--------------------------------+------+--------------------------+
| 1 | SIMPLE | w | ref | PRIMARY,world | world | 767 | const | 1 | Using where; Using index |
| 1 | SIMPLE | prism_data | ref | location | location | 4 | minecraft.w.world_id | 6155 | Using index condition |
| 1 | SIMPLE | a | eq_ref | PRIMARY | PRIMARY | 4 | minecraft.prism_data.action_id | 1 | NULL |
| 1 | SIMPLE | p | eq_ref | PRIMARY | PRIMARY | 4 | minecraft.prism_data.player_id | 1 | NULL |
| 1 | SIMPLE | ex | ref | data_id | data_id | 4 | minecraft.prism_data.id | 1 | NULL |
+----+-------------+------------+--------+---------------+----------+---------+--------------------------------+------+--------------------------+
It just seems to me that looking for values this specific ought to be much faster. We're not even sorting/grouping in this query.
My questions:
I assume it makes the most sense to design an index for each of the common conditions I listed above, i.e. one index that combines world_id/x/y/z, one that combines action_id/player_id, and one for epoch. For certain queries this works well, but for others it doesn't. For a query that used world_id, player_id, and epoch, it only chose the world_id/x/y/z index.

- Can I/should I include a column in multiple indexes? Maybe one index for the full location, and one for world_id/player_id/epoch? I can't really tell what logic MySQL uses to choose which index fits best, but I assume that if an index covers more of the columns MySQL needs, it'll choose that one. A slight performance hit on writes is worth it if that helps my queries.
- Should I create an index which contains all of the fields I group by / order by? My EXPLAINs often show Using filesort, which I know is a main pain point for performance.
- Is there any benefit to individual indexes on most fields, even if they're also in a combined index?
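For reference, the index layouts being weighed in these questions would look something like this as DDL (a sketch only; the index names are made up):

```sql
-- One index per common condition, as described above:
ALTER TABLE prism_data
  ADD INDEX location (world_id, x, z, y),
  ADD INDEX actor    (action_id, player_id),
  ADD INDEX happened (epoch);

-- The alternative combined index for the world/player/time query
-- mentioned above; by the leftmost-prefix rule it also serves
-- lookups on (world_id) and (world_id, player_id):
ALTER TABLE prism_data
  ADD INDEX world_player_time (world_id, player_id, epoch);
```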
Sorry for the long read.
I'm doing a lot of profiling of our 5 most common queries with different index setups, but I have a feeling that I may be missing some basics. I'd rather have some true experts school me on what I'm missing before I continue.
MySQL (and other RDBMS systems) makes good use of covering indexes. So, to use your example, suppose you're looking up:
SELECT prism_data.id,
prism_data.action_id,
prism_data.world_id
FROM prism_data
INNER JOIN prism_worlds w ON w.world_id = prism_data.world_id
WHERE w.world = 'DeuxTiersMondes'
AND (prism_data.x BETWEEN 668 AND 868)
AND (prism_data.y BETWEEN -33 AND 167)
AND (prism_data.z BETWEEN 358 AND 558)
ORDER BY prism_data.id DESC
LIMIT 1000;
The following BTREE index on prism_data will probably help a bunch with query performance (almost all MySQL indexes are BTREE indexes):
(world_id, x, y, z, id, action_id)
The whole of this query on prism_data can be satisfied just from the index. It's called a covering index because the server can find everything it needs to satisfy -- to cover -- the query in the index, and so doesn't have to bounce over to the data table itself. It'll do an index identity scan on world_id, then a range scan on x, and then look at the y, and z values for matching the rest of your query. It will then pull out the id values, order them, and return the LIMIT 1000 partial result set.
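Expressed as DDL, that suggestion would be roughly the following (the index name is an assumption; note that InnoDB appends the primary key to every secondary index anyway, so listing id explicitly mainly documents the intent):

```sql
-- Covering index for the query above: equality on world_id,
-- range on x, filter on y/z, then id/action_id read from the
-- index without touching the table rows.
ALTER TABLE prism_data
  ADD INDEX cover_location (world_id, x, y, z, id, action_id);
```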
You should absolutely stop using SELECT *. When you say SELECT *, you deny MySQL any knowledge of which columns you actually need, so you defeat the optimizer logic that chooses a covering-index query over a raw table query.
If your data are fairly evenly distributed over x and y, and you can use MyISAM, you may want to look into using geospatial indexes. These do a better job of random-accessing x/y ranges than ordinary indexes.
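A hedged sketch of what that could look like, assuming a hypothetical MyISAM side table (table and column names are made up; ST_GeomFromText assumes MySQL 5.6+, older versions use GeomFromText):

```sql
-- Hypothetical MyISAM variant with a spatial column packing
-- the x/z plane into a single POINT value:
CREATE TABLE prism_data_geo (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  pt POINT NOT NULL,
  PRIMARY KEY (id),
  SPATIAL INDEX (pt)
) ENGINE=MyISAM;

-- Radius-style lookup via the bounding rectangle from the
-- example query (668..868 on x, 358..558 on z):
SELECT id
FROM prism_data_geo
WHERE MBRContains(
  ST_GeomFromText('POLYGON((668 358, 868 358, 868 558, 668 558, 668 358))'),
  pt);
```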
Elaborate index setups do slow down insertion and update; it's definitely a tradeoff.