Problem description
I previously asked a question about how to analyse large datasets ("How can I analyse 13GB of data?"). One promising response was to load the data into a MySQL database using natural keys and thereby make use of InnoDB's clustered index.
I've added the data to the database with a schema that looks like this:
TorrentsPerPeer
+----------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+------------------+------+-----+---------+-------+
| ip | int(10) unsigned | NO | PRI | NULL | |
| infohash | varchar(40) | NO | PRI | NULL | |
+----------+------------------+------+-----+---------+-------+
The two fields together form the primary key.
This table records known instances of peers downloading torrents. I'd like to report how many torrents can be found at each peer, so I'm going to draw a histogram of how often each torrent count occurs (e.g. 20 peers have 2 torrents, 40 peers have 3, ...).
I've written the following query:
SELECT `count`, COUNT(`ip`)
FROM (SELECT `ip`, COUNT(`infohash`) AS `count`
FROM TorrentsPerPeer
GROUP BY `ip`) AS `counts`
GROUP BY `count`;
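To make the intended result concrete, here is a minimal sketch that runs a query of the same shape on a handful of made-up rows. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for MySQL/InnoDB, and the inner alias is renamed to torrent_count for the illustration; the sample data is invented.

```python
import sqlite3

# Toy stand-in for TorrentsPerPeer: in-memory SQLite, not the MySQL/InnoDB table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TorrentsPerPeer ("
    " ip INTEGER NOT NULL,"
    " infohash VARCHAR(40) NOT NULL,"
    " PRIMARY KEY (ip, infohash))"
)

# Made-up sample rows: peers 1 and 2 each have 2 torrents, peer 3 has 1.
rows = [
    (1, "a" * 40), (1, "b" * 40),
    (2, "a" * 40), (2, "c" * 40),
    (3, "a" * 40),
]
conn.executemany("INSERT INTO TorrentsPerPeer VALUES (?, ?)", rows)

# Inner query: torrents per peer. Outer query: number of peers per torrent count.
histogram = conn.execute(
    "SELECT torrent_count, COUNT(ip) AS peers"
    " FROM (SELECT ip, COUNT(infohash) AS torrent_count"
    "       FROM TorrentsPerPeer GROUP BY ip) AS counts"
    " GROUP BY torrent_count ORDER BY torrent_count"
).fetchall()

print(histogram)  # [(1, 1), (2, 2)]: one peer with 1 torrent, two peers with 2
```

Each output pair is (number of torrents, number of peers with that many torrents), which is exactly the data the histogram needs.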
Here is the EXPLAIN for the sub-select:
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_length | ref | rows | Extra |
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
| 1 | SIMPLE | TorrentPerPeer | index | [Null] | PRIMARY | 126 | [Null] | 79262772 | Using index |
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
I can't seem to run EXPLAIN for the full query because it takes far too long. This bug report suggests that is because the subquery is executed first.
This query is currently running (and has been for an hour). top reports that mysqld is using only ~5% of the available CPU while its RSIZE is steadily increasing. My assumption is that the server is building temporary tables in RAM that it uses to complete the query.
My question, then: how can I improve the performance of this query? Should I change the query somehow? I've been altering the server settings in the my.cnf file to increase the InnoDB buffer pool size; should I change any other values?
If it matters, the table is 79'262'772 rows deep and takes up ~8GB of disk space. I'm not expecting this to be an easy query; maybe 'patience' is the only reasonable answer.
EDIT: Just to add that the query has finished, and it took 105 minutes. That's not unbearable; I'm just hoping for some improvements.
Recommended answer
My hunch is that with an unsigned int and a varchar(40) (especially the varchar!) you now have a HUGE primary key, and it is making your index too big to fit in whatever RAM you have given the InnoDB buffer pool (innodb_buffer_pool_size). That would force InnoDB to go to disk to swap index pages in and out as it scans, which means a LOT of disk seeks and not a lot of CPU work.
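As a rough sanity check on that hunch, the arithmetic below estimates a lower bound for the clustered index size. The row count comes from the question; the per-row byte counts are assumptions (single-byte hex characters in the infohash, a 1-byte VARCHAR length prefix, and roughly 18 bytes of InnoDB per-row overhead), and page fill and non-leaf pages are ignored.

```python
ROWS = 79_262_772        # row count stated in the question
IP_BYTES = 4             # INT UNSIGNED
INFOHASH_BYTES = 40 + 1  # assumed: 40 single-byte chars + 1-byte length prefix
ROW_OVERHEAD = 18        # assumed: ~5-byte record header + 6-byte trx id + 7-byte roll pointer

# Bytes taken by the key columns alone, then with the assumed per-row overhead.
key_bytes = ROWS * (IP_BYTES + INFOHASH_BYTES)
with_overhead = ROWS * (IP_BYTES + INFOHASH_BYTES + ROW_OVERHEAD)

GIB = 1024 ** 3
print(f"key columns only:  {key_bytes / GIB:.2f} GiB")
print(f"with row overhead: {with_overhead / GIB:.2f} GiB")
```

Under these assumptions the data is already several GiB before any page overhead, which is the same order of magnitude as the ~8GB on disk reported in the question. If the buffer pool is much smaller than that, a full index scan cannot stay in memory.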
One thing I did for a similar issue was to use something in between a truly natural key and a surrogate key. We took the 2 fields that were actually unique (one of which was also a varchar), computed a fixed-width MD5 hash of them in the application layer, and used THAT as the key. Yes, it means more work for the app, but it makes for a much smaller index since you are no longer indexing an arbitrary-length field.
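A minimal sketch of that idea, assuming the application is written in Python: the helper name make_key, the byte layout of the hashed payload, and the lower-casing of the infohash are all hypothetical choices for illustration, not details from the answer.

```python
import hashlib
import struct


def make_key(ip: int, infohash: str) -> bytes:
    """Hypothetical helper: derive a fixed-width 16-byte key from (ip, infohash)."""
    # Pack the IP as a 4-byte big-endian unsigned int so the encoding is unambiguous.
    packed_ip = struct.pack(">I", ip)
    # Normalise case so the same infohash always hashes identically (an assumption).
    payload = packed_ip + infohash.lower().encode("ascii")
    # An MD5 digest is always 16 bytes, whatever the input length.
    return hashlib.md5(payload).digest()


# Made-up example values.
key = make_key(3232235777, "0123456789abcdef0123456789abcdef01234567")
print(len(key), key.hex())
```

The 16-byte digest could then be stored in a fixed-width binary column and used as the key. MD5 appears here only because the answer names it. Note that the original ip column still has to be kept if queries need to group by it.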
OR you could just use a server with tons of RAM and see if that makes the index fit in memory, but I always like to make 'throw hardware at it' a last resort :)