Problem description
I desperately need help with a query that has been causing a lot of grief over the past 6 months on a high-traffic website. I am a frontend developer who can write simple SQL queries, so I am unable to fix this issue on my own. The query now frequently locks up the MySQL database, whether for lack of memory or CPU or because of interference from other queries on the VPS. I upgraded the hardware, but that alone didn't resolve the issue. Here is a description of what the query is trying to do:
A user accesses a particular URL (say, the_source_url). The application tries to fetch related source_urls that other users (who have also accessed the_source_url) have accessed earlier, sorted from most frequently accessed to least frequently accessed. Basically, the application is trying to find users with similar interests and show other pages they have accessed earlier.
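To make the intended behaviour concrete, here is a minimal in-memory sketch of the logic described above. The function name `related_urls`, the `(user_id, source_url)` tuple layout, and the sample URLs are assumptions for illustration only; the real application does this in SQL.

```python
from collections import Counter

def related_urls(visits, the_source_url):
    """Hypothetical helper: rank other URLs visited by users who saw the_source_url."""
    # Step 1: users who have accessed the_source_url.
    users = {user for user, url in visits if url == the_source_url}
    # Step 2: count every visit row those users made to any other URL.
    counts = Counter(
        url for user, url in visits
        if user in users and url != the_source_url
    )
    # Step 3: most frequently accessed first.
    return counts.most_common()

# Made-up sample data: (user_id, source_url) pairs.
visits = [
    ("u1", "a.com"), ("u1", "b.com"),
    ("u2", "a.com"), ("u2", "b.com"), ("u2", "c.com"),
    ("u3", "c.com"),  # u3 never visited a.com, so this row is ignored
]
ranking = related_urls(visits, "a.com")
print(ranking)
```

Note that each visit row counts once, so a user who visited a related page several times contributes several counts, which matches COUNT(SOURCE_URL) with GROUP BY.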
This is the "complex query" that I wrote when I was young/stupid and the website had no traffic:
SELECT DISTINCT(SOURCE_URL), COUNT(SOURCE_URL) CATCOUNT
FROM topsources
WHERE SOURCE_URL <> ?
AND USER_ID IN (SELECT DISTINCT(USER_ID)
FROM topsources WHERE SOURCE_URL = ?)
GROUP BY SOURCE_URL ORDER BY CATCOUNT DESC
This is the table structure:
CREATE TABLE `topsources` (
`USER_ID` varchar(255) NOT NULL,
`DATE_AND_HOUR` varchar(255) NOT NULL,
`UPDATED_TIME` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`ITEM_ID` int(11) NOT NULL,
`SOURCE_URL` varchar(100) NOT NULL,
`FEED_PAGE_URL` varchar(255) NOT NULL,
`CATEGORY_URL` varchar(100) NOT NULL,
`REFERRER` varchar(2048) DEFAULT NULL,
PRIMARY KEY (`USER_ID`,`DATE_AND_HOUR`(30),`ITEM_ID`),
KEY `USER_ID` (`USER_ID`),
KEY `FEED_PAGE_URL` (`FEED_PAGE_URL`),
KEY `SOURCE_URL` (`SOURCE_URL`),
KEY `CATEGORY_URL` (`CATEGORY_URL`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Please note that I have already tried using an inner join instead of the subselect, but that doesn't work: the query below doesn't return the same result as the query above.
SELECT DISTINCT(ts.SOURCE_URL), COUNT(ts.SOURCE_URL) CATCOUNT
FROM topsources ts
INNER JOIN topsources tsi
ON ts.USER_ID = tsi.USER_ID AND tsi.SOURCE_URL = ?
WHERE ts.SOURCE_URL <> ? AND ts.CATEGORY_URL = ?
GROUP BY ts.SOURCE_URL ORDER BY ts.CATCOUNT DESC
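One plausible reason for the mismatch (an assumption, not something the post confirms) is that a plain self-join produces one joined row per matching visit, so a user who accessed the_source_url more than once is counted more than once. The sketch below reproduces that effect on made-up data, using an in-memory SQLite database as a stand-in for MySQL and a two-column toy version of `topsources`. It leaves out the extra `ts.CATEGORY_URL = ?` filter from the join attempt above, which would change the result further.

```python
import sqlite3

# Toy stand-in for topsources: only the two columns the queries touch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topsources (USER_ID TEXT, SOURCE_URL TEXT)")
conn.executemany(
    "INSERT INTO topsources VALUES (?, ?)",
    [
        ("u1", "a.com"),  # u1 visited the source URL twice
        ("u1", "a.com"),
        ("u1", "b.com"),
        ("u2", "a.com"),
        ("u2", "b.com"),
        ("u2", "c.com"),
        ("u3", "c.com"),  # u3 never visited a.com
    ],
)

SUBSELECT = """
SELECT SOURCE_URL, COUNT(SOURCE_URL) AS CATCOUNT
FROM topsources
WHERE SOURCE_URL <> ?
  AND USER_ID IN (SELECT DISTINCT USER_ID FROM topsources WHERE SOURCE_URL = ?)
GROUP BY SOURCE_URL ORDER BY CATCOUNT DESC, SOURCE_URL
"""

PLAIN_JOIN = """
SELECT ts.SOURCE_URL, COUNT(ts.SOURCE_URL) AS CATCOUNT
FROM topsources ts
INNER JOIN topsources tsi
  ON ts.USER_ID = tsi.USER_ID AND tsi.SOURCE_URL = ?
WHERE ts.SOURCE_URL <> ?
GROUP BY ts.SOURCE_URL ORDER BY CATCOUNT DESC, ts.SOURCE_URL
"""

subselect_rows = conn.execute(SUBSELECT, ("a.com", "a.com")).fetchall()
plain_join_rows = conn.execute(PLAIN_JOIN, ("a.com", "a.com")).fetchall()
print(subselect_rows)   # b.com counted once per visit by u1 and u2
print(plain_join_rows)  # u1's b.com visit is doubled by the two a.com rows
```

On this data the subselect counts b.com twice, while the plain join counts it three times because u1's two a.com rows each pair with u1's b.com row.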
Recommended answer
IN subselects are not optimised well in MySQL, at least in versions before 5.6. Instead of performing the subselect once and then the outer select, MySQL performs the subselect for each row matched by the outer select. Because of this, subselects in an IN clause should be replaced with joins. Here is a faster version of the query:
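-- Both ? placeholders take the same value: the URL being viewed.
-- GROUP BY already yields one row per SOURCE_URL, so no outer DISTINCT is needed.
SELECT ts.SOURCE_URL, COUNT(ts.SOURCE_URL) AS CATCOUNT
FROM topsources ts
INNER JOIN
-- users who have accessed the given URL, one row per user
(SELECT DISTINCT USER_ID
FROM topsources WHERE SOURCE_URL = ?) AS t
ON (ts.USER_ID = t.USER_ID)
WHERE ts.SOURCE_URL <> ?
GROUP BY ts.SOURCE_URL ORDER BY CATCOUNT DESC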
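As a rough usage sketch, the derived-table query can be wrapped in a small function that binds the same URL to both placeholders. The helper name `fetch_related_urls`, the two-column toy table, and the sample rows are assumptions for illustration, and an in-memory SQLite database stands in for MySQL, so this only demonstrates the query's logic, not its performance.

```python
import sqlite3

QUERY = """
SELECT ts.SOURCE_URL, COUNT(ts.SOURCE_URL) AS CATCOUNT
FROM topsources ts
INNER JOIN
  (SELECT DISTINCT USER_ID FROM topsources WHERE SOURCE_URL = ?) AS t
  ON ts.USER_ID = t.USER_ID
WHERE ts.SOURCE_URL <> ?
GROUP BY ts.SOURCE_URL
ORDER BY CATCOUNT DESC
"""

def fetch_related_urls(conn, the_source_url):
    """Hypothetical helper: run the derived-table query for one URL."""
    # The first placeholder sits in the derived table, the second in the
    # outer WHERE clause; both receive the same URL.
    return conn.execute(QUERY, (the_source_url, the_source_url)).fetchall()

# Made-up sample data in a toy version of the table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topsources (USER_ID TEXT, SOURCE_URL TEXT)")
conn.executemany(
    "INSERT INTO topsources VALUES (?, ?)",
    [
        ("u1", "a.com"), ("u1", "a.com"), ("u1", "b.com"),
        ("u2", "a.com"), ("u2", "b.com"), ("u2", "c.com"),
        ("u3", "c.com"),
    ],
)
related = fetch_related_urls(conn, "a.com")
print(related)
```

Because the derived table holds each USER_ID only once, u1's repeated visits to a.com do not inflate the counts for the other URLs.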