Question
I'm coding a website in PHP/MySQL and I'd like to implement a tagging engine similar to Stack Overflow's. I have 3 relevant tables in the DB: 1. Items, 2. Tags, 3. ItemTagMap (maps tags to items, n:n mapping).
Now, on the search page I'd like to show a distinct list of all tags for the entire search result (not just the current page), so that users can "refine" their search by adding/removing tags from that tag list.
The problem is that it's a pretty heavy query on the DB, and there can be tons of search requests that produce different result sets and thus different tag sets.
Does anyone know how to implement this effectively?
Recommended answer
Before we go into premature optimization mode, it may be useful to look into the following query template. If nothing else this could be used as a baseline against which the effectiveness of possible optimizations can be measured.
SELECT T.TagId, TagInfo.TagName, COUNT(*)
FROM Items I
JOIN ItemTagMap T ON I.ItemId = T.ItemId
JOIN Tags TagInfo ON TagInfo.TagId = T.TagId
-- JOIN ItemTagMap T1 ON I.ItemId = T1.ItemId
WHERE I.ItemId IN
(
    SELECT ItemId
    FROM Items
    WHERE -- Some typical initial search criteria
        Title LIKE 'Bug Report%' -- Or some fulltext filter instead...
        AND ItemDate > '2008-02-22'
        AND Status = 'C'
)
-- AND T1.TagId = 'MySql'
GROUP BY T.TagId, TagInfo.TagName
ORDER BY COUNT(*) DESC
The subquery is the "driving query", i.e. the one corresponding to the end-user's initial criteria (see below for how this query, which is needed more than once, may fit into an overall optimized flow). Commented out are the JOIN on T1 (and possibly T2, T3, when several tags are selected) and the associated condition in the WHERE clause. These are needed when the user selects a particular tag, whether as part of the initial search or by refinement. (It may be more efficient to place these joins and WHERE conditions within the subquery; more on this below.)
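As a rough illustration of how the commented-out T1 join and filter could be switched on per selected tag, here is a minimal sketch. It assumes SQLite (via Python's sqlite3) as a stand-in for MySQL; the sample rows, the integer tag ids, and the helper name `tag_counts` are made up for the example and are not part of the original answer.

```python
import sqlite3

# SQLite stands in for MySQL; the sample data and integer TagIds are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Items (ItemId INTEGER PRIMARY KEY, Title TEXT, ItemDate TEXT, Status TEXT);
CREATE TABLE Tags (TagId INTEGER PRIMARY KEY, TagName TEXT);
CREATE TABLE ItemTagMap (ItemId INTEGER, TagId INTEGER, PRIMARY KEY (ItemId, TagId));
INSERT INTO Items VALUES
  (1, 'Bug Report: login fails', '2008-03-01', 'C'),
  (2, 'Bug Report: slow query',  '2008-03-05', 'C'),
  (3, 'Bug Report: old issue',   '2008-01-10', 'C'),
  (4, 'Feature request',         '2008-03-02', 'C');
INSERT INTO Tags VALUES (1, 'php'), (2, 'mysql'), (3, 'ajax');
INSERT INTO ItemTagMap VALUES (1,1),(1,2),(2,2),(2,3),(3,1),(4,3);
""")

def tag_counts(conn, selected_tag_ids=()):
    """Count tags over the whole search result, optionally refined by tags."""
    joins, filters, params = [], [], []
    # One extra ItemTagMap join (T1, T2, ...) and one filter per selected tag.
    for n, tag_id in enumerate(selected_tag_ids, start=1):
        joins.append(f"JOIN ItemTagMap T{n} ON I.ItemId = T{n}.ItemId")
        filters.append(f"AND T{n}.TagId = ?")
        params.append(tag_id)
    sql = f"""
        SELECT T.TagId, TagInfo.TagName, COUNT(*)
        FROM Items I
        JOIN ItemTagMap T ON I.ItemId = T.ItemId
        JOIN Tags TagInfo ON TagInfo.TagId = T.TagId
        {' '.join(joins)}
        WHERE I.ItemId IN (
            SELECT ItemId FROM Items
            WHERE Title LIKE 'Bug Report%'
              AND ItemDate > '2008-02-22'
              AND Status = 'C'
        )
        {' '.join(filters)}
        GROUP BY T.TagId, TagInfo.TagName
        ORDER BY COUNT(*) DESC, TagInfo.TagName
    """
    return conn.execute(sql, params).fetchall()

print(tag_counts(conn))        # tags over the unrefined search result
print(tag_counts(conn, [1]))   # refined to items that also carry tag 1 ('php')
```

Only the alias numbers are interpolated into the SQL text; the tag ids themselves are bound as parameters, which is the part that matters when the values come from user input.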
Discussion... The "driving query", or a variation thereof, is needed for two distinct purposes:
1. To provide the complete list of ItemIds, which is needed to enumerate all associated tags.
2. To provide the first N ItemId values (N being the display page size), for the purpose of looking up item details in the Items table.
Note that the complete list doesn't need to be sorted (or it may benefit from sorting in a different order), whereas the second list needs to be sorted based on the user's choice (say by date, descending, or by title, alphabetically ascending). Also note that if any sort order is required, the cost of the query implies dealing with the complete list: barring some unusual optimization by the SQL engine itself and/or some denormalization, the engine needs to "see" the last records of that list, in case they belong at the top, sort-wise.
This latter fact argues in favor of using the very same query for both purposes; the corresponding list can be stored in a temporary table. The general flow would be to quickly look up the top N Item records with their details and return them to the application at once. The application can then fetch, ajax-fashion, the list of tags for refinement. This list would be produced with a query akin to the one above, where the subquery is replaced by a "select * from temporaryTable". The odds are good that the SQL optimizer will decide to sort this list (in some cases); let's let it do that, rather than second-guessing it and sorting explicitly.
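The temporary-table flow could look roughly like the sketch below. It again uses SQLite through Python's sqlite3 as a stand-in for MySQL; the table name `SearchResult`, the helper names, and the sample data are hypothetical, and the date-descending sort is just one possible user choice.

```python
import sqlite3

# SQLite stands in for MySQL; the sample data and integer TagIds are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Items (ItemId INTEGER PRIMARY KEY, Title TEXT, ItemDate TEXT, Status TEXT);
CREATE TABLE Tags (TagId INTEGER PRIMARY KEY, TagName TEXT);
CREATE TABLE ItemTagMap (ItemId INTEGER, TagId INTEGER, PRIMARY KEY (ItemId, TagId));
INSERT INTO Items VALUES
  (1, 'Bug Report: login fails', '2008-03-01', 'C'),
  (2, 'Bug Report: slow query',  '2008-03-05', 'C'),
  (3, 'Bug Report: old issue',   '2008-01-10', 'C'),
  (4, 'Feature request',         '2008-03-02', 'C');
INSERT INTO Tags VALUES (1, 'php'), (2, 'mysql'), (3, 'ajax');
INSERT INTO ItemTagMap VALUES (1,1),(1,2),(2,2),(2,3),(3,1),(4,3);
""")

# Step 1: run the driving query once and keep its ItemIds (hypothetical table name).
conn.execute("""
    CREATE TEMPORARY TABLE SearchResult AS
    SELECT ItemId FROM Items
    WHERE Title LIKE 'Bug Report%' AND ItemDate > '2008-02-22' AND Status = 'C'
""")

def first_page(conn, page_size):
    """Step 2: item details for the first page, sorted by the user's choice."""
    return conn.execute("""
        SELECT I.ItemId, I.Title, I.ItemDate
        FROM Items I
        JOIN SearchResult R ON R.ItemId = I.ItemId
        ORDER BY I.ItemDate DESC
        LIMIT ?
    """, (page_size,)).fetchall()

def tags_for_result(conn):
    """Step 3 (a later ajax call): tags over the complete stored result."""
    return conn.execute("""
        SELECT T.TagId, TagInfo.TagName, COUNT(*)
        FROM Items I
        JOIN ItemTagMap T ON I.ItemId = T.ItemId
        JOIN Tags TagInfo ON TagInfo.TagId = T.TagId
        WHERE I.ItemId IN (SELECT ItemId FROM SearchResult)
        GROUP BY T.TagId, TagInfo.TagName
        ORDER BY COUNT(*) DESC, TagInfo.TagName
    """).fetchall()

print(first_page(conn, 1))
print(tags_for_result(conn))
```

One thing this sketch glosses over: a temporary table is tied to a single connection, so a separate ajax request would need either the same connection or a per-search result table keyed by some search id. That choice depends on the application and is left open here.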
One other point to consider is to bring the join(s) on the ItemTagMap table inside the "driving query" rather than keeping them outside as shown above. It is probably best to do so, both for performance and because it produces the right list for purpose #2 (displaying a page of items).
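A minimal sketch of that variant, where the tag joins live inside the driving query so that one query text serves both the item list and the tag list. SQLite (Python's sqlite3) stands in for MySQL; the helper names, the sample data, and the integer tag ids are assumptions made for the example.

```python
import sqlite3

# SQLite stands in for MySQL; the sample data and integer TagIds are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Items (ItemId INTEGER PRIMARY KEY, Title TEXT, ItemDate TEXT, Status TEXT);
CREATE TABLE Tags (TagId INTEGER PRIMARY KEY, TagName TEXT);
CREATE TABLE ItemTagMap (ItemId INTEGER, TagId INTEGER, PRIMARY KEY (ItemId, TagId));
INSERT INTO Items VALUES
  (1, 'Bug Report: login fails', '2008-03-01', 'C'),
  (2, 'Bug Report: slow query',  '2008-03-05', 'C'),
  (3, 'Bug Report: old issue',   '2008-01-10', 'C'),
  (4, 'Feature request',         '2008-03-02', 'C');
INSERT INTO Tags VALUES (1, 'php'), (2, 'mysql'), (3, 'ajax');
INSERT INTO ItemTagMap VALUES (1,1),(1,2),(2,2),(2,3),(3,1),(4,3);
""")

def driving_query(selected_tag_ids=()):
    """Build the driving query with one ItemTagMap join per selected tag."""
    joins = [
        f"JOIN ItemTagMap T{n} ON T{n}.ItemId = I.ItemId AND T{n}.TagId = ?"
        for n in range(1, len(selected_tag_ids) + 1)
    ]
    sql = f"""
        SELECT I.ItemId
        FROM Items I
        {' '.join(joins)}
        WHERE I.Title LIKE 'Bug Report%'
          AND I.ItemDate > '2008-02-22'
          AND I.Status = 'C'
    """
    return sql, list(selected_tag_ids)

def matching_items(conn, selected_tag_ids=()):
    """Purpose #2: the ItemIds that the item page is built from."""
    sql, params = driving_query(selected_tag_ids)
    return [row[0] for row in conn.execute(sql + " ORDER BY I.ItemId", params)]

def tag_counts(conn, selected_tag_ids=()):
    """Purpose #1: tags over the complete (already tag-filtered) result."""
    sub_sql, params = driving_query(selected_tag_ids)
    sql = f"""
        SELECT T.TagId, TagInfo.TagName, COUNT(*)
        FROM ItemTagMap T
        JOIN Tags TagInfo ON TagInfo.TagId = T.TagId
        WHERE T.ItemId IN ({sub_sql})
        GROUP BY T.TagId, TagInfo.TagName
        ORDER BY COUNT(*) DESC, TagInfo.TagName
    """
    return conn.execute(sql, params).fetchall()

print(matching_items(conn, [2, 3]))  # items tagged with both 'mysql' and 'ajax'
print(tag_counts(conn, [2, 3]))      # tags still available for refinement
```

Because the tag filters are applied inside the driving query, the outer tag query no longer needs its own T1, T2, ... joins.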
The query/flow described above will likely scale rather well, even on relatively modest hardware; tentatively into the half-million+ Items range, with sustained user searches of maybe up to 10 per second. One of the key factors would be the selectivity of the initial search criteria.
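Ideas for optimization
- [Depending on the typical search cases and on the data statistics] a possible denormalization, by bringing (indeed duplicating) some of the Items fields into the ItemTagMap table. Short fields in particular may be "welcome" there.
- As the data grows into the million+ Items range, we could exploit the typically strong correlation between some tags (e.g. on SO, PHP often comes with MySql, btw often for no good reason...) with various tricks. For example, introducing "multi-tag" TagIds could make the input logic a bit more complicated, but could also reduce the size of the map significantly.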
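To make the denormalization idea a little more concrete, here is a hedged sketch in which ItemDate and Status are duplicated into a copy of the map table, so that tag counts for criteria on those two fields can be computed without touching Items. SQLite (Python's sqlite3) stands in for MySQL; the table name `ItemTagMapDenorm`, the choice of duplicated columns, and the sample data are all assumptions.

```python
import sqlite3

# SQLite stands in for MySQL; the sample data and integer TagIds are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Items (ItemId INTEGER PRIMARY KEY, Title TEXT, ItemDate TEXT, Status TEXT);
CREATE TABLE Tags (TagId INTEGER PRIMARY KEY, TagName TEXT);
CREATE TABLE ItemTagMap (ItemId INTEGER, TagId INTEGER, PRIMARY KEY (ItemId, TagId));
INSERT INTO Items VALUES
  (1, 'Bug Report: login fails', '2008-03-01', 'C'),
  (2, 'Bug Report: slow query',  '2008-03-05', 'C'),
  (3, 'Bug Report: old issue',   '2008-01-10', 'C'),
  (4, 'Feature request',         '2008-03-02', 'C');
INSERT INTO Tags VALUES (1, 'php'), (2, 'mysql'), (3, 'ajax');
INSERT INTO ItemTagMap VALUES (1,1),(1,2),(2,2),(2,3),(3,1),(4,3);

-- Hypothetical denormalized map: short Items fields duplicated per (item, tag) row.
CREATE TABLE ItemTagMapDenorm (
    ItemId INTEGER, TagId INTEGER, ItemDate TEXT, Status TEXT,
    PRIMARY KEY (ItemId, TagId)
);
INSERT INTO ItemTagMapDenorm
SELECT M.ItemId, M.TagId, I.ItemDate, I.Status
FROM ItemTagMap M JOIN Items I ON I.ItemId = M.ItemId;
""")

def tag_counts_by_date_and_status(conn, after_date, status):
    """Tag counts for criteria on the duplicated fields only; Items is not read."""
    return conn.execute("""
        SELECT M.TagId, TagInfo.TagName, COUNT(*)
        FROM ItemTagMapDenorm M
        JOIN Tags TagInfo ON TagInfo.TagId = M.TagId
        WHERE M.ItemDate > ? AND M.Status = ?
        GROUP BY M.TagId, TagInfo.TagName
        ORDER BY COUNT(*) DESC, TagInfo.TagName
    """, (after_date, status)).fetchall()

print(tag_counts_by_date_and_status(conn, '2008-02-22', 'C'))
```

The trade-off is the usual one: the duplicated columns have to be kept in sync whenever an item's date or status changes, and criteria on fields that were not duplicated (the title filter, for instance) still require the Items table.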
-- 'nough said! --
Appropriate architecture and optimizations should be selected in light of the actual requirements and of the effective data statistical profile...