问题描述
我正在尝试找出访问存储在连接对象中的数据的最快方法.下面的示例与我的问题类似,但上下文不同,因为我处理的实际数据集的关系有些不直观.
I am trying to figure out the fastest way to access data stored in a junction object. The example below is analagous to my problem, but with a different context, because the actual dataset I am dealing with is somewhat unintuitive in its relationships.
我们有 3 个类:User
、Product
和 Rating
.用户与 Product
具有多对多关系,Rating
作为连接/通过"类.
We have 3 classes: User
, Product
, and Rating
. User has a many-to-many relationship to Product
with Rating
as the junction/'through' class.
Rating
对象存储几个问题的答案,这些问题是 1-5 范围内的整数评分(示例问题:Product
的质量如何,如何是 Product
的价值,Product
对用户有多友好).为简化起见,假设每个 User
对他们购买的每个 Product
进行评分.
The Rating
object stores the answers to several questions which are integer ratings on a scale of 1-5 (Example questions: How is the quality of the Product
, how is the value of the Product
, how user-friendly is the Product
). For simplification assume every User
rates every Product
they buy.
现在这是我要执行的计算:对于 User
,计算他们购买的所有 Product
的平均评分(即平均评分来自所有其他 User
,其中一个将来自这个 User
自己).然后我们可以告诉用户平均而言,您购买的产品被所有购买该产品的客户评为价值 3/5".
Now here is the calculation I want to perform: For a User
, calculate the average rating of all the Product
s they have bought (that is, the average rating from all other Users
, one of which will be from this User
themself). Then we can tell the user "On average, you buy products rated 3/5 for value by all customers who bought that product".
简单而缓慢的方法就是迭代用户的所有评论对象.如果我们假设每个用户都购买了少量(
The simple and slow way is just to iterate over all of a user's review objects. If we assume that each user has bought a small (<100) number of products, and each product has n ratings, this is O(100n) = O(n).
但是,我也可以执行以下操作:在 Product
类上,保留选择每个数字的 Rating
数量的计数器(例如,有多少 用户
s 对该产品的价值评价为 3/5).如果每次对 Product
进行评分时都增加该计数器,那么计算给定 Product
的平均值只需要检查每个 Rating
的 5 个计数器标准.
However, I could also do the following: On the Product
class, keep a counter of the number of Rating
s that selected each number (e.g. how many User
s rated this product 3/5 for value). If you increment that counter every time a Product
is rated, then computing the average for a given Product
just requires checking the 5 counters for each Rating
criteria.
这是一种有效的技术吗?它是否普遍使用/是否有名称?对我来说似乎很直观,但我对数据库的了解不够,无法判断是否存在一些根本性缺陷.
Is this a valid technique? Is it commonly employed/is there a name for it? It seems intuitive to me, but I don't know enough about databases to tell whether there's some fundamental flaw or not.
推荐答案
这是正常的.它最终是缓存:对状态进行冗余编码,以牺牲其他使用模式为代价使某些使用模式受益.当然,这也是一种复杂化.
This is normal. It is ultimately caching: encoding of state redundantly to benefit some patterns of usage at the expense of others. Of course it's also a complexification.
仅仅因为 RDBMS 数据结构是关系并不意味着您不能从某种简单的形式重新安排编码状态的方式.例如非规范化.
Just because the RDBMS data structure is relations doesn't mean you can't rearrange how you are encoding state from some straightforward form. Eg denormalization.
(有时冗余设计(包括像您这样的设计)在它们实际上不是非规范化的结果并且冗余不是非规范化导致或规范化消除的那种时被称为非规范化".SQL 数据库中的跨表依赖/约束 事实上,可以合理地将您的情况描述为涉及规范化而不保留 FD(函数依赖).从一个带有用户 id
和其他列,它们的 ratings
(一个关系)和它的 counter
.然后 ratings
在功能上决定 counter
因为 counter
= select count(*) from ratings
. 分解为 user
etc + counter
, 即表 User
, 和 user
+ ratings
, 取消分组到表 Rating
.)
(Sometimes redundant designs (including ones like yours) are called "denormalized" when they are not actually the result of denormalization and the redundancy is not the kind that denormalization causes or normalization removes. Cross Table Dependency/Constraint in SQL Database Indeed one could reasonably describe your case as involving normalization without preserving FDs (functional dependencies). Start with a table with a user's id
& other columns, their ratings
(a relation) & its counter
. Then ratings
functionally determines counter
since counter
= select count(*) from ratings
. Decompose to user
etc + counter
, ie table User
, and user
+ ratings
, which ungroups to table Rating
. )
你对谷歌搜索时使用的最佳术语有什么建议
我的一个经常评论:谷歌很多清晰、简洁和您的问题/问题/目标/需要的特定措辞,以及各种术语子集你可能会发现它们的标签 &没有您的特定名称(变量/数据库/表/列/约束/等).例如我什么时候可以在数据库中冗余存储(总和或总计)".人工措辞,而不仅仅是关键词,似乎有帮助.您最好的选择可能是优化 SQL 数据库设计以提高性能.有整本书('amazon isbn'),一些在线('pdf').(但可能主要是重新查询).研究与仓储相关的技术,因为 OLTP 数据库充当 OLAP 数据库的输入缓冲区,并使用 SQL 处理大数据.(例如快照调度.)
A frequent comment by me: Google many clear, concise & specific phrasings of your question/problem/goal/desiderata with various subsets of terms & tags as you may discover them with & without your specific names (of variables/databases/tables/columns/constraints/etc). Eg 'when can i store a (sum OR total) redundantly in a database'. Human phrasing, not just keywords, seems to help. Your best bet may be along the lines of optimizing SQL database designs for performance. There are entire books ('amazon isbn'), some online ('pdf'). (But maybe mostly re queries). Investigate techniques relevant to warehousing, since an OLTP database acts as an input buffer to an OLAP database, and using SQL with big data. (Eg snapshot scheduling.)
PS 我称之为缓存"(标签 caching 也是如此) 是(我的典型)相当抽象,以至于有严肃的笑话说 CS 中的一切都在缓存.(谷歌搜索......计算机科学中只有两个难题:缓存失效和命名事物."--Phil Karlton.)(欢迎两者.)
PS My calling this "caching" (so does tag caching) is (typical of me) rather abstract, to the point where there are serious-jokes that everything in CS is caching. (Googling... "There are only two hard problems in Computer Science: cache invalidation and naming things."--Phil Karlton.) (Welcome to both.)
这篇关于我可以在数据库多对多字段中使用计数器来减少查找吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!