Let's say I have a database table called "Scrape", possibly set up like:
UserID (int)
UserName (varchar)
Wins (int)
Losses (int)
ScrapeDate (datetime)
I'm trying to rank my users based on their win/loss ratio. However, each week I'll be scraping new data on the users and making another entry in the Scrape table.
How can I query a list of users sorted by wins/losses, but only taking into consideration the most recent entry (ScrapeDate)?
Also, do you think it matters that people will be hitting the site while a scrape may be in the middle of completing?
For example I could have:
1 - Bob - Wins: 320 - Losses: 110 - ScrapeDate: 7/8/09
1 - Bob - Wins: 360 - Losses: 122 - ScrapeDate: 7/17/09
2 - Frank - Wins: 115 - Losses: 20 - ScrapeDate: 7/8/09
This represents a scrape that has only updated Bob so far and is in the process of updating Frank, whose new row has yet to be inserted. How would you handle this situation as well?
So, my question is:
- How would you handle querying only the most recent scrape of each user to determine the rankings?
- Do you think the fact that the database may be in a state of updating (especially if a scrape could take up to 1 day to complete), and not all users have completely updated yet, matters? If so, how would you handle this?
Thank you, and thank you for the responses you have given me on my related question:
This is what I call the "greatest-n-per-group" problem. It comes up several times per week on StackOverflow.
I solve this type of problem using an outer join technique:
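-- Keep only rows s1 for which no newer row s2 exists for the same user.
SELECT s1.*,
       -- 1.0 * forces decimal division; NULLIF yields NULL instead of dividing by zero losses.
       1.0 * s1.Wins / NULLIF(s1.Losses, 0) AS win_loss_ratio
FROM Scrape s1
LEFT OUTER JOIN Scrape s2
  ON (s1.UserName = s2.UserName AND s1.ScrapeDate < s2.ScrapeDate)
WHERE s2.UserName IS NULL
ORDER BY win_loss_ratio DESC;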
This will return only one row for each username: the row with the greatest value in the ScrapeDate column. That's what the outer join is for: it tries to match s1 with some other row s2 with the same username and a greater date. If there is no such row, the outer join returns NULL for all columns of s2, and then we know s1 corresponds to the row with the greatest date for that given username.
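To make the matching step concrete, here is a minimal sketch that runs the join without its WHERE filter, so the NULLs that mark the latest row are visible. It assumes an in-memory SQLite database driven from Python, and it stores ScrapeDate as ISO-formatted text so that `<` compares dates correctly; both choices are illustrative assumptions, not part of the original setup.

```python
import sqlite3

# In-memory SQLite database (an assumption made purely for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Scrape (UserID INTEGER, UserName TEXT, "
    "Wins INTEGER, Losses INTEGER, ScrapeDate TEXT)"
)
# Bob's two rows from the question, with dates rewritten in ISO format.
conn.executemany(
    "INSERT INTO Scrape VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Bob", 320, 110, "2009-07-08"),
        (1, "Bob", 360, 122, "2009-07-17"),
    ],
)

# The join WITHOUT the WHERE filter: each s1 row next to any newer s2 row.
pairs = conn.execute(
    """
    SELECT s1.ScrapeDate, s2.ScrapeDate
    FROM Scrape s1
    LEFT OUTER JOIN Scrape s2
      ON (s1.UserName = s2.UserName AND s1.ScrapeDate < s2.ScrapeDate)
    ORDER BY s1.ScrapeDate
    """
).fetchall()

for s1_date, newer_date in pairs:
    # newer_date is None (SQL NULL) only for the most recent row.
    print(s1_date, "-> newer row:", newer_date)
```

The 2009-07-08 row finds a newer partner, while the 2009-07-17 row finds none and gets NULL, which is the row the WHERE clause keeps.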
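This should also work when you have a partially completed scrape in progress: a user whose new row has not been inserted yet is still represented by their previous row, which remains their most recent one.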
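The partially completed case can be checked against the sample rows from the question. This is a sketch under assumptions: SQLite in memory via Python, ISO-formatted date strings, and a `1.0 *` factor added so that SQLite does not truncate the ratio to an integer.

```python
import sqlite3

# In-memory SQLite database (illustrative assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Scrape (UserID INTEGER, UserName TEXT, "
    "Wins INTEGER, Losses INTEGER, ScrapeDate TEXT)"
)
# Mid-scrape state from the question: Bob is updated, Frank is not yet.
conn.executemany(
    "INSERT INTO Scrape VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Bob", 320, 110, "2009-07-08"),
        (1, "Bob", 360, 122, "2009-07-17"),
        (2, "Frank", 115, 20, "2009-07-08"),
    ],
)

# Outer-join query: keep each user's row that has no newer counterpart.
ranking = conn.execute(
    """
    SELECT s1.UserName, s1.ScrapeDate,
           1.0 * s1.Wins / s1.Losses AS win_loss_ratio
    FROM Scrape s1
    LEFT OUTER JOIN Scrape s2
      ON (s1.UserName = s2.UserName AND s1.ScrapeDate < s2.ScrapeDate)
    WHERE s2.UserName IS NULL
    ORDER BY win_loss_ratio DESC
    """
).fetchall()

for name, scraped_on, ratio in ranking:
    print(name, scraped_on, round(ratio, 2))
```

Frank is ranked using his 2009-07-08 row (ratio 5.75) and Bob using his 2009-07-17 row (about 2.95), so every user appears exactly once even though the scrape is incomplete. The rows being compared come from different scrape dates, which may or may not be acceptable for a given ranking.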
This technique isn't necessarily as speedy as the CTE and RANKING solutions other answers have given. You should try both and see what works better for you. The reason I prefer my solution is that it works in any flavor of SQL.
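For comparison, a ranking-style query might look like the sketch below. It is not the code from the other answers, which are not reproduced here; it is one plausible shape using a CTE with ROW_NUMBER(), and it assumes a SQLite build with window-function support (3.25 or later), ISO-formatted date strings, and an added `1.0 *` factor to avoid integer division.

```python
import sqlite3

# In-memory SQLite database (illustrative assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Scrape (UserID INTEGER, UserName TEXT, "
    "Wins INTEGER, Losses INTEGER, ScrapeDate TEXT)"
)
conn.executemany(
    "INSERT INTO Scrape VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Bob", 320, 110, "2009-07-08"),
        (1, "Bob", 360, 122, "2009-07-17"),
        (2, "Frank", 115, 20, "2009-07-08"),
    ],
)

# Outer-join form: keep rows that have no newer row for the same user.
outer_join_sql = """
    SELECT s1.UserName, 1.0 * s1.Wins / s1.Losses AS win_loss_ratio
    FROM Scrape s1
    LEFT OUTER JOIN Scrape s2
      ON (s1.UserName = s2.UserName AND s1.ScrapeDate < s2.ScrapeDate)
    WHERE s2.UserName IS NULL
    ORDER BY win_loss_ratio DESC
"""

# Ranking form: number each user's rows newest-first and keep number 1.
ranking_sql = """
    WITH ranked AS (
        SELECT UserName, Wins, Losses,
               ROW_NUMBER() OVER (
                   PARTITION BY UserName ORDER BY ScrapeDate DESC
               ) AS rn
        FROM Scrape
    )
    SELECT UserName, 1.0 * Wins / Losses AS win_loss_ratio
    FROM ranked
    WHERE rn = 1
    ORDER BY win_loss_ratio DESC
"""

via_join = conn.execute(outer_join_sql).fetchall()
via_rank = conn.execute(ranking_sql).fetchall()
print(via_join == via_rank)
```

On this data both queries select the same rows in the same order. Which one is faster depends on the database engine, the indexes, and the table size, so timing both against real data is the only reliable guide.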