问题描述
我目前正在准备面试,这让我想起了我在之前的面试中被问到的一个问题,内容如下:
I'm currently preparing for an interview, and it reminded me of a question I was once asked in a previous interview that went something like this:
您被要求设计一些软件,以连续显示 Google 上的前 10 个搜索词.您可以访问一个提要,该提要提供当前在 Google 上搜索的搜索词的无限实时流.请描述什么算法和用于实现此目的的数据结构.您将设计两种变体:
"You have been asked to design some software to continuously display the top 10 search terms on Google. You are given access to a feed that provides an endless real-time stream of search terms currently being searched on Google. Describe what algorithm and data structures you would use to implement this. You are to design two variations:
(i) 显示所有时间的前 10 个搜索词(即自您开始阅读提要以来).
(i) Display the top 10 search terms of all time (i.e. since you started reading the feed).
(ii) 仅显示过去一个月的前 10 个搜索词,每小时更新一次.
(ii) Display only the top 10 search terms for the past month, updated hourly.
您可以使用近似值来获得前 10 名的列表,但您必须证明您的选择是合理的."
我在这次采访中轰炸了,但仍然不知道如何实现这一点.
You can use an approximation to obtain the top 10 list, but you must justify your choices."
I bombed in this interview and still have really no idea how to implement this.
第一部分要求在无限列表的不断增长的子序列中出现 10 个最频繁的项目.我研究了选择算法,但找不到任何在线版本来解决这个问题.
The first part asks for the 10 most frequent items in a continuously growing sub-sequence of an infinite list. I looked into selection algorithms, but couldn't find any online versions to solve this problem.
第二部分使用了一个有限列表,但由于处理的数据量很大,你无法真正将整个月的搜索词存储在内存中并每小时计算一个直方图.
The second part uses a finite list, but due to the large amount of data being processed, you can't really store the whole month of search terms in memory and calculate a histogram every hour.
由于前 10 名列表不断更新,所以问题变得更加困难,因此您需要以某种方式通过滑动窗口计算前 10 名.
The problem is made more difficult by the fact that the top 10 list is being continuously updated, so somehow you need to be calculating your top 10 over a sliding window.
有什么想法吗?
推荐答案
好吧,看起来数据量很大,存储所有频率的成本可能高得惊人.当数据量太大以至于我们无法将其全部存储起来时,我们就进入了数据流算法的领域.
Well, looks like an awful lot of data, with a perhaps prohibitive cost to store all frequencies. When the amount of data is so large that we cannot hope to store it all, we enter the domain of data stream algorithms.
这方面的有用书籍:Muthukrishnan - 数据流:算法和应用程序"
与我从上面挑选的手头问题密切相关的参考:Manku, Motwani -近似频率对数据流进行计数" [pdf]
Closely related reference to the problem at hand which I picked from the above:Manku, Motwani - "Approximate Frequency Counts over Data Streams" [pdf]
顺便说一下,斯坦福大学的 Motwani(编辑)是非常重要的"随机算法" 一书.. 抱歉,参考文献不好,该特定章节针对的是不同的问题.检查后,我改为推荐 Muthukrishnan 的书 的第 5.1.2 节,可在线获取.
By the way, Motwani, of Stanford, (edit) was an author of the very important "Randomized Algorithms" book. . Sorry, bad reference, that particular chapter is on a different problem. After checking, I instead recommend section 5.1.2 of Muthukrishnan's book, available online.
嘿,不错的面试问题.
这篇关于查找前 10 个搜索词的算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!