Problem description
One of my friends was asked the following question in an interview. Can anyone tell me how to solve it?
We have a fairly large log file, about 5 GB. Each line of the log file contains a URL that a user has visited on our site. We want to find the 100 most popular URLs visited by our users. How can we do it?
Recommended answer
If we have more than 10 GB of RAM, just do it in the straightforward way with a hash map: count the occurrences of each URL in memory and take the 100 URLs with the highest counts.
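A minimal sketch of the in-memory hash map approach, assuming the whole set of distinct URLs fits in RAM. The function name `top_urls` and the parameter `k` are illustrative choices, not part of the original answer.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_urls(lines: Iterable[str], k: int = 100) -> List[Tuple[str, int]]:
    """Count every URL in memory and return the k most frequent ones."""
    # Counter is a dict subclass, i.e. a hash map from URL to count.
    # Blank lines are skipped; surrounding whitespace is stripped.
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common(k)
```

An open file can be passed directly, since iterating a file yields its lines one at a time, for example `top_urls(open("access.log"))` with a hypothetical file name.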
Otherwise, split the log into several smaller files using a hash function on the URL, so that every occurrence of a given URL lands in the same file. Then process each file and get its top 100. With the top 100 from each file, it is easy to get the overall top 100.
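The partitioning idea could be sketched roughly as follows. This is one possible reading of the answer, not a definitive implementation: the name `top_urls_partitioned`, the bucket count `num_buckets`, and the use of `zlib.crc32` as the hash function are all assumptions made for illustration. Because a URL always hashes to the same bucket, each per-bucket count is already the URL's full count, so merging the per-bucket winners gives the overall result.

```python
import heapq
import os
import tempfile
import zlib
from collections import Counter
from typing import Iterable, List, Tuple


def top_urls_partitioned(
    lines: Iterable[str], k: int = 100, num_buckets: int = 16
) -> List[Tuple[str, int]]:
    """Split URLs into bucket files by hash, then merge per-bucket top-k lists."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}.txt") for i in range(num_buckets)]
        buckets = [open(p, "w", encoding="utf-8") for p in paths]
        try:
            for line in lines:
                url = line.strip()
                if url:
                    # The same URL always maps to the same bucket file.
                    idx = zlib.crc32(url.encode("utf-8")) % num_buckets
                    buckets[idx].write(url + "\n")
        finally:
            for b in buckets:
                b.close()

        candidates: List[Tuple[str, int]] = []
        for p in paths:
            # Only one bucket's counts are held in memory at a time.
            with open(p, encoding="utf-8") as f:
                counts = Counter(row.rstrip("\n") for row in f)
            candidates.extend(counts.most_common(k))

    # Pick the overall winners from at most k * num_buckets candidates.
    return heapq.nlargest(k, candidates, key=lambda item: item[1])
```

The number of buckets would need to be chosen so that the distinct URLs of any single bucket fit in memory; 16 is only a placeholder.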
Another solution is to sort the file using any external sorting method, and then scan the sorted file once, counting each run of identical URLs. During the scan you don't have to keep all of the counts: you can safely throw away anything that doesn't make it into the top 100.
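The scan over sorted input might look like the sketch below. It assumes the lines have already been sorted by some external sort (that step is not shown), so identical URLs are adjacent. The name `top_urls_from_sorted` and the bounded min-heap used to hold the current best candidates are illustrative choices, not something the answer prescribes.

```python
import heapq
from itertools import groupby
from typing import Iterable, List, Tuple


def top_urls_from_sorted(
    sorted_lines: Iterable[str], k: int = 100
) -> List[Tuple[str, int]]:
    """Scan already-sorted URLs once, keeping only the k best counts seen so far."""
    if k <= 0:
        return []
    heap: List[Tuple[int, str]] = []  # min-heap of (count, url), size <= k
    urls = (line.strip() for line in sorted_lines)
    for url, run in groupby(urls):
        if not url:
            continue  # ignore blank lines
        count = sum(1 for _ in run)  # length of this run of identical URLs
        if len(heap) < k:
            heapq.heappush(heap, (count, url))
        elif count > heap[0][0]:
            # Evict the weakest candidate; it can no longer reach the top k.
            heapq.heapreplace(heap, (count, url))
    return [(url, count) for count, url in sorted(heap, reverse=True)]
```

Memory use stays proportional to `k` regardless of how many distinct URLs the log contains, which is the point of discarding everything outside the current top list.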