Question
I am doing some data analysis tests, and in the first, really simple one, I got very strange results.
The idea is the following: from an internet access log (a collection with one document per access; 90 million documents for the tests), I want to get the number of accesses per domain (what would be a GROUP BY in MySQL) and find the 10 most accessed domains.
The script I made in JavaScript is really simple:
/* Counts each domain url */
m = function () {
    emit(this.domain, 1);
}
r = function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i]; // sum the emitted counts, not the array indices
    }
    return total;
}
/* Store of visits per domain statistics on NonFTP_Access_log_domain_visits collection */
res = db.NonFTP_Access_log.mapReduce(m, r, { out: { replace : "NonFTP_Access_log_domain_visits" } } );
db.NonFTP_Access_log_domain_visits.ensureIndex({ "value": 1});
db.NonFTP_Access_log_domain_visits.find({}).sort({ "value":-1 }).limit(10).forEach(printjson);
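What the map/reduce pair computes can be sketched in plain JavaScript on a few hypothetical sample documents (the domains below are made up for illustration; the real collection holds one document per access):

```javascript
// In-memory sketch of the map/reduce above: emit(domain, 1) for each
// document, then sum the values per key.
var docs = [
  { domain: "example.com" },
  { domain: "example.com" },
  { domain: "other.net" }
];

var counts = {};
docs.forEach(function (doc) {
  counts[doc.domain] = (counts[doc.domain] || 0) + 1;
});
// counts is now { "example.com": 2, "other.net": 1 }
```

Sorting that object by value and taking the first 10 entries corresponds to the `sort({ value: -1 }).limit(10)` step on the output collection.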
The equivalent in MySQL is:
drop table if exists NonFTP_Access_log_domain_visits;
create table NonFTP_Access_log_domain_visits (
`domain` varchar(255) NOT NULL,
`value` int unsigned not null,
PRIMARY KEY (`domain`),
KEY `value_index` (`value`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
insert into NonFTP_Access_log_domain_visits
select domain, count(*) as value from NonFTP_Access_log group by domain;
select * from NonFTP_Access_log_domain_visits order by value desc limit 10;
Well, MongoDB takes 30 hours to get the results and MySQL 20 minutes! After reading a little, I have come to the conclusion that for data analysis we will have to use Hadoop, as MongoDB is really slow. The answers to questions like this one say that:
- MongoDB uses only a single thread
- JavaScript is too slow
What am I doing wrong? Are these results normal? Should I use Hadoop?
We are running this test in the following environment:
- Operating system: SUSE Linux Enterprise Server 10 (virtual server on Xen)
- Memory: 10 GB
- Cores: 32 (AMD Opteron Processor 6128)
Answer
I've actually answered a very similar question before. The limitations of Map Reduce in MongoDB have been outlined previously - as you mention, it is single threaded, and everything has to be converted to JavaScript (SpiderMonkey) and back, etc.
That is why there are other options:
- The MongoDB Hadoop Connector (officially supported)
- The Aggregation Framework (requires 2.1+)
As of this writing, the 2.2.0 stable release was not yet out, but it was up to RC2, so the release should be imminent. I would recommend giving it a shot as a more meaningful comparison for this type of testing.
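For reference, the GROUP BY from the question maps naturally onto an Aggregation Framework pipeline. This is a sketch of the standard `$group`/`$sort`/`$limit` stages, reusing the collection name from the question; it has not been timed against the 90-million-document data set:

```javascript
// Aggregation Framework sketch of the same query (MongoDB 2.1+).
// In the mongo shell you would run: db.NonFTP_Access_log.aggregate(pipeline)
var pipeline = [
  { $group: { _id: "$domain", value: { $sum: 1 } } }, // count accesses per domain
  { $sort:  { value: -1 } },                          // most visited first
  { $limit: 10 }                                      // keep the top 10 domains
];
```

Because the pipeline runs natively in the server rather than through the JavaScript engine, it avoids the SpiderMonkey conversion overhead mentioned above.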