javascript - 过滤 Node/JavaScript/下划线中的大型数据集

我在私有云中有大量的虚拟机使用记录记录。每小时，都会为我的云中运行的每个VM生成一组记录。所有VM都有一条记录，其中包含诸如RAM，内存之类的规格，并具有一个ID：字段，与使用记录中的virtualmachineid相对应。我可以查询API并获取XML或JSON数据集，我选择了JSON，因为它在网络上更加轻巧。

这是一条记录，有13种类型的用法与磁盘使用率，带宽，运行时间等相对应：

Usage Record:
  { account: 'user_1',
    accountid: 'c22ed7ed-e51a-4782-83a7-c72e2883eb99',
    domainid: 'f88d8bbf-a83b-4be1-a788-e2ab51eb9973',
    zoneid: '4a7f62a8-3248-47ee-bf94-d63dac2a6668',
    description: 'VM2 running time (ServiceOffering: 11) (Template: 215)',
    usage: '1 Hrs',
    usagetype: 1,
    rawusage: '1',
    virtualmachineid: 'f6661f34-4d03-4128-b738-38c330f2499c',
    name: 'VM2',
    offeringid: 'f1d82c2e-25e3-4c97-bae8-b6f916860faf',
    templateid: '2bf2e295-fdd6-4326-a652-6d07581be070',
    usageid: 'f6661f34-4d03-4128-b738-38c330f2499c',
    type: 'XenServer',
    startdate: '2012-12-25\'T\'22:00:00-06:00',
    enddate: '2012-12-25\'T\'22:59:59-06:00' }

我正在尝试做的是：

我需要遍历VM的列表（其中将有数百个），并为每个VM建立前一个周期（通常是一个月）的使用情况报告，但也可以是临时的。因此，每个月每个VM的10000多个使用记录中，我需要计算每种使用类型的总计。

有没有比传统的循环-循环-循环再循环-循环方法更有效，新颖的方法？
用伪代码：

for (each vm in vms)
    for (each usage_record in usage_records)
        if (vm.id === usage_record.vmid)
            switch usage_record.usage_type
                case 1: its runtime
                case 2: its disk usage
                case 3: its some other type of usage
                ...

使用下划线，这是我到目前为止所做的：

_.each(virtualMachines.virtualmachine, function (vm) {
    var recs = _.filter(usageList.usagerecord, function (foo) {
        return (foo.virtualmachineid === vm.id && foo.usagetype === 1);
    });
        console.log("recs count:" + recs.count);
        //now, recs contains all the usage record type 1 for one VM

 });

现在可以正常工作，但是我不相信它已经过优化，并且不会随着VM数量的增加而扩展。对于每个VM，将向数据集添加10,000个其他使用记录。

最佳答案

由于您需要处理每个VM并需要每个VM的结果，因此我首先要按VM排序列表。之后，您只需为“当前VM统计信息”提供一个循环和一个对象。一旦遇到列表中的下一个VM，就知道当前的统计信息已完成。

sortRecordsByVM();
currentStats = { runtime: 0, disk: 0, other: 0 };
currentVM;
for each record
  if currentVM != record.VM
    writeToOutput(currenStats);
    currentStats = { runtime: 0, disk: 0, other: 0 };
  addRecordTo(record, currentStats);
writeToOutput(currenStats);

就是说：我不认为迭代超过10K的记录不会对现代机器造成问题，所以我将从最简单的方法开始，仅在出现性能问题时进行优化。

我只是不使用嵌套循环，而是将查询留给内置数据结构（通常比我倾向于首先尝试编写的任何代码都更优化）：

allStats = {};
for each record
  stats = allStats[record.VM];
  if (!stats)
    stats = { runtime: 0, disk: 0, other: 0 };
  addRecordTo(record, stats);
  allStats[record.VM] = stats;