I'm not even sure of the right terminology to ask this question, but here goes.

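I have a collection on which I am using MapReduce to perform an aggregation task. I can't use the aggregation pipeline because I need to run custom code while reducing.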

This is slightly simplified to make the problem clearer.

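  • I have a collection in which every document contains a location (i.e. a grid cell ID) and a timeslice (represented by the timestamp at the start of that timeslice), along with information such as "number of vehicles"; each location may have thousands of such documents, and there may be several per timeslice.
  • In addition, each location can have documents whose timeslice attribute is null. These hold information about static features and the like: that is, data with no timestamp associated with it.

  • What I want to do is run a map-reduce process whose output documents are keyed by location ID and timeslice and, crucially, in which I can merge the untimed data into the timed data.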

    Here is some sample input (heavily simplified in terms of data, but the cell_id and timeslice values are exactly the ones I have to work with):
    [
      {
        "cell_id": 100,
        "timeslice": "2019-03-20T00:00:00.000Z",
        "num_vehicles": 5,
        "num_residential_units": null,
        "num_commercial_units": null
      },
      {
        "cell_id": 100,
        "timeslice": "2019-03-20T00:00:00.000Z",
        "num_vehicles": 4,
        "num_residential_units": null,
        "num_commercial_units": null
      },
      {
        "cell_id": 100,
        "timeslice": "2019-03-20T00:00:00.000Z",
        "num_vehicles": 1,
        "num_residential_units": null,
        "num_commercial_units": null
      },
      {
        "cell_id": 100,
        "timeslice": "2019-03-21T00:00:00.000Z",
        "num_vehicles": 7,
        "num_residential_units": null,
        "num_commercial_units": null
      },
      {
        "cell_id": 100,
        "timeslice": "2019-03-21T00:00:00.000Z",
        "num_vehicles": 2,
        "num_residential_units": null,
        "num_commercial_units": null
      },
      {
        "cell_id": 100,
        "timeslice": null,
        "num_vehicles": null,
        "num_residential_units": 30,
        "num_commercial_units": 12
      },
      {
        "cell_id": 101,
        "timeslice": "2019-03-20T00:00:00.000Z",
        "num_vehicles": 5,
        "num_residential_units": null,
        "num_commercial_units": null
      },
      {
        "cell_id": 101,
        "timeslice": "2019-03-21T00:00:00.000Z",
        "num_vehicles": 1,
        "num_residential_units": null,
        "num_commercial_units": null
      },
      {
        "cell_id": 101,
        "timeslice": "2019-03-21T00:00:00.000Z",
        "num_vehicles": 2,
        "num_residential_units": null,
        "num_commercial_units": null
      },
      {
        "cell_id": 101,
        "timeslice": "2019-03-21T00:00:00.000Z",
        "num_vehicles": 1,
        "num_residential_units": null,
        "num_commercial_units": null
      },
      {
        "cell_id": 101,
        "timeslice": "2019-03-21T00:00:00.000Z",
        "num_vehicles": 0,
        "num_residential_units": null,
        "num_commercial_units": null
      },
      {
        "cell_id": 101,
        "timeslice": null,
        "num_vehicles": null,
        "num_residential_units": 8,
        "num_commercial_units": 1
      }
    ]
    

    ...and the output I would like that input to produce (I haven't split it into _id and value, but essentially cell_id and timeslice would form the _id):
    [
      {
        "cell_id": 100,
        "timeslice": null,
        "num_vehicles": null,
        "num_residential_units": 30,
        "num_commercial_units": 12
      },
      {
        "cell_id": 100,
        "timeslice": "2019-03-20T00:00:00.000Z",
        "num_vehicles": 10,
        "num_residential_units": 30,
        "num_commercial_units": 12
      },
      {
        "cell_id": 100,
        "timeslice": "2019-03-21T00:00:00.000Z",
        "num_vehicles": 9,
        "num_residential_units": 30,
        "num_commercial_units": 12
      },
      {
        "cell_id": 101,
        "timeslice": null,
        "num_vehicles": null,
        "num_residential_units": 8,
        "num_commercial_units": 1
      },
      {
        "cell_id": 101,
        "timeslice": "2019-03-20T00:00:00.000Z",
        "num_vehicles": 5,
        "num_residential_units": 8,
        "num_commercial_units": 1
      },
      {
        "cell_id": 101,
        "timeslice": "2019-03-21T00:00:00.000Z",
        "num_vehicles": 4,
        "num_residential_units": 8,
        "num_commercial_units": 1
      }
    ]
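
To pin down the intended semantics, the merge can be sketched in plain JavaScript, with no MongoDB involved. This is only an in-memory reference under stated assumptions: the function name mergeUntimed is made up for illustration, untimed documents carry an explicit null timeslice as in the sample, and each cell has at most one untimed document.

```javascript
// Reference semantics only: sums num_vehicles per (cell_id, timeslice) and
// copies the cell's untimed fields onto every timed row.
function mergeUntimed(docs) {
  const untimed = new Map(); // cell_id -> static fields
  const timed = new Map();   // "cell_id|timeslice" -> summed row

  for (const d of docs) {
    if (d.timeslice === null) {
      untimed.set(d.cell_id, {
        num_residential_units: d.num_residential_units,
        num_commercial_units: d.num_commercial_units,
      });
    } else {
      const key = `${d.cell_id}|${d.timeslice}`;
      const row = timed.get(key) ||
        { cell_id: d.cell_id, timeslice: d.timeslice, num_vehicles: 0 };
      row.num_vehicles += d.num_vehicles;
      timed.set(key, row);
    }
  }

  const out = [];
  // Untimed rows are kept as their own output documents.
  for (const [cellId, statics] of untimed) {
    out.push({ cell_id: cellId, timeslice: null, num_vehicles: null, ...statics });
  }
  // Timed rows get the static fields of their cell (null if the cell has none).
  for (const row of timed.values()) {
    const statics = untimed.get(row.cell_id) ||
      { num_residential_units: null, num_commercial_units: null };
    out.push({ ...row, ...statics });
  }
  return out;
}
```

The join happens only after all documents have been seen, so the order in which timed and untimed documents arrive does not matter.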
    

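    If the emit stage keys the emitted documents by location and time, then all the timed data arrives correctly in the reduce function, and the untimed data gets reduced on its own... but I somehow also need to merge that untimed data into every reduced timed document. Is there some way to do this in the finalize stage, or some clever way to set up the key...? I'm stumped. Frankly, it doesn't matter to me whether the solution involves map-reduce, but it must be efficient at scale on limited hardware.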
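
One possible keying trick, sketched here as an assumption rather than something taken from the answer: key on cell_id alone, so that timed and untimed documents of a cell meet in the same reduce call, carry the per-timeslice sums inside the value, and expand them into rows in finalize. The names mapFn, reduceFn, finalizeFn and runLocally are made up for illustration, and the sketch assumes timeslice is stored as an ISO string (as in the sample) and that each cell has at most one untimed document. runLocally is only a local stand-in for the server so the functions can be tried without MongoDB.

```javascript
// map: runs with `this` bound to a document; emit() is provided by MongoDB.
function mapFn() {
  var value = { slices: {}, num_residential_units: null, num_commercial_units: null };
  if (this.timeslice === null) {
    value.num_residential_units = this.num_residential_units;
    value.num_commercial_units = this.num_commercial_units;
  } else {
    value.slices[this.timeslice] = this.num_vehicles;
  }
  emit(this.cell_id, value);
}

// reduce: returns the same shape that map emits, so it can be re-applied
// to its own output.
function reduceFn(key, values) {
  var out = { slices: {}, num_residential_units: null, num_commercial_units: null };
  values.forEach(function (v) {
    Object.keys(v.slices).forEach(function (ts) {
      out.slices[ts] = (out.slices[ts] || 0) + v.slices[ts];
    });
    if (v.num_residential_units !== null) out.num_residential_units = v.num_residential_units;
    if (v.num_commercial_units !== null) out.num_commercial_units = v.num_commercial_units;
  });
  return out;
}

// finalize: one untimed row first, then one row per timeslice, each carrying
// the cell's static fields.
function finalizeFn(key, value) {
  var statics = {
    num_residential_units: value.num_residential_units,
    num_commercial_units: value.num_commercial_units
  };
  var rows = [Object.assign({ cell_id: key, timeslice: null, num_vehicles: null }, statics)];
  Object.keys(value.slices).sort().forEach(function (ts) {
    rows.push(Object.assign({ cell_id: key, timeslice: ts, num_vehicles: value.slices[ts] }, statics));
  });
  return { rows: rows };
}

// Hypothetical local driver: groups emitted values by key, reduces keys that
// have more than one value, then finalizes.
function runLocally(docs) {
  var groups = new Map();
  globalThis.emit = function (k, v) {
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(v);
  };
  docs.forEach(function (d) { mapFn.call(d); });
  var results = [];
  groups.forEach(function (values, k) {
    var reduced = values.length > 1 ? reduceFn(k, values) : values[0];
    results.push({ _id: k, value: finalizeFn(k, reduced) });
  });
  return results;
}
```

Against a real server the same three functions would be passed as db.collection.mapReduce(mapFn, reduceFn, { out: { inline: 1 }, finalize: finalizeFn }). The result is one document per cell with the rows in value.rows, so getting one document per cell and timeslice needs a further flattening step, and each per-cell result has to stay within the 16 MB document limit.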

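    Best answer

    You can try something like this.

    The query below picks up the rows with a non-null timeslice and then groups them to get the aggregated values. Once you have the aggregated documents, you join back to the same collection to pull in the untimed rows.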

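    db.collection.aggregate([
      // Keep only the timed documents.
      {"$match":{"timeslice":{"$ne":null}}},
      // Sum the vehicles per (cell_id, timeslice).
      {"$group":{
          "_id":{"cell_id":"$cell_id","timeslice":"$timeslice"},
          "num_vehicles":{"$sum":"$num_vehicles"}
      }},
      // Self-join: fetch every document of the same cell.
      {"$lookup":{
         "from":"collection",
         "localField":"_id.cell_id",
         "foreignField":"cell_id",
         "as":"untimed_doc"
       }},
      // One output document per joined document...
      {"$unwind":"$untimed_doc"},
      // ...then keep only the pairing with the untimed document.
      {"$match":{"untimed_doc.timeslice":{"$eq":null}}}
    ])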
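
A possible variant of the same idea, offered as a sketch rather than as part of the original answer: the pipeline form of $lookup (available from MongoDB 3.6) filters for the untimed document inside the join, so the timed documents of the cell are never unwound, and a final $project flattens the result into the shape asked for. It assumes untimed documents store timeslice as an explicit null, as in the sample, and that the collection is named "collection". It produces only the timed rows; the untimed rows themselves would have to be added separately.

```javascript
const pipeline = [
  // Keep only the timed documents.
  { $match: { timeslice: { $ne: null } } },
  // Sum the vehicles per (cell_id, timeslice).
  { $group: {
      _id: { cell_id: "$cell_id", timeslice: "$timeslice" },
      num_vehicles: { $sum: "$num_vehicles" }
  } },
  // Join only the untimed document(s) of the same cell.
  { $lookup: {
      from: "collection",
      let: { cell: "$_id.cell_id" },
      pipeline: [
        { $match: { $expr: { $and: [
            { $eq: ["$cell_id", "$$cell"] },
            { $eq: ["$timeslice", null] }
        ] } } },
        { $project: { _id: 0, num_residential_units: 1, num_commercial_units: 1 } }
      ],
      as: "untimed_doc"
  } },
  // Keep timed rows even when the cell has no untimed document.
  { $unwind: { path: "$untimed_doc", preserveNullAndEmptyArrays: true } },
  // Flatten into cell_id / timeslice / counts.
  { $project: {
      _id: 0,
      cell_id: "$_id.cell_id",
      timeslice: "$_id.timeslice",
      num_vehicles: 1,
      num_residential_units: "$untimed_doc.num_residential_units",
      num_commercial_units: "$untimed_doc.num_commercial_units"
  } }
];
```

In the shell this would be run as db.collection.aggregate(pipeline). Whether it is fast enough on limited hardware would need to be measured on the real data.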
    

    Regarding "MongoDB Map-Reduce : One document that needs to be incorporated into all others matching a condition?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/60269664/
