Problem description
My use case is as follows: I have a collection of documents in MongoDB that I have to send for analysis. The documents have the following format:
{ _id: ObjectId("517e769164702dacea7c40d8"), date: "1359911127494", status: "available", other_fields... }
I have a reader process that picks the first 100 documents with status:available, sorted by date, and sets them to status:processing. The reader process then sends the documents for analysis. Once the analysis is complete, the status is changed to processed.
Currently, the reader process first fetches 100 documents sorted by date and then updates the status of each document to processing in a loop. Is there a better or more efficient solution for this case?
Also, for scalability, we might run more than one reader process in the future. In that case, the 100 documents picked by one reader process should not be picked by another reader process. But fetching and updating are separate queries right now, so it is quite possible for multiple reader processes to pick the same documents.
A bulk findAndModify (with a limit) would solve all these problems, but unfortunately MongoDB does not provide it yet. Is there any solution to this problem?
Recommended answer
As you mention, there is currently no clean way to do what you want. The best approach at this time for operations like the one you need is this:
1) The reader selects X documents with the appropriate limit and sorting.
2) The reader marks the documents returned by 1) with its own unique reader ID, e.g.
   update({_id:{$in:[<result set ids>]}, state:"available", $isolated:1}, {$set:{readerId:<your reader's ID>, state:"processing"}}, false, true)
3) The reader selects all documents marked as processing that carry its own reader ID. At this point it is guaranteed that you have exclusive access to the resulting set of documents.
4) Offer the result set from 3) for your processing.
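The select, mark, and re-select steps above can be sketched as a single helper. This is a minimal sketch written in mongo shell style (synchronous calls, as in mongosh), not a definitive implementation. The function name `reserveBatch` and its parameters are assumptions for illustration; the field names `state` and `readerId` follow the update example in the answer, and `date` follows the question.

```javascript
// Sketch for the mongo shell. `coll` is a collection handle, for example
// db.getCollection("documents") (hypothetical collection name).
function reserveBatch(coll, readerId, limit) {
  // Step 1: pick candidate documents, oldest first.
  const ids = coll
    .find({ state: "available" }, { _id: 1 })
    .sort({ date: 1 })
    .limit(limit)
    .toArray()
    .map((doc) => doc._id);
  if (ids.length === 0) {
    return [];
  }

  // Step 2: claim only the candidates that are still available.
  // A candidate claimed by another reader in the meantime no longer
  // matches state:"available" and is left untouched.
  coll.updateMany(
    { _id: { $in: ids }, state: "available" },
    { $set: { readerId: readerId, state: "processing" } }
  );

  // Step 3: read back everything this reader currently owns.
  return coll
    .find({ readerId: readerId, state: "processing" })
    .sort({ date: 1 })
    .toArray();
}
```

A reader would call something like `reserveBatch(db.getCollection("documents"), "reader-1", 100)` and process only the documents it gets back, which may be fewer than the limit if another reader claimed some of the candidates first.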
Note that this works even in highly concurrent situations, because a reader can never reserve documents that another reader has already reserved (step 2 can only reserve documents that are currently available, and writes are atomic). I would also add a timestamp with the reservation time if you want to be able to time out reservations (for example, for scenarios where readers might crash or fail).
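One way the reservation timeout could look is sketched below, again in mongo shell style. It assumes that the marking step also sets a `reservedAt` Date on every document it claims. That field name, the function name `releaseStaleReservations`, and the `maxAgeMs` parameter are hypothetical; the answer only suggests storing a timestamp.

```javascript
// Sketch: return documents to the pool when their reservation is too old.
// Assumes each claimed document carries a `reservedAt` Date
// (hypothetical field, set together with readerId when the document is marked).
function releaseStaleReservations(coll, maxAgeMs, now = new Date()) {
  // Reservations made before this moment are considered abandoned.
  const cutoff = new Date(now.getTime() - maxAgeMs);
  const result = coll.updateMany(
    { state: "processing", reservedAt: { $lt: cutoff } },
    {
      $set: { state: "available" },
      $unset: { readerId: "", reservedAt: "" },
    }
  );
  // Number of documents that were made available again.
  return result.modifiedCount;
}
```

Running this periodically, for example with a `maxAgeMs` comfortably longer than the slowest expected analysis, would let documents held by a crashed reader be picked up again.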
More details:
All write operations can occasionally yield to pending operations if the write takes a relatively long time. This means that step 3) might not see all documents marked in step 2) unless you take the following steps:
- Use an appropriate "w" (write concern) value, meaning 1 or higher. This ensures that the connection on which the write operation is invoked waits for it to complete, regardless of whether it yields.
- Make sure you do the read in step 3) on the same connection (only relevant for replica sets with slaveOk-enabled reads) or thread, so that the operations are guaranteed to be sequential. The former can be done in most drivers with the "requestStart" and "requestDone" methods or similar (see the Java driver documentation).
- Add the $isolated flag to your multi-updates to ensure they cannot be interleaved with other write operations.
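As a rough illustration of the write concern point, the marking update can pass an explicit `w` value through the options argument of `updateMany`. This is a sketch in mongo shell style; the function name `markReserved` and its parameters are assumptions. The `$isolated` flag is left out here because, as far as I know, MongoDB 4.0 and later no longer accept it, so treat that part of the advice as specific to older servers.

```javascript
// Sketch: mark the candidate documents with an explicit write concern.
// `w` defaults to 1, the minimum the answer recommends.
function markReserved(coll, ids, readerId, w = 1) {
  const result = coll.updateMany(
    { _id: { $in: ids }, state: "available" },
    { $set: { readerId: readerId, state: "processing" } },
    { writeConcern: { w: w } }
  );
  // How many documents this reader actually claimed. This can be lower
  // than ids.length if another reader claimed some of them first.
  return result.modifiedCount;
}
```

Comparing the returned count with the number of candidate IDs gives a cheap signal of how much contention there is between readers.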
Also see the comments for a discussion of atomicity and isolation. I incorrectly assumed multi-updates were isolated. They are not, or at least not by default.