Problem description
I have time series data in MongoDB as follows:
{
"_id" : ObjectId("558912b845cea070a982d894"),
"code" : "ZL0KOP",
"time" : NumberLong("1420128024000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d895"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420128025000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d896"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420128003000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d897"),
"code" : "ZL0KOP",
"time" : NumberLong("1420041724000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d89e"),
"code" : "YBUHCW",
"time" : NumberLong("1420041732000"),
"direction" : "10",
"siteId" : "0002"
}
{
"_id" : ObjectId("558912b845cea070a982d8a1"),
"code" : "U48AIW",
"time" : NumberLong("1420041729000"),
"direction" : "10",
"siteId" : "0002"
}
{
"_id" : ObjectId("558912b845cea070a982d8a0"),
"code" : "OJ3A06",
"time" : NumberLong("1420300927000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d89d"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420300885000"),
"direction" : "10",
"siteId" : "0003"
}
{
"_id" : ObjectId("558912b845cea070a982d8a2"),
"code" : "ZLV05H",
"time" : NumberLong("1420300922000"),
"direction" : "10",
"siteId" : "0001"
}
{
"_id" : ObjectId("558912b845cea070a982d8a3"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420300928000"),
"direction" : "10",
"siteId" : "0000"
}
I need to pick out the codes that match two or more conditions. For example:
condition1: 1420128000000 < time < 1420128030000,siteId == 0000
condition2: 1420300880000 < time < 1420300890000,siteId == 0003
Result of the first condition:
{
"_id" : ObjectId("558912b845cea070a982d894"),
"code" : "ZL0KOP",
"time" : NumberLong("1420128024000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d895"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420128025000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d896"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420128003000"),
"direction" : "10",
"siteId" : "0000"
}
Result of the second condition:
{
"_id" : ObjectId("558912b845cea070a982d89d"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420300885000"),
"direction" : "10",
"siteId" : "0003"
}
The only code that matches all the conditions above should be:
{"code" : "AQ0ZSQ", "count":2}
"count" means that the code "AQ0ZSQ" appeared in both conditions.
The only solution I can think of is to use two queries. For example, using Python:
result1 = list(db.codes.objects({'time': {'$gt': 1420128000000, '$lt': 1420128030000}, 'siteId': '0000'}).only("code"))
result2 = list(db.codes.objects({'time': {'$gt': 1420300880000, '$lt': 1420300890000}, 'siteId': '0003'}).only("code"))
and then find the codes shared by both results.
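A minimal sketch of that client-side intersection, using plain Python lists in place of real query results. The list contents are copied from the two sample results shown above; the variable names are only illustrative.

```python
# Codes returned by each query, copied from the sample results above.
result1 = ["ZL0KOP", "AQ0ZSQ", "AQ0ZSQ"]  # first condition
result2 = ["AQ0ZSQ"]                      # second condition

# Deduplicate each result, then keep only the codes present in both.
shared = set(result1) & set(result2)
print(shared)  # {'AQ0ZSQ'}
```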
The problem is that there are millions of documents in the collection, and both queries can easily exceed the 16 MB limit.
So is it possible to do this in one query? Or should I change the document structure?
Recommended answer
What you are asking for here requires the aggregation framework, so that the intersection between the results is calculated on the server.
The first part of the logic is that you need an $or query for the two conditions; after that there is some additional projection and filtering on those results:
db.collection.aggregate([
    // Fetch all possible documents for consideration
    { "$match": {
        "$or": [
            {
                "time": { "$gt": 1420128000000, "$lt": 1420128030000 },
                "siteId": "0000"
            },
            {
                "time": { "$gt": 1420300880000, "$lt": 1420300890000 },
                "siteId": "0003"
            }
        ]
    }},
    // Logically compare the conditions against results and add a score
    { "$project": {
        "code": "$code",
        "score": { "$add": [
            { "$cond": [
                { "$and": [
                    { "$gt": [ "$time", 1420128000000 ] },
                    { "$lt": [ "$time", 1420128030000 ] },
                    { "$eq": [ "$siteId", "0000" ] }
                ]},
                1,
                0
            ]},
            { "$cond": [
                { "$and": [
                    { "$gt": [ "$time", 1420300880000 ] },
                    { "$lt": [ "$time", 1420300890000 ] },
                    { "$eq": [ "$siteId", "0003" ] }
                ]},
                1,
                0
            ]}
        ]}
    }},
    // Now group the results by "code"
    { "$group": {
        "_id": "$code",
        "score": { "$sum": "$score" }
    }},
    // Now filter to keep only results with score 2
    { "$match": { "score": 2 } }
])
So let's break that down and see how it works.
First you want a query with $match to get all the possible documents for "all" of your "intersection" conditions. That is what the $or expression allows here, since matched documents only have to meet either set of conditions. You need all of them to work out the "intersection".
In the second pipeline stage, $project, a boolean test of your conditions is performed against each set. Notice that the usage of $and here, as well as of the other boolean operators of the aggregation framework, is slightly different from the query form.
In the aggregation framework form (outside of $match, which uses normal query operators) these operators take an array of arguments, typically representing "two" values for comparison, rather than the operation being assigned to the "right" of the field name.
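To make the difference between the two forms concrete, here is a small sketch in Python that rewrites a query-form range into the array-argument expression form. The helper name `to_expression` is hypothetical and not part of any MongoDB driver; it only handles simple comparison operators such as $gt and $lt.

```python
def to_expression(field, query_ops):
    """Hypothetical helper: turn {"$gt": a, "$lt": b} on one field into
    the array-argument form used inside stages such as $project."""
    return {"$and": [{op: ["$" + field, value]} for op, value in query_ops.items()]}

# Query form, as used in $match: the operator sits to the right of the field name.
query_form = {"time": {"$gt": 1420128000000, "$lt": 1420128030000}}

# Expression form: each operator takes an array of the two values to compare.
expr_form = to_expression("time", query_form["time"])
print(expr_form)
```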
Since these conditions are logical or "boolean", we want to return the result as "numeric" rather than as a true/false value. This is what $cond does here. Where the condition is true for the document inspected, a score of 1 is emitted; otherwise it is 0.
Finally, in this $project expression both of your conditions are wrapped with $add to form the "score" result. So if none of the conditions were true (not possible after the $match) the score would be 0; if "one" is true then 1; and where "both" are true then 2.
Note that the specific conditions asked for here will never score above 1 for a single document, since no document can fall within both time ranges or have "two" "siteId" values.
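The per-document scoring can be mirrored in plain Python as a rough sketch. The `score` function below is an illustration of what the $project stage computes, not MongoDB code, and the three documents are a subset of the sample data (the last one would already have been dropped by the $match).

```python
def score(doc):
    """Mirror of the $project stage: each $cond contributes 1 or 0."""
    in_a = 1420128000000 < doc["time"] < 1420128030000 and doc["siteId"] == "0000"
    in_b = 1420300880000 < doc["time"] < 1420300890000 and doc["siteId"] == "0003"
    return (1 if in_a else 0) + (1 if in_b else 0)  # the $add of both $cond results

docs = [
    {"code": "ZL0KOP", "time": 1420128024000, "siteId": "0000"},
    {"code": "AQ0ZSQ", "time": 1420300885000, "siteId": "0003"},
    {"code": "OJ3A06", "time": 1420300927000, "siteId": "0000"},
]

scores = [(d["code"], score(d)) for d in docs]
print(scores)  # [('ZL0KOP', 1), ('AQ0ZSQ', 1), ('OJ3A06', 0)]
```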
Now the important part is to $group by the "code" value and $sum the score value to get a total per "code".
This leaves the final $match filter stage of the pipeline to keep only those documents with a "score" value equal to the number of conditions you asked for, in this case 2.
There is a flaw there, however: where more than one document with the same "code" matches either condition (as is the case here), the "score" will be incorrect.
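That flaw can be seen by tracing the first pipeline over the sample documents in plain Python. This is only a simulation of the $match, $project, $group and final $match stages under the two conditions from the question; the tuple layout and helper names are assumptions made for the sketch.

```python
from collections import Counter

COND_A = (1420128000000, 1420128030000, "0000")
COND_B = (1420300880000, 1420300890000, "0003")

def matches(doc, cond):
    low, high, site = cond
    _, time, site_id = doc
    return low < time < high and site_id == site

# (code, time, siteId) for the ten sample documents in the question.
docs = [
    ("ZL0KOP", 1420128024000, "0000"), ("AQ0ZSQ", 1420128025000, "0000"),
    ("AQ0ZSQ", 1420128003000, "0000"), ("ZL0KOP", 1420041724000, "0000"),
    ("YBUHCW", 1420041732000, "0002"), ("U48AIW", 1420041729000, "0002"),
    ("OJ3A06", 1420300927000, "0000"), ("AQ0ZSQ", 1420300885000, "0003"),
    ("ZLV05H", 1420300922000, "0001"), ("AQ0ZSQ", 1420300928000, "0000"),
]

totals = Counter()
for doc in docs:
    s = int(matches(doc, COND_A)) + int(matches(doc, COND_B))  # $project score
    if s:                    # $match with $or keeps only matching documents
        totals[doc[0]] += s  # $group by code with $sum

print(dict(totals))                              # {'ZL0KOP': 1, 'AQ0ZSQ': 3}
print([c for c, s in totals.items() if s == 2])  # []
```

"AQ0ZSQ" matches the first condition twice and the second once, so it totals 3 and the final filter on 2 rejects it.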
So after that introduction to the principles of using logical operators in aggregation, you can fix the fault by essentially "tagging" each result logically according to which condition "set" it belongs to. Then you can simply consider which "code" appeared in "both" sets:
db.collection.aggregate([
    { "$match": {
        "$or": [
            {
                "time": { "$gt": 1420128000000, "$lt": 1420128030000 },
                "siteId": "0000"
            },
            {
                "time": { "$gt": 1420300880000, "$lt": 1420300890000 },
                "siteId": "0003"
            }
        ]
    }},
    // If it's the first logical condition it's "A" otherwise it can
    // only be the other, therefore "B". Extend for more sets as needed.
    { "$group": {
        "_id": {
            "code": "$code",
            "type": { "$cond": [
                { "$and": [
                    { "$gt": [ "$time", 1420128000000 ] },
                    { "$lt": [ "$time", 1420128030000 ] },
                    { "$eq": [ "$siteId", "0000" ] }
                ]},
                "A",
                "B"
            ]}
        }
    }},
    // Simply add up the results for each "type"
    { "$group": {
        "_id": "$_id.code",
        "score": { "$sum": 1 }
    }},
    // Now filter to keep only results with score 2
    { "$match": { "score": 2 } }
])
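The tagged pipeline can be traced the same way in plain Python. This is a sketch of the logic only, run over the sample documents from the question; the `tag` helper and the tuple layout are assumptions made for illustration.

```python
# (code, time, siteId) for the ten sample documents in the question.
docs = [
    ("ZL0KOP", 1420128024000, "0000"), ("AQ0ZSQ", 1420128025000, "0000"),
    ("AQ0ZSQ", 1420128003000, "0000"), ("ZL0KOP", 1420041724000, "0000"),
    ("YBUHCW", 1420041732000, "0002"), ("U48AIW", 1420041729000, "0002"),
    ("OJ3A06", 1420300927000, "0000"), ("AQ0ZSQ", 1420300885000, "0003"),
    ("ZLV05H", 1420300922000, "0001"), ("AQ0ZSQ", 1420300928000, "0000"),
]

def tag(doc):
    """Return "A" or "B" for the condition set a document belongs to,
    or None when it matches neither (such documents never pass $match)."""
    _, time, site_id = doc
    if 1420128000000 < time < 1420128030000 and site_id == "0000":
        return "A"
    if 1420300880000 < time < 1420300890000 and site_id == "0003":
        return "B"
    return None

# First $group: one entry per distinct (code, type) pair.
pairs = {(doc[0], tag(doc)) for doc in docs if tag(doc) is not None}

# Second $group: count how many distinct types each code appeared in.
score = {}
for code, _ in pairs:
    score[code] = score.get(code, 0) + 1

# Final $match: keep codes seen in both sets.
result = sorted(code for code, s in score.items() if s == 2)
print(result)  # ['AQ0ZSQ']
```

Because duplicates collapse in the first grouping, "AQ0ZSQ" now scores exactly 2 even though it matched the first condition twice.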
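This may be a lot to take in if it is your first time using the aggregation framework. Please take the time to look into the operators used here, and at the aggregation pipeline operators in general.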
Beyond simple data selection, this is the tool you should be reaching for most often when using MongoDB, so you would do well to understand all the operations that are possible.