Problem description
I have a dump of a Firebase database representing our Users table stored in JSON. I want to run some data analysis on it, but the issue is that it's too big to load into memory completely and manipulate with pure JavaScript (or _ and similar libraries).
Up until now I've been using the JSONStream package to deal with my data in bite-sized chunks (it calls a callback once for each user in the JSON dump).
I've now hit a roadblock, though, because I want to filter my user IDs based on their values. The questions I'm trying to answer are of the form "Which users x?", whereas previously I was just asking "How many users x?" and didn't need to know who they were.
The data format is like this:

    {
      users: {
        123: { foo: 4 },
        567: { foo: 8 }
      }
    }

What I want to do is essentially get the user ID (123 or 567 in the above) based on the value of foo. Now, if this were a small list, it would be trivial to use something like _.each to iterate over the keys and values and extract the keys I want.
Unfortunately, since it doesn't fit into memory, that doesn't work. With JSONStream I can iterate over it by using

    var parser = JSONStream.parse('users.*');

and piping it into a function that deals with it like this:

    var stream = fs.createReadStream('my.json');
    stream.pipe(parser);
    parser.on('data', function(user) {
      // user is equal to { foo: bar } here
      // so it is trivial to do my filter
      // but I don't know which user ID owns the data
    });

But the problem is that I don't have access to the key representing the star wildcard that I passed into JSONStream.parse. In other words, I don't know if { foo: bar } represents user 123 or user 567.
The question is twofold:
- How can I get the current path from within my callback?
- Is there a better way to be dealing with this JSON data that is too big to fit into memory?
Solution
I went ahead and edited JSONStream to add this functionality.
If anyone runs across this and wants to patch it similarly, you can replace line 83, which was previously

    stream.queue(this.value[this.key])

with this:

    var ret = {};
    ret[this.key] = this.value[this.key];
    stream.queue(ret);

In the code sample from the original question, rather than user being equal to { foo: bar } in the callback, it will now be { uid: { foo: bar } }.
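To make the new shape concrete, here is a minimal sketch of how a 'data' callback could consume { uid: { foo: bar } } entries to answer a "which users" question. It does not use JSONStream itself: wrapWithKey and makeCollector are hypothetical helper names introduced only for illustration, and the two onData calls merely simulate what the patched parser would emit for the sample data above.

```javascript
// Hypothetical helper: mirrors what the patched line queues,
// i.e. a single-key object of the form { key: value }.
function wrapWithKey(key, value) {
  var ret = {};
  ret[key] = value;
  return ret;
}

// Hypothetical 'data' handler factory: collects the IDs of users whose
// record satisfies `predicate`. Only the matching IDs are kept in memory.
function makeCollector(predicate) {
  var matches = [];
  function onData(entry) {
    // Each entry has exactly one key: the user ID.
    var uid = Object.keys(entry)[0];
    if (predicate(entry[uid])) {
      matches.push(uid);
    }
  }
  return { onData: onData, matches: matches };
}

// Simulate the two events the patched parser would emit for the sample data.
var collector = makeCollector(function (user) { return user.foo > 5; });
collector.onData(wrapWithKey('123', { foo: 4 }));
collector.onData(wrapWithKey('567', { foo: 8 }));
console.log(collector.matches); // [ '567' ]
```

With the patched parser in place, the same handler would presumably be attached with parser.on('data', collector.onData), and collector.matches read once the stream has ended.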
Since this is a breaking change, I didn't submit a pull request back to the original project, but I did leave it in the issues in case they want to add a flag or option for this in the future.