Problem Description
I have a 0.7 GB MongoDB database containing tweets that I'm trying to load into a DataFrame. However, I get an error:
MemoryError:
My code is as follows:
from pandas import DataFrame

cursor = tweets.find()  # tweets is my collection
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)
I've tried the methods in the following answers, which at some point create a list of all the elements of the database before loading it:
- https://stackoverflow.com/a/17805626/2297475
- https://stackoverflow.com/a/16255680/2297475
However, another answer discussing list() notes that it is only suitable for small data sets, because everything is loaded into memory.
In my case, I think that is the source of the error: there is simply too much data to load into memory. What other method can I use?
Recommended Answer
I've modified my code to the following:
from pandas import DataFrame

cursor = tweets.find(fields=['id'])
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)
By adding the fields parameter to find(), I restricted the output: instead of loading every field of every document, only the selected fields go into the DataFrame. Everything works fine now. (Note that in PyMongo 3.0 and later the fields parameter was renamed projection.)
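A projection keeps each row small, but list(cursor) still materializes every document at once. If the collection still does not fit in memory, the cursor can instead be consumed in chunks. Below is a minimal sketch under stated assumptions: the helper name frames_from_cursor and the chunk_size value are illustrative, and the projection syntax in the comment is the PyMongo 3+ form.

```python
from itertools import islice
import pandas as pd

def frames_from_cursor(cursor, fields, chunk_size=10000):
    """Yield DataFrames built from successive chunks of a PyMongo cursor
    (or any iterable of dicts), so the full result set is never held
    in memory as one giant list."""
    it = iter(cursor)
    while True:
        # Pull at most chunk_size documents off the cursor.
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        yield pd.DataFrame(chunk, columns=fields)

# Against a real collection this would look something like
# (projection replaces the older fields parameter in PyMongo 3+):
#
#   cursor = tweets.find({}, projection={'id': 1, '_id': 0})
#   result = pd.concat(frames_from_cursor(cursor, ['id']),
#                      ignore_index=True)
```

Each intermediate DataFrame is bounded by chunk_size rows; if a single concatenated result is still too large, the chunks can be processed and discarded one at a time instead of concatenated.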