Problem description
I'm collecting Twitter data (tweets + metadata) into a MongoDB server. Now I want to do some statistical analysis. To get the data from MongoDB into a Pandas data frame, I used the following code:
cursor = collection.find({},{'id': 1, 'text': 1})
tweet_fields = ['id', 'text']
result = pd.DataFrame(list(cursor), columns = tweet_fields)
This way I successfully loaded the data into Pandas, which is great. Now I want to do some analysis on the users that created the tweets, which is also data I collected. This data is located in a nested part of the JSON (I'm not 100% sure if this is true JSON), for instance user.id, which is the ID of the Twitter user account.
I can add that field to the query using dot notation:
cursor = collection.find({},{'id': 1, 'text': 1, 'user.id': 1})
But this results in NaN for that column. I found that the problem lies in the way the data is structured:
Part of the cursor output without user.id:
[{'_id': ObjectId('561547ae5371c0637f57769e'),
'id': 651795711403683840,
'text': 'Video: Zuuuu gut! Caro Korneli besucht für extra 3 Pegida Via KFMW http://t.co/BJX5GKrp7s'},
{'_id': ObjectId('561547bf5371c0637f5776ac'),
'id': 651795781557583872,
'text': 'Iets voor werkloze xenofobe PVV-ers, (en dat zijn waarschijnlijk wel de meeste).........Ze zoeken bij Frontex een paar honderd grenswachten.'},
{'_id': ObjectId('561547ab5371c0637f57769c'),
'id': 651795699881889792,
'text': 'RT @ansichtssache47: Geht gefälligst arbeiten, die #Flüchtlinge haben Hunger! http://t.co/QxUYfFjZB5 #grenzendicht #rente #ZivilerUngehorsa…'}]
Part of the cursor output with user.id:
[{'_id': ObjectId('561547ae5371c0637f57769e'),
'id': 651795711403683840,
'text': 'Video: Zuuuu gut! Caro Korneli besucht für extra 3 Pegida Via KFMW http://t.co/BJX5GKrp7s',
'user': {'id': 223528499}},
{'_id': ObjectId('561547bf5371c0637f5776ac'),
'id': 651795781557583872,
'text': 'Iets voor werkloze xenofobe PVV-ers, (en dat zijn waarschijnlijk wel de meeste).........Ze zoeken bij Frontex een paar honderd grenswachten.',
'user': {'id': 3544739837}}]
So in short, I don't understand how to get the nested part of my collected data into a separate column of my Pandas data frame.
Recommended answer
I use a function like this to get nested JSON lines into a dataframe. It uses the handy pandas json_normalize function:
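import json

import pandas as pd
from bson import json_util
# pandas >= 1.0 exposes json_normalize at the top level;
# older versions need: from pandas.io.json import json_normalize
from pandas import json_normalize

def mongo_to_dataframe(mongo_data):
    # Serialize BSON types (e.g. ObjectId) to JSON text, then parse it back into plain Python objects
    sanitized = json.loads(json_util.dumps(mongo_data))
    # Flatten nested documents; nested keys become dotted column names such as 'user.id'
    normalized = json_normalize(sanitized)
    df = pd.DataFrame(normalized)
    return df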
Just call the function with your Mongo data (for example, list(cursor)) as the argument.
sanitized = json.loads(json_util.dumps(mongo_data))
serializes the Mongo documents with json_util (which can handle BSON types such as ObjectId) and loads the result back as regular JSON
normalized = json_normalize(sanitized)
un-nests the data, so a nested field such as user.id ends up in its own column
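To illustrate the un-nesting step on its own, here is a minimal sketch that runs without a MongoDB connection. The docs list is made-up sample data shaped like the cursor output in the question (with _id left out), and it assumes pandas >= 1.0 so that pd.json_normalize is available.

```python
import pandas as pd

# Hypothetical sample documents shaped like the question's cursor output
# (the '_id' field is omitted so that no BSON types are involved).
docs = [
    {"id": 651795711403683840, "text": "first tweet", "user": {"id": 223528499}},
    {"id": 651795781557583872, "text": "second tweet", "user": {"id": 3544739837}},
]

# Nested keys are joined with '.' by default, so 'user' -> 'id' becomes 'user.id'.
flat = pd.json_normalize(docs)
print(list(flat.columns))

# The separator can be changed if dotted column names are inconvenient.
flat_underscore = pd.json_normalize(docs, sep="_")
print(list(flat_underscore.columns))
```

The sep argument is optional; an underscore can be handy because it keeps the column usable with attribute-style access such as flat_underscore.user_id.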
df = pd.DataFrame(normalized)
simply turns it into a dataframe (json_normalize already returns a DataFrame, so this step mostly makes the intent explicit)
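As a rough check of why the original approach produced NaN, the sketch below reproduces the problem with plain dictionaries instead of a live cursor. The docs list and the shortened id values are invented for illustration, and pandas >= 1.0 is assumed.

```python
import pandas as pd

# Hypothetical sample documents; each has a top-level 'user' key holding a nested dict.
docs = [
    {"id": 1, "text": "first tweet", "user": {"id": 223528499}},
    {"id": 2, "text": "second tweet", "user": {"id": 3544739837}},
]

fields = ["id", "text", "user.id"]

# The documents contain no top-level key literally named 'user.id',
# so asking for that column directly yields a column full of NaN.
naive = pd.DataFrame(docs, columns=fields)
print(naive["user.id"].isna().all())

# Flattening first creates the dotted column, which can then be selected by name.
fixed = pd.json_normalize(docs)[fields]
print(fixed["user.id"].tolist())
```

Selecting the columns after flattening keeps the data frame limited to the fields of interest, much like the columns argument in the question's code.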