Problem description
I have a pandas DataFrame with NaNs in it, like this:
import pandas as pd
import numpy as np
raw_data={'A':{1:2,2:3,3:4},'B':{1:np.nan,2:44,3:np.nan}}
data=pd.DataFrame(raw_data)
>>> data
   A   B
1  2 NaN
2  3  44
3  4 NaN
Now I want to make a dict out of it and remove the NaNs at the same time. The result should look like this:
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
But the pandas to_dict function gives me a result like this:
>>> data.to_dict()
{'A': {1: 2, 2: 3, 3: 4}, 'B': {1: nan, 2: 44.0, 3: nan}}
So how can I make a dict out of the DataFrame and get rid of the NaNs?
Recommended answer
There are many ways to accomplish this. I spent some time evaluating performance on a not-so-large (70k) dataframe. Although @der_die_das_jojo's answer works, it is also pretty slow.
The answer suggested by this question actually turns out to be about 5x faster on a large dataframe.
On my test dataframe (df):
The method above:
%time [ v.dropna().to_dict() for k,v in df.iterrows() ]
CPU times: user 51.2 s, sys: 0 ns, total: 51.2 s
Wall time: 50.9 s
Another slow method:
%time df.apply(lambda x: [x.dropna()], axis=1).to_dict(orient='rows')
CPU times: user 1min 8s, sys: 880 ms, total: 1min 8s
Wall time: 1min 8s
The fastest method I could find:
%time [ {k:v for k,v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='records')]
CPU times: user 14.5 s, sys: 176 ms, total: 14.7 s
Wall time: 14.7 s
This output is row-oriented: a list with one dictionary per row. You may need to adapt it if you want the column-oriented form shown in the question.
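For the column-oriented form, one possible adaptation is to drop the NaNs per column before converting. The following is a minimal sketch using the sample data from the question; the names column_dict and row_dicts are introduced here for illustration only, and I have not benchmarked this variant on the large dataframe. It uses orient='records', which is the spelling current pandas versions accept for the row-oriented layout.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
raw_data = {'A': {1: 2, 2: 3, 3: 4}, 'B': {1: np.nan, 2: 44, 3: np.nan}}
data = pd.DataFrame(raw_data)

# Column-oriented: one inner dict per column, with NaN entries dropped.
# DataFrame.items() yields (column name, Series) pairs.
column_dict = {col: values.dropna().to_dict() for col, values in data.items()}
print(column_dict)  # {'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

# Row-oriented, for comparison: one dict per row, NaN entries filtered out.
# Note that this layout does not keep the index labels.
row_dicts = [
    {k: v for k, v in row.items() if pd.notnull(v)}
    for row in data.to_dict(orient='records')
]
print(row_dicts)
```

The column-wise version keeps the original index labels as inner keys, which is what the question asks for, whereas the row-wise list discards them.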
I would be very interested if anyone finds an even faster answer to this question.
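To compare candidate approaches outside of IPython's %time, something like the sketch below could be used. The synthetic dataframe (2,000 rows, roughly 30% NaN) and the function names via_iterrows and via_records are assumptions made for illustration, not the original test data, so absolute timings will differ from the numbers above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the test dataframe (an assumption, not the original data).
rng = np.random.default_rng(0)
values = rng.random((2000, 5))
values[rng.random(values.shape) < 0.3] = np.nan
df = pd.DataFrame(values, columns=list('abcde'))


def via_iterrows(frame):
    # Slow approach: build a Series per row, drop NaNs, convert to dict.
    return [row.dropna().to_dict() for _, row in frame.iterrows()]


def via_records(frame):
    # Faster approach: convert once, then filter NaNs in plain Python.
    return [
        {k: v for k, v in rec.items() if pd.notnull(v)}
        for rec in frame.to_dict(orient='records')
    ]


# Both approaches should produce the same list of row dictionaries.
assert via_iterrows(df) == via_records(df)

for fn in (via_iterrows, via_records):
    seconds = timeit.timeit(lambda: fn(df), number=3)
    print(f"{fn.__name__}: {seconds / 3:.4f} s per run")
```

Checking that the outputs match before timing guards against comparing approaches that silently produce different results.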