Problem description
I have a pandas.DataFrame object with about 100 columns and 200,000 rows of data. I am trying to convert it to a boolean DataFrame where True means the value is at or above the threshold, False means it is below the threshold, and NaN values are preserved.
When there are no NaN values to handle, this takes about 60 ms to run:
df >= threshold
But when I try to deal with the NaNs, the method below works but is very slow (20 seconds):
def func(x):
    if x >= threshold:
        return True
    elif x < threshold:
        return False
    else:
        return x  # NaN fails both comparisons and is passed through unchanged

df.apply(lambda x: x.apply(lambda x: func(x)))
Is there a faster way?
Recommended answer
You can do:
import numpy as np

new_df = df >= threshold
new_df[df.isnull()] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
However, this differs from what the apply method produces. Here the mask has float dtype, containing NaN, 0.0, and 1.0. With the apply solution you get object dtype, containing NaN, False, and True.
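If the float representation is what you want, it can be requested explicitly instead of relying on how the in-place assignment upcasts, which may differ between pandas versions. The following is a minimal sketch; the small `df` and the `threshold` value are made-up illustration data, not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real 100 x 200000 frame.
threshold = 0.5
df = pd.DataFrame({"a": [0.1, 0.9, np.nan], "b": [np.nan, 0.5, 0.2]})

# Compare, convert True/False to 1.0/0.0, then blank out the positions
# that were NaN in the original data (where() keeps values only where
# the condition is True and fills the rest with NaN).
float_mask = (df >= threshold).astype(float).where(df.notna())
```

Every column of `float_mask` is float64, with NaN exactly where `df` had NaN.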
Neither is safe to use as a mask, because you might not get what you expect. IEEE 754 specifies that ordered and equality comparisons involving NaN (<, <=, >, >=, ==) yield False, and the apply method implicitly violates that by returning NaN.
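The NaN comparison behaviour is easy to check directly. This is a small illustrative sketch with made-up values; note that `!=` is the one comparison that evaluates to True for NaN.

```python
import numpy as np
import pandas as pd

nan = float("nan")

# Ordered and equality comparisons with NaN are all False.
ordered = [nan >= 1.0, nan < 1.0, nan == nan]
# Inequality is the exception: NaN is not equal to anything, itself included.
unequal = nan != nan

# pandas follows the same rule, so a plain comparison gives a clean
# bool mask in which NaN rows are simply False.
s = pd.Series([0.2, np.nan, 0.9])
mask = s >= 0.5
selected = s[mask]  # only the 0.9 row survives
```

Because `mask` is a plain bool Series, indexing with it never selects the NaN row.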
The best option is to keep track of the NaNs separately; df.isnull() is quite fast when the bottleneck package is installed.
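One way to track the NaNs separately is to keep two plain bool frames, one for the comparison and one for missingness, and combine them only when needed. This is a sketch under assumed sample data; the names `above`, `missing`, `valid_above`, and `valid_below` are illustrative, not from the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real frame.
threshold = 0.5
df = pd.DataFrame({"a": [0.1, 0.9, np.nan], "b": [np.nan, 0.5, 0.2]})

above = df >= threshold   # bool dtype; NaN positions compare as False
missing = df.isna()       # bool dtype; True where the original value is NaN

# Combine the two masks to separate "known below" from "unknown".
valid_above = above & ~missing
valid_below = ~above & ~missing
```

Both frames stay bool dtype, so they remain usable as masks, and the three states (above, below, missing) can still be told apart.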