Slow assignment of large pandas DataFrames with float32 and float64

This article looks at slow assignment into large pandas DataFrames mixing float32 and float64, and how to handle it; the question and answer below may serve as a useful reference.

Problem description


Assignments to a pandas DataFrame with mixed float32 and float64 dtypes are, the way I do it, rather slow for some combinations.

The code below sets up a DataFrame, performs a NumPy/SciPy computation on part of the data, sets up a new DataFrame by copying the old one, and assigns the result of the computation to the new DataFrame:

import pandas as pd
import numpy as np
from scipy.signal import lfilter

N = 1000
M = 1000

def f(dtype1, dtype2):
    coi = [str(m) for m in range(M)]
    df = pd.DataFrame([[m for m in range(M)] + ['Hello', 'World'] for n in range(N)],
                      columns=coi + ['A', 'B'], dtype=dtype1)
    Y = lfilter([1], [0.5, 0.5], df.ix[:, coi])
    Y = Y.astype(dtype2)
    new = pd.DataFrame(df, copy=True)
    print(new.iloc[0, 0].dtype)
    print(Y.dtype)
    new.ix[:, coi] = Y    # This statement is considerably slow
    print(new.iloc[0, 0].dtype)


from time import time

dtypes = [np.float32, np.float64]
for dtype1 in dtypes:
    for dtype2 in dtypes:
        print('-' * 10)
        start_time = time()
        f(dtype1, dtype2)
        print(time() - start_time)

The timing result is:

----------
float32
float32
float64
10.1998147964
----------
float32
float64
float64
10.2371120453
----------
float64
float32
float64
0.864870071411
----------
float64
float64
float64
0.866265058517

Here the critical line is new.ix[:, coi] = Y: It is ten times as slow for some combinations.

I can understand that some reallocation overhead is needed when a float32 DataFrame is assigned float64 values, but why is the overhead so dramatic?

Furthermore, even the float32-to-float32 assignment is slow, and its result is float64, which also bothers me.
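To see the dtype behaviour in isolation, here is a minimal sketch of my own (reduced sizes, no `lfilter`, and `.loc` instead of `.ix`, since `.ix` has since been removed from pandas):

```python
import numpy as np
import pandas as pd

# Reduced stand-in for the setup above: a consolidated float32 block of
# numeric columns plus two object columns, as in the benchmark frame.
N, M = 50, 10
coi = [str(m) for m in range(M)]
df = pd.DataFrame(np.ones((N, M), dtype=np.float32), columns=coi)
df['A'] = 'Hello'
df['B'] = 'World'

# A float64 result array, playing the role of the lfilter output.
Y = df[coi].to_numpy(dtype=np.float64) * 0.5

new = df.copy()
print(new[coi].dtypes.unique())   # dtypes before assignment
new.loc[:, coi] = Y               # whole-block, cross-dtype assignment
print(new[coi].dtypes.unique())   # dtypes after assignment
```

Whether the columns end up float32 or float64 here depends on the pandas version; the slow path in the question comes from pandas having to split and rebuild the consolidated numeric block when the incoming dtype does not match the existing one.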

Solution

Single-column assignment does not change the type, and iterating over the columns with a for-loop seems reasonably fast for non-type-casting assignments, both float32 and float64. For assignments that involve type casting, performance is usually about twice as bad as the worst multi-column assignment:

import pandas as pd
import numpy as np
from scipy.signal import lfilter

N = 1000
M = 1000

def f(dtype1, dtype2):
    coi = [str(m) for m in range(M)]
    df = pd.DataFrame([[m for m in range(M)] + ['Hello', 'World'] for n in range(N)],
                      columns=coi + ['A', 'B'], dtype=dtype1)
    Y = lfilter([1], [0.5, 0.5], df.ix[:, coi])
    Y = Y.astype(dtype2)
    new = df.copy()
    print(new.iloc[0, 0].dtype)
    print(Y.dtype)
    for n, column in enumerate(coi):  # for-loop over the columns of new
        new.ix[:, column] = Y[:, n]
    print(new.iloc[0, 0].dtype)

from time import time

dtypes = [np.float32, np.float64]
for dtype1 in dtypes:
    for dtype2 in dtypes:
        print('-' * 10)
        start_time = time()
        f(dtype1, dtype2)
        print(time() - start_time)

The result is:

----------
float32
float32
float32
0.809890985489
----------
float32
float64
float64
21.4767119884
----------
float64
float32
float32
20.5611870289
----------
float64
float64
float64
0.765362977982
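A further option, not from the original answer but a common workaround, is to avoid assigning into the copied frame at all and instead build the result directly from the filtered array, so the numeric block is created once with the right dtype (again a sketch with reduced sizes and no `lfilter`):

```python
import numpy as np
import pandas as pd

N, M = 50, 10
coi = [str(m) for m in range(M)]
df = pd.DataFrame(np.ones((N, M), dtype=np.float32), columns=coi)
df['A'] = 'Hello'
df['B'] = 'World'

# Stand-in for the lfilter output.
Y = df[coi].to_numpy(dtype=np.float64) * 0.5

# Construct the result instead of assigning into a copy: the numeric block
# is created once with Y's dtype, so no cross-dtype block rewrite happens.
new = pd.DataFrame(Y, columns=coi, index=df.index)
new['A'] = df['A']
new['B'] = df['B']
print(new.dtypes.value_counts())
```

Attaching the two object columns afterwards only appends new blocks; it never touches the float64 block that was just built.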

This concludes the article on slow assignment of large pandas DataFrames with float32 and float64; hopefully the answer above is a useful reference.
