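I have several columns of numeric data in a DataFrame. I want to quartile each column, replacing every value with q1, q2, q3 or q4.
I currently loop over each column and convert it with the pandas qcut function: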
for column_name in df.columns:
    df[column_name] = pd.qcut(df[column_name].astype('float'), 4, labels=['q1', 'q2', 'q3', 'q4'])
This is slow! Is there a faster way?
Best answer
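Play around with the timing script below. It looks like the conversion from string to float is what adds the time, although no working example was provided, so the original dtype cannot be known. df[column].astype(copy=) appears to perform about the same whether or not it copies, so there is not much to gain there.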
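Because the question does not say which dtype the columns hold, one way to see how much the string parsing costs is to time astype('float') on the same values stored once as strings and once as integers. This is only a sketch: the frame `demo`, its column names and the helper `time_astype` are made up for illustration, and the timings will differ by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical data: identical integers stored as strings and as ints.
rng = np.random.default_rng(2)
values = rng.integers(1, 100, size=100_000)
demo = pd.DataFrame({"as_str": values.astype(str), "as_int": values})

# Inspect the dtypes first: this is what the question left unspecified.
print(demo.dtypes)


def time_astype(series):
    """Return the converted Series and the seconds astype('float') took."""
    start = time.perf_counter()
    converted = series.astype("float")
    return converted, time.perf_counter() - start


from_str, t_str = time_astype(demo["as_str"])
from_int, t_int = time_astype(demo["as_int"])
print("str -> float: %.4f s" % t_str)
print("int -> float: %.4f s" % t_int)
```

If the string column is clearly slower, converting the data once up front (or reading it as numeric in the first place) is probably worth more than tuning the qcut call itself.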
import pandas as pd
import numpy as np
import random
import time

random.seed(2)
indexes = [i for i in range(1, 10000) for _ in range(10)]
# String-typed columns, so that astype('float') has to parse every value.
df = pd.DataFrame({'A': indexes, 'B': [str(random.randint(1, 99)) for e in indexes], 'C': [str(random.randint(1, 99)) for e in indexes], 'D': [str(random.randint(1, 99)) for e in indexes]})
# Integer-typed alternative: swap this in to time the run without string parsing.
#df = pd.DataFrame({'A': indexes, 'B': [random.randint(1, 99) for e in indexes], 'C': [random.randint(1, 99) for e in indexes], 'D': [random.randint(1, 99) for e in indexes]})
df_result = pd.DataFrame({'A': indexes, 'B': [random.randint(1, 99) for e in indexes], 'C': [random.randint(1, 99) for e in indexes], 'D': [random.randint(1, 99) for e in indexes]})

def qcut(copy, x):
    # Quartile every column of df and store the labels as new columns of df_result.
    for i, column_name in enumerate(df.columns):
        s = pd.qcut(df[column_name].astype('float', copy=copy), 4, labels=['q1', 'q2', 'q3', 'q4'])
        df_result["col %d %d" % (x, i)] = s.values

# Time 10 runs with copy=True. time.clock() was removed in Python 3.8, so use perf_counter().
times = []
for x in range(0, 10):
    a = time.perf_counter()
    qcut(True, x)
    b = time.perf_counter()
    times.append(b - a)
print(np.mean(times))

# Time 10 runs with copy=False. Reset the list so the two means are not mixed together.
times = []
for x in range(10, 20):
    a = time.perf_counter()
    qcut(False, x)
    b = time.perf_counter()
    times.append(b - a)
print(np.mean(times))
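If the per-column loop itself is still too slow after the dtype is sorted out, one alternative is to skip qcut and compute the quartiles for all columns at once with NumPy. This is not the method timed above; it is a swapped-in technique, sketched under some assumptions: no NaN values, every column has three distinct inner quartile edges (qcut would raise on duplicates, this would not), and plain string labels are acceptable instead of a Categorical. The helper name `quartile_labels` and the sample frame are made up for illustration.

```python
import numpy as np
import pandas as pd

LABELS = np.array(["q1", "q2", "q3", "q4"])


def quartile_labels(frame):
    """Label every value with its column-wise quartile (hypothetical helper)."""
    # Convert the whole frame to float once instead of once per column.
    values = frame.astype("float").to_numpy()
    # Inner quartile edges per column, shape (3, n_columns).
    edges = np.quantile(values, [0.25, 0.5, 0.75], axis=0)
    # Count how many edges each value strictly exceeds, giving codes 0..3.
    # A value equal to an edge stays in the lower bin, like qcut's right-closed bins.
    codes = (values[None, :, :] > edges[:, None, :]).sum(axis=0)
    return pd.DataFrame(LABELS[codes], index=frame.index, columns=frame.columns)


# Hypothetical sample: string-typed integers, as in the timing script.
rng = np.random.default_rng(2)
sample = pd.DataFrame(
    rng.integers(1, 100, size=(1000, 3)).astype(str), columns=["B", "C", "D"]
)
result = quartile_labels(sample)
print(result.head())
```

Whether this actually beats the qcut loop depends on the frame's shape and dtypes, so it should be timed on the real data before adopting it.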
Source: the Stack Overflow question "python - Optimizing quartiling of the columns of a pandas DataFrame?": https://stackoverflow.com/questions/55184330/