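I have a 12-million-row dataset with 3 columns serving as unique identifiers and another 2 columns holding values. I'm trying to do a fairly simple task:
- Group by the three identifiers. This yields about 2.6 million unique combinations.
- Task 1: compute the median of column Val1.
- Task 2: compute the mean of column Val2, given some condition on Val1.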

Here are my results using pandas and data.table (both at their latest versions at the time, on the same machine):

+-----------------+-----------------+------------+
|                 |      pandas     | data.table |
+-----------------+-----------------+------------+
| TASK 1          | 150 seconds     | 4 seconds  |
| TASK 1 + TASK 2 |  doesn't finish | 5 seconds  |
+-----------------+-----------------+------------+

I suspect I'm doing something wrong with pandas: converting Grp1 and Grp2 to categories didn't help much, and neither did switching between .agg and .apply. Any ideas?
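For reference, the category conversion mentioned above can be sketched as follows on a made-up four-row frame. The `demo` data and the `observed=True` argument are assumptions for illustration, not part of the original benchmark. Depending on the pandas version, grouping on categorical keys without `observed=True` may enumerate every combination of categories, including ones that never occur in the data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real 12-million-row frame.
demo = pd.DataFrame({
    'Grp1': ['a', 'a', 'b', 'b'],
    'Grp2': ['x', 'x', 'y', 'y'],
    'Val1': [1, 3, 5, 9],
})

# Convert the string keys to the categorical dtype.
for col in ['Grp1', 'Grp2']:
    demo[col] = demo[col].astype('category')

# observed=True keeps only the key combinations that actually appear.
medians = demo.groupby(['Grp1', 'Grp2'], observed=True)['Val1'].median()
print(medians)
```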

Reproducible code is below.
Data frame generation:
import numpy as np
import pandas as pd
from collections import OrderedDict
import time

np.random.seed(123)
list1 = list(pd.util.testing.rands_array(10, 750))
list2 = list(pd.util.testing.rands_array(10, 700))
list3 = list(np.random.randint(100000,200000,5))

N = 12 * 10**6 # please make sure you have enough RAM
df = pd.DataFrame({'Grp1': np.random.choice(list1, N, replace = True),
                   'Grp2': np.random.choice(list2, N, replace = True),
                   'Grp3': np.random.choice(list3, N, replace = True),
                   'Val1': np.random.randint(0,100,N),
                   'Val2': np.random.randint(0,10,N)})


# this works and shows there are 2,625,000 unique combinations
df_test = df.groupby(['Grp1','Grp2','Grp3']).size()
print(df_test.shape[0]) # 2,625,000 rows

# export to feather so that same df goes into R
df.to_feather('file.feather')

Task 1 in Python:
# TASK 1: 150 seconds (sorted / not sorted doesn't seem to matter)
df.sort_values(['Grp1','Grp2','Grp3'], inplace = True)
t0 = time.time()
df_agg1 = df.groupby(['Grp1','Grp2','Grp3']).agg({'Val1':[np.median]})
t1 = time.time()
print("Duration for complex: %s seconds ---" % (t1 - t0))

Task 1 + Task 2 in Python:
# TASK 1 + TASK 2: this kept running for 10 minutes to no avail
# (sorted / not sorted doesn't seem to matter)
def f(x):
    d = OrderedDict()
    d['Median_all'] = np.median(x['Val1'])
    d['Median_lt_5'] = np.median(x['Val1'][x['Val2'] < 5])
    return pd.Series(d)

t0 = time.time()
df_agg2 = df.groupby(['Grp1','Grp2','Grp3']).apply(f)
t1 = time.time()
print("Duration for complex: %s seconds ---" % (t1 - t0)) # didn't complete

Equivalent R code:
library(data.table)
library(feather)

DT = setDT(feater("file.feather"))
system.time({
DT_agg <- DT[,.(Median_all = median(Val1),
                Median_lt_5 = median(Val1[Val2 < 5])  ), by = c('Grp1','Grp2','Grp3')]
}) # 5 seconds

Best answer

I couldn't reproduce your R results. I fixed the typo where feather was misspelled, but then got the following:

Error in `[.data.table`(DT, , .(Median_all = median(Val1), Median_lt_5 = median(Val1[Val2 <  :
column or expression 1 of 'by' or 'keyby' is type NULL. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]

For the Python example, if you want the per-group median restricted to rows where Val2 is less than 5, you should filter first, like this:
 df[df.Val2 < 5].groupby(['Grp1','Grp2','Grp3'])['Val1'].median()

This completes in under 8 seconds on a MacBook Pro.
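To get both medians from the question in a single groupby without `.apply`, one option is to replace Val1 with NaN wherever the condition fails and let the built-in groupby median skip those values. This is only a sketch under assumptions: the helper name `group_medians`, the intermediate column `Val1_lt_5`, and the toy data are hypothetical, and it has not been timed on the 12-million-row frame.

```python
import numpy as np
import pandas as pd

def group_medians(df: pd.DataFrame) -> pd.DataFrame:
    """Median of Val1 per group, overall and over rows where Val2 < 5."""
    keys = ['Grp1', 'Grp2', 'Grp3']
    # Keep Val1 where Val2 < 5 and put NaN elsewhere; the groupby
    # median ignores NaN, so only the matching rows contribute.
    masked = df.assign(Val1_lt_5=df['Val1'].where(df['Val2'] < 5))
    return masked.groupby(keys, sort=False).agg(
        Median_all=('Val1', 'median'),
        Median_lt_5=('Val1_lt_5', 'median'),
    )

# Hypothetical toy data with two groups.
demo = pd.DataFrame({
    'Grp1': ['a', 'a', 'a', 'b', 'b'],
    'Grp2': ['x', 'x', 'x', 'y', 'y'],
    'Grp3': [1, 1, 1, 2, 2],
    'Val1': [10, 20, 60, 40, 50],
    'Val2': [1, 7, 3, 8, 9],
})
result = group_medians(demo)
print(result)
```

A group with no rows satisfying Val2 < 5 ends up with NaN in Median_lt_5, which is comparable to the NA that R's median returns for an empty vector.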

Regarding "r - Why is R's data.table so much faster than pandas?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49322036/
