My DataFrame contains 10,000,000 rows! After grouping, there are still roughly 9,000,000 sub-frames to loop through.

The code is:

import pandas as pd

data = pd.read_csv('big.csv')
for id, new_df in data.groupby(level=0):  # look at mini df and do some analysis
    # some code for each of the small data frames

This is extremely inefficient, and the code has already been running for more than ten hours.

Is there any way to speed it up?

Full code:

d = pd.DataFrame()  # new df to populate
print('Start of the loop')
for id, new_df in data.groupby(level=0):
    c = [new_df.iloc[i:] for i in range(len(new_df.index))]
    x = pd.concat(c, keys=new_df.index).reset_index(level=(2, 3), drop=True).reset_index()
    x = x.set_index(['level_0', 'level_1', x.groupby(['level_0', 'level_1']).cumcount()])
    d = pd.concat([d, x])
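Independent of the per-group work, `d = pd.concat([d, x])` inside the loop is itself a known performance trap: each iteration copies everything accumulated so far, giving quadratic behavior. Collecting the pieces in a list and concatenating once avoids that. A minimal sketch on a hypothetical small frame (the data and the `* 2` "analysis" are placeholders, not the real computation):

```python
import pandas as pd

# Hypothetical stand-in for the real (id, date)-indexed data.
data = pd.DataFrame(
    {'v': range(6)},
    index=pd.MultiIndex.from_tuples(
        [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (2, 'c'), (3, 'a')],
        names=['id', 'date']))

pieces = []  # collect per-group results instead of growing a DataFrame
for gid, new_df in data.groupby(level=0):
    pieces.append(new_df * 2)  # placeholder for the real per-group analysis

d = pd.concat(pieces)  # one concat: linear copying instead of quadratic
```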

To get the data:
data = pd.read_csv('https://raw.githubusercontent.com/skiler07/data/master/so_data.csv', index_col=0).set_index(['id','date'])

Note:

Most IDs have only 1 date, which indicates a single visit. For IDs with more visits, I want to structure them in a 3D format, e.g. storing all of their visits in the 2nd of the 3 dimensions. The output shape is (id, number of visits, features).
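For concreteness, the (id, number of visits, features) layout described above can be represented in pandas as a MultiIndex of (id, visit number) over the feature columns; `groupby(...).cumcount()` numbers the visits per id. A hypothetical sketch (names `df`, `f0`, `visit` are illustrative, not from the real data):

```python
import pandas as pd

# Hypothetical frame: one id with a single visit, one with three.
df = pd.DataFrame({
    'id':   [1, 2, 2, 2],
    'date': ['d1', 'd1', 'd2', 'd3'],
    'f0':   [0.1, 0.2, 0.3, 0.4],
})

# Number visits within each id, giving the (id, visit, features) layout.
df['visit'] = df.groupby('id').cumcount()
out = df.set_index(['id', 'visit'])
```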

Best answer

Here is one way to speed this up. It adds the required new rows in code that processes the rows directly, which saves the overhead of repeatedly constructing small DataFrames. Your 100,000-row sample runs in a few seconds on my machine, while your code takes more than 100 seconds on a sample of only 10,000 rows. That represents an improvement of a couple of orders of magnitude.

Code:

import pandas as pd

def make_3d(csv_filename):

    # Walk the sorted rows once as plain lists, numbering repeat visits
    # per (id, date) in the 'depth' column instead of building per-group frames.
    def make_3d_lines(a_df):
        a_df['depth'] = 0
        depth = 0
        prev = None
        accum = []
        for row in a_df.values.tolist():
            row[0] = 0
            key = row[1]
            if key == prev:
                depth += 1
                accum.append(row)
            else:
                if depth == 0:
                    yield row
                else:
                    depth = 0
                    to_emit = []
                    for i in range(len(accum)):
                        date = accum[i][2]
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j
                            to_emit[-1][2] = date
                    for r in to_emit[1:]:
                        yield r
                accum = [row]
            prev = key

    df_data = pd.read_csv(csv_filename)
    df_data.columns = ['depth'] + list(df_data.columns)[1:]

    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split())),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())

    return new_df.set_index('id date'.split())

Test code:

import time

start_time = time.time()
df = make_3d('big-data.csv')
print(time.time() - start_time)

df = df.drop(columns=['feature%d' % i for i in range(3, 25)])
print(df[df['depth'] != 0].head(10))

Results:
1.7390995025634766

                          depth  feature0  feature1  feature2
id              date
207555809644681 20180104      1   0.03125  0.038623  0.008130
247833985674646 20180106      1   0.03125  0.004378  0.004065
252945024181083 20180107      1   0.03125  0.062836  0.065041
                20180107      2   0.00000  0.001870  0.008130
                20180109      1   0.00000  0.001870  0.008130
329567241731951 20180117      1   0.00000  0.041952  0.004065
                20180117      2   0.03125  0.003101  0.004065
                20180117      3   0.00000  0.030780  0.004065
                20180118      1   0.03125  0.003101  0.004065
                20180118      2   0.00000  0.030780  0.004065
