我想将一个数据框处理为以下格式。

projectid    vendor_name    project_resource_type    item_quantity
12345        amazon            tech                       5
12345        best buy          supplies                   2
abcde        amazon            tech                       1


总之,我想将数据框处理成这样

projectid    amazon best_buy tech supplies total_quantity
12345           1.      1.    1.    1.         7
abcde           1.      0.    0.    1.         1.


所以我做了以下

has_vendors = pd.get_dummies(resources.vendor_name, prefix='has_vendor')
resources.drop('vendor_name', 1, inplace=True)
resources = pd.merge(resources, has_vendors, left_index=True, right_index=True, how='outer')

print 'merging resource type dummies'
resources_types = pd.get_dummies(resources.project_resource_type, prefix='has_resource_type')
resources.drop('project_resource_type', 1, inplace=True)
resources = pd.merge(resources, resources_types, left_index=True, right_index=True, how='outer')

gb = resources.groupby('projectid')

columns = [x for x in resources.columns.values if 'has_vendor' in x or 'has_resource_type' in x]
all_cols = [x for x in resources.columns.values if 'has_vendor' in x or 'has_resource_type' in x]
all_cols.append('total_quantity')

def group(x):
    vals = []
    for i,col in enumerate(columns):
        v = np.any(x[col]) + 0.
        vals.append(v)

    su = np.sum(x['item_quantity'])
    vals.append(su)

    return pd.Series(vals, index=all_cols)

resources_agg = gb.apply(group)


事实是,我发现gb.apply(group)函数的速度太慢,只有约650000个唯一的项目ID。还有其他方法可以加快速度吗?

最佳答案

您可以尝试pivot_table是否会更快:

>>> aggrfn = lambda ts: 1 if 0 < ts.sum() else 0
>>> df.pivot_table('item_quantity', 'projectid', 'vendor_name', aggrfn, 0)
vendor_name  amazon  best buy
projectid
12345             1         1
abcde             1         0

>>> df.pivot_table('item_quantity', 'projectid', 'project_resource_type', aggrfn, 0)
project_resource_type  supplies  tech
projectid
12345                         1     1
abcde                         0     1

>>> df.groupby('projectid')['item_quantity'].aggregate({'total_quantity':'sum'})
           total_quantity
projectid
12345                   7
abcde                   1


如果objs是包含上述结果的列表,则可以与它们一起加入:

>>> pd.concat(objs, axis=1)
           amazon  best buy  supplies  tech  total_quantity
projectid
12345           1         1         1     1               7
abcde           1         0         0     1               1

关于python - Pandas groupby申请如何加快,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/24552058/

10-12 13:42