我想将一个数据框处理为以下格式。
projectid vendor_name project_resource_type item_quantity
12345 amazon tech 5
12345 best buy supplies 2
abcde amazon tech 1
总之,我想将数据框处理成这样
projectid amazon best_buy tech supplies total_quantity
12345 1. 1. 1. 1. 7
abcde 1. 0. 0. 1. 1.
所以我做了以下
has_vendors = pd.get_dummies(resources.vendor_name, prefix='has_vendor')
resources.drop('vendor_name', 1, inplace=True)
resources = pd.merge(resources, has_vendors, left_index=True, right_index=True, how='outer')
print 'merging resource type dummies'
resources_types = pd.get_dummies(resources.project_resource_type, prefix='has_resource_type')
resources.drop('project_resource_type', 1, inplace=True)
resources = pd.merge(resources, resources_types, left_index=True, right_index=True, how='outer')
gb = resources.groupby('projectid')
columns = [x for x in resources.columns.values if 'has_vendor' in x or 'has_resource_type' in x]
all_cols = [x for x in resources.columns.values if 'has_vendor' in x or 'has_resource_type' in x]
all_cols.append('total_quantity')
def group(x):
vals = []
for i,col in enumerate(columns):
v = np.any(x[col]) + 0.
vals.append(v)
su = np.sum(x['item_quantity'])
vals.append(su)
return pd.Series(vals, index=all_cols)
resources_agg = gb.apply(group)
事实是,我发现
gb.apply(group)
函数的速度太慢,只有约650000个唯一的项目ID。还有其他方法可以加快速度吗? 最佳答案
您可以尝试pivot_table
是否会更快:
>>> aggrfn = lambda ts: 1 if 0 < ts.sum() else 0
>>> df.pivot_table('item_quantity', 'projectid', 'vendor_name', aggrfn, 0)
vendor_name amazon best buy
projectid
12345 1 1
abcde 1 0
>>> df.pivot_table('item_quantity', 'projectid', 'project_resource_type', aggrfn, 0)
project_resource_type supplies tech
projectid
12345 1 1
abcde 0 1
>>> df.groupby('projectid')['item_quantity'].aggregate({'total_quantity':'sum'})
total_quantity
projectid
12345 7
abcde 1
如果
objs
是包含上述结果的列表,则可以与它们一起加入:>>> pd.concat(objs, axis=1)
amazon best buy supplies tech total_quantity
projectid
12345 1 1 1 1 7
abcde 1 0 0 1 1
关于python - Pandas groupby申请如何加快,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/24552058/