我有格式的熊猫数据框:
line_idno item_name \
sitename ts_placed order_idno
www.mattressesworld.co.uk 47 5242367 4112061 a
www.bedroomfurnitureworld.co.uk 47 5242295 4111977 b
5242295 4111979 a
5242295 4111978 v
5242295 4111980 a
www.mattressesworld.co.uk 47 5242300 4111986 b
www.bedroomfurnitureworld.co.uk 47 5242294 4111973 v
这有3个索引(“ sitename”,“ ts_placed”,“ order_idno”)
哪里:
“ ts_placed”代表一年中的几周
“ sitename”代表站点名称
'order_idno'订单ID号
和5列(“ line_idno”,“ item_name”,“ item_qty”,“ item_price”,“ revenue”)。
通过以下功能将数据分为多个级别:
grouped = data.groupby(level=['sitename','ts_placed','order_idno']).sum()
grouped0 = grouped.dropna()
结果:
line_idno item_qty \
sitename ts_placed order_idno
www.bedroomfurnitureworld.co.uk 38 5156953 3994322 1
5156956 3994325 1
5157144 3994580 1
5157191 3994641 0
5157198 3994651 1
5157217 3994678 2
5157218 3994679 2
5157233 3994697 1
5157257 7989463 2
我想获得的是每个站点每周平均收入的细分。换句话说,每个ts_placed /网站名称组的所有收入之和除以每个ts_placed的行数。
最佳答案
这是一个可重现的示例。
import pandas as pd
import numpy as np
# simulate some artificial data
np.random.seed(1)
sites = ['www.{}.co.uk'.format(x) for x in 'AAA BBB CCC DDD EEE'.split()]
sitename = np.random.choice(sites, size=1000)
ts_placed = np.random.choice(np.arange(47, 53), size=1000)
order_idno = np.random.choice(np.arange(520000, 550000), size=1000)
item_name = np.random.choice('a b c d e f'.split(), size=1000)
line_idno = np.random.choice(np.arange(3960000, 4000000), size=1000)
item_qty = np.random.choice(np.arange(0, 10), size=1000)
item_price = np.random.choice(np.arange(1000, 10000), size=1000)
revenue = item_price * item_qty
data = pd.DataFrame(dict(sitename=sitename, ts_placed=ts_placed, order_idno=order_idno, item_name=item_name, line_idno=line_idno, item_qty=item_qty, item_price=item_price, revenue=revenue)).set_index(['sitename', 'ts_placed', 'order_idno'])
Out[17]:
item_name item_price item_qty line_idno revenue
sitename ts_placed order_idno
www.DDD.co.uk 47 526418 a 4514 1 3989144 4514
www.EEE.co.uk 52 539155 d 4952 5 3965922 24760
www.AAA.co.uk 52 539417 d 8816 0 3988185 0
www.BBB.co.uk 49 523800 b 3340 3 3981971 10020
www.DDD.co.uk 48 521464 f 4402 6 3976820 26412
www.AAA.co.uk 49 521706 c 8436 5 3963275 42180
52 544452 c 7220 8 3992357 57760
www.BBB.co.uk 50 548184 d 3389 9 3976608 30501
www.EEE.co.uk 49 527830 f 8110 1 3998527 8110
521908 a 7292 4 3964393 29168
www.BBB.co.uk 47 527558 b 4945 6 3977830 29670
www.CCC.co.uk 47 549572 f 3350 5 3988678 16750
www.EEE.co.uk 48 522511 f 1865 0 3992356 0
www.CCC.co.uk 51 520156 e 4717 8 3974344 37736
www.EEE.co.uk 50 534951 b 3738 9 3978519 33642
... ... ... ... ... ...
www.CCC.co.uk 50 525279 e 5961 0 3980873 0
www.DDD.co.uk 48 539486 c 2028 4 3978442 8112
www.EEE.co.uk 48 543216 e 3721 6 3986919 22326
www.BBB.co.uk 51 525662 c 1264 7 3987129 8848
www.CCC.co.uk 52 546208 e 7287 4 3999828 29148
www.AAA.co.uk 48 544288 a 7708 1 3974546 7708
www.DDD.co.uk 52 538708 f 9080 7 3983499 63560
www.CCC.co.uk 48 536774 a 8971 2 3968092 17942
www.BBB.co.uk 48 528310 c 3284 2 3985896 6568
www.AAA.co.uk 49 549547 c 4265 4 3960981 17060
544394 c 2268 8 3982739 18144
52 540515 f 4476 5 3987786 22380
www.EEE.co.uk 50 540388 f 1226 5 3980156 6130
47 522633 f 4185 5 3964986 20925
www.CCC.co.uk 49 532710 c 7462 2 3984676 14924
[1000 rows x 5 columns]
# your custom apply funciton
def apply_func(group):
avg_revenue = group.revenue.mean()
count_unique_order = len(group.order_idno.unique())
# or try this
# count_unique_order = group.order_idno.value_counts().count()
return pd.Series({'avg_revenue': avg_revenue, 'count_unique_order': count_unique_order})
# use the customized apply funciton
data.reset_index(level='order_idno').dropna().groupby(level=['sitename', 'ts_placed']).apply(apply_func)
Out[46]:
avg_revenue count_unique_order
sitename ts_placed
www.AAA.co.uk 47 23501.8158 10
48 23003.9355 10
49 24254.1212 10
50 23254.6410 10
51 19173.8966 10
52 23845.6786 10
www.BBB.co.uk 47 26136.0882 10
48 23007.3929 9
49 30605.2857 10
50 19530.3871 10
51 21768.6667 9
52 28387.5455 10
www.CCC.co.uk 47 28917.3448 9
48 23659.3488 10
49 26209.0625 9
50 22929.2564 10
51 23474.2857 9
52 22123.3429 10
www.DDD.co.uk 47 27176.2778 10
48 24530.6154 10
49 23601.8710 9
50 27749.2162 10
51 26816.0000 9
52 29910.5455 10
www.EEE.co.uk 47 27270.6471 10
48 23498.0789 10
49 26682.4250 10
50 24524.4400 10
51 15635.2500 10
52 20917.2500 10
关于python - 如何在python的groupby函数中获取计数,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/31077896/