pandas Data Processing

1. Removing Duplicate Rows

Use the duplicated() function to detect duplicate rows. It returns a boolean Series with one element per row; an element is True if its row is not the first occurrence of that content.

  • keep parameter: specifies which of the duplicated rows to keep

  • Create a DataFrame containing duplicated rows

import numpy as np
import pandas as pd
from pandas import DataFrame
# Create a DataFrame
df = DataFrame(data=np.random.randint(0,100,size=(12,7)))
df

(output: 12 rows × 7 columns of random integers)
# Manually set several rows of df to the same content
df.iloc[1] = [6,6,6,6,6,6,6]
df.iloc[8] = [6,6,6,6,6,6,6]
df.iloc[5] = [6,6,6,6,6,6,6]
df

(output: the DataFrame with rows 1, 5 and 8 all set to 6)
  • Use the drop_duplicates() function to delete duplicate rows

    • drop_duplicates(keep='first'/'last'/False)
df.drop_duplicates(keep='last')  # keep the last of each set of duplicated rows

(output: 10 rows × 7 columns; of the duplicated rows 1, 5 and 8, only row 8 remains)
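The duplicated() function described earlier returns the boolean mask that drop_duplicates() uses internally, so it can also be used to filter rows directly. A minimal sketch mirroring the all-6 rows above (a fixed seed is assumed only for reproducibility):

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

np.random.seed(0)  # assumed seed, only so the example is reproducible
df = DataFrame(data=np.random.randint(0, 100, size=(12, 7)))
df.iloc[1] = [6] * 7
df.iloc[5] = [6] * 7
df.iloc[8] = [6] * 7

# duplicated() flags every occurrence after the first as True,
# so rows 5 and 8 are flagged while row 1 (the first all-6 row) is not
mask = df.duplicated(keep='first')

# keeping the non-flagged rows is equivalent to drop_duplicates(keep='first')
cleaned = df.loc[~mask]
```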

2. Mapping

1) The replace() function: replacing elements

DataFrame replace operations

  • Single-value replacement

    • Plain replacement: replaces every matching element: to_replace=15, value='e'
    • Per-column single-value replacement: to_replace={column label: value to replace}, value='value'
  • Multi-value replacement

    • List replacement: to_replace=[] value=[]
    • Dict replacement (recommended): to_replace={to_replace:value, to_replace:value}
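A short sketch of the two multi-value forms above, on toy values chosen just for illustration:

```python
import pandas as pd
from pandas import DataFrame

df = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# List replacement: the two lists are paired positionally (1 -> 'one', 4 -> 'four')
r1 = df.replace(to_replace=[1, 4], value=['one', 'four'])

# Dict replacement (recommended): each mapping is explicit
r2 = df.replace(to_replace={1: 'one', 4: 'four'})

print(r1.equals(r2))  # True: both forms give the same result
```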
df

(output: the DataFrame with the all-6 rows 1, 5 and 8)

Note: with a DataFrame, the method and limit parameters of replace() cannot be used

df.replace(to_replace=6,value='six') # replace every 6 with 'six'; the original DataFrame is unchanged by default

(output: every 6 replaced with 'six')
df.replace(to_replace={6:'six'})  # same effect as above

(output: every 6 replaced with 'six')
df.replace(to_replace={5:6},value='six')  # replace the 6s in column 5 with 'six'

(output: only the 6s in column 5 are replaced)

2) The map() function

Creating a new column. Note that map() is not a DataFrame method; it is a Series method.

  • map() can map a new column of data

  • map() can take a lambda expression

  • map() can take a function, including a user-defined one

    e.g. map({to_replace: value})

  • Note: functions such as sum, and for loops, cannot be used inside map()

  • Add a new column: add a column to df whose values are the English names corresponding to the Chinese names

dic = {
    'name': ['张三', '周杰伦', '张三'],
    'salary': [20000, 10000, 20000]
}
df = DataFrame(data=dic)
df

  name  salary
0   张三   20000
1  周杰伦   10000
2   张三   20000
# Mapping table
dic = {
    '张三': 'tom',
    '周杰伦': 'jay'
}
df['e_name'] = df['name'].map(dic)
df

  name  salary e_name
0   张三   20000    tom
1  周杰伦   10000    jay
2   张三   20000    tom

Think of map() as a computation tool; which computation it performs is determined by map()'s argument (a lambda or a function).

  • Using a custom function
def after_sal(s):
    # pay 50% tax on the portion above 3000
    return s - (s - 3000) * 0.5
df['after_sal'] = df['salary'].map(after_sal)
df

  name  salary e_name  after_sal
0   张三   20000    tom    11500.0
1  周杰伦   10000    jay     6500.0
2   张三   20000    tom    11500.0
df['after_sal'] = df['salary'].apply(after_sal)  # apply is more efficient than map
df

  name  salary e_name  after_sal
0   张三   20000    tom    11500.0
1  周杰伦   10000    jay     6500.0
2   张三   20000    tom    11500.0
  • Using a lambda expression
df['after_sal'] = df['salary'].apply(lambda x:x-(x-3000)*0.5)  # map works too
df

  name  salary e_name  after_sal
0   张三   20000    tom    11500.0
1  周杰伦   10000    jay     6500.0
2   张三   20000    tom    11500.0

Note: not every function can serve as map()'s argument. Only a function that takes a single parameter and has a return value can be passed to map().

3. Detecting and Filtering Outliers with Aggregation

The df.std() function returns the standard deviation of each column of a DataFrame

  • Create a 1000-row, 3-column df with values in (0, 1) and compute each column's standard deviation
df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
df

(output: 1000 rows × 3 columns of random floats)

Apply a filter to df to remove values that deviate too much. Assume the condition: values in column C greater than twice column C's standard deviation.

twice_std = df['C'].std() * 2
twice_std
0.5714973528631762
~(df['C'] > twice_std)
0       True
1       True
2      False
       ...
999    False
Name: C, Length: 1000, dtype: bool
df.loc[~(df['C'] > twice_std)]

(output: 559 rows × 3 columns)

  • Detect and filter missing values

    • dropna
    • fillna
  • Detect and filter duplicate values
    • drop_duplicates(keep)
  • Detect and filter outliers
    • Derive a condition that identifies outliers
    • Drop the rows corresponding to the outliers
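The three checks above can be chained into one cleaning pass. A minimal sketch on toy data (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame({'A': [1.0, np.nan, 3.0, 3.0, 100.0],
                'B': [5.0, 6.0, 7.0, 7.0, 8.0]})

df = df.fillna(df.mean())              # fill missing values per column (or use dropna())
df = df.drop_duplicates(keep='first')  # drop the repeated row

twice_std = df['A'].std() * 2          # same outlier condition as above
df = df.loc[~(df['A'] > twice_std)]    # drop rows where A exceeds twice its std
```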

4. Sorting

Sorting with the take() function

  • take() accepts a list of integer positions; df's rows are reordered to follow the order of the indices in the list
  • e.g. df.take([1,3,4,2,5])

np.random.permutation() can be used to produce a random ordering

np.random.permutation(1000)  # randomly permute the integers 0-999
array([956, 614, 993, 437, 371, 215, 579, 282, 301, 646, 893,   7, 441,
       ...
       216, 464, 139, 619, 588, 969, 176, 276, 142, 639, 765, 849])
# shuffle both the row order and the column order
df.take(indices=np.random.permutation(1000),axis=0).take(indices=np.random.permutation(3),axis=1)

(output: 1000 rows × 3 columns, row and column order shuffled)

  • np.random.permutation(x) generates a random permutation of the x integers 0 to x-1
df.take(indices=np.random.permutation(1000),axis=0).take(indices=np.random.permutation(3),axis=1)[0:5]

(output: 5 randomly chosen rows)

Random sampling

When the DataFrame is large enough, np.random.permutation(x) combined with take() implements random sampling directly.
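The sampling idea above in a minimal sketch: shuffle the row indices, then take the first n of them.

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame(data=np.random.random(size=(1000, 3)), columns=['A', 'B', 'C'])

# take 5 rows at random (without replacement) by slicing a shuffled index
sample = df.take(np.random.permutation(1000)[:5], axis=0)

# pandas also ships a built-in equivalent
sample2 = df.sample(n=5)
```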

5. Grouped Data Processing [Key Topic]

Data aggregation is the final step of data processing; it usually reduces each group to a single value.

Grouped data processing:

  • Split: first divide the data into groups
  • Apply: apply a (possibly different) function to each group to transform its data
  • Combine: merge the per-group results

The core of grouped data processing:

  • the groupby() function
  • the groups attribute shows the grouping
  • e.g. df.groupby(by='item').groups

Grouping

df = DataFrame({'item': ['Apple', 'Banana', 'Orange', 'Banana', 'Orange', 'Apple'],
                'price': [4, 3, 3, 2.5, 4, 2],
                'color': ['red', 'yellow', 'yellow', 'green', 'green', 'green'],
                'weight': [12, 20, 50, 30, 20, 44]})
df

    color    item  price  weight
0     red   Apple    4.0      12
1  yellow  Banana    3.0      20
2  yellow  Orange    3.0      50
3   green  Banana    2.5      30
4   green  Orange    4.0      20
5   green   Apple    2.0      44
  • Use groupby to perform the grouping
df.groupby(by='item',axis=0)
<pandas.core.groupby.DataFrameGroupBy object at 0x00000235AA6F6C18>
  • Use groups to view the grouping
# groupby groups the data but does not display the groups by itself
df.groupby(by='item',axis=0).groups
{'Apple': Int64Index([0, 5], dtype='int64'),
'Banana': Int64Index([1, 3], dtype='int64'),
'Orange': Int64Index([2, 4], dtype='int64')}
  • Aggregation after grouping: values in each group that can be computed on are computed; values that cannot are skipped
# add a new column to df containing each fruit's average price
df.groupby(by='item',axis=0).mean()['price']
item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64
mean_price_series = df.groupby(by='item',axis=0)['price'].mean()
mean_price_series
item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64
# Mapping table
dic = mean_price_series.to_dict()
df['mean_price'] = df['item'].map(dic)
df

    color    item  price  weight  mean_price
0     red   Apple    4.0      12        3.00
1  yellow  Banana    3.0      20        2.75
2  yellow  Orange    3.0      50        3.50
3   green  Banana    2.5      30        2.75
4   green  Orange    4.0      20        3.50
5   green   Apple    2.0      44        3.00

Compute the average price of apples

df

    color    item  price  weight
0     red   Apple    4.0      12
1  yellow  Banana    3.0      20
2  yellow  Orange    3.0      50
3   green  Banana    2.5      30
4   green  Orange    4.0      20
5   green   Apple    2.0      44

View the average price of the fruits of each color

Combine: merge each color's average price back into df

df['color_mean_price'] = df['color'].map(df.groupby(by='color')['price'].mean().to_dict())
df

    color    item  price  weight  mean_price  color_mean_price
0     red   Apple    4.0      12        3.00          4.000000
1  yellow  Banana    3.0      20        2.75          3.000000
2  yellow  Orange    3.0      50        3.50          3.000000
3   green  Banana    2.5      30        2.75          2.833333
4   green  Orange    4.0      20        3.50          2.833333
5   green   Apple    2.0      44        3.00          2.833333

6. Advanced Data Aggregation

After grouping with groupby, transform and apply can take a custom function to perform further computations

  • df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
  • Both transform and apply perform the computation; just pass the function in
  • transform and apply can also take a lambda expression
df.groupby(by='item')['price'].mean()
item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64
def my_mean(s):
    total = 0
    for i in s:
        total += i
    return total / len(s)
# Use apply to compute each fruit's average price
df.groupby(by='item')['price'].apply(my_mean)
item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64
# Use transform to compute each fruit's average price
df.groupby(by='item')['price'].transform(my_mean)
0    3.00
1    2.75
2    3.50
3    2.75
4    3.50
5    3.00
Name: price, dtype: float64
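Because transform returns one value per original row, aligned to df's index, its result can be assigned directly as a new column, avoiding the to_dict()/map() round trip used in section 5. A sketch using the same fruit data:

```python
import pandas as pd
from pandas import DataFrame

df = DataFrame({'item': ['Apple', 'Banana', 'Orange', 'Banana', 'Orange', 'Apple'],
                'price': [4, 3, 3, 2.5, 4, 2]})

# each group's mean is broadcast back to every row of that group
df['mean_price'] = df.groupby('item')['price'].transform('mean')
```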