我有一个从CSV定义的数据框,想计算基本摘要统计信息,例如所有模型的训练部分的均值,方差...

插入型号并按其分组会很好-但这似乎不是一个好的解决方案。
我如何获取每个模型的摘要统计信息(仅用于培训),因为group_by modelName由于计数器不起作用。

df.groupby(['modelName', 'typeOfRun'])['kappa'].mean()


要么

df[df.typeOfRun != 'validation'].describe()


无法产生预期的结果。
python -  Pandas 模糊群汇总统计-LMLPHP

AUC_R,Accuracy,Error rate,False negative rate,False positive rate,Lift value,Precision J,Precision N,Rate of negative predictions,Rate of positive predictions,Sensitivity (true positives rate),Specificity (true negatives rate),f1_R,kappa,modelName,typeOfRun
0.7747622323007851,0.7182416731216111,0.28175832687838887,0.16519823788546256,0.28527729751296715,2.769918376242967,0.08117369886485329,0.9930703132218424,0.029305447973147433,0.3013813581203202,0.8348017621145375,0.7147227024870328,0.8312130234716368,0.09987857210248623,00_testing_1-training,training
0.7688154033277225,0.7295055512522592,0.27049444874774076,0.1894273127753304,0.27294188056922464,2.807689674786938,0.08228060368921185,0.9921956531603068,0.029305447973147433,0.28869739220242707,0.8105726872246696,0.7270581194307754,0.8391825769931881,0.10159217699431862,00_testing_2-training,training
0.7653761718477654,0.7217918925897238,0.2782081074102763,0.1883259911894273,0.2809216651150419,2.737743031677203,0.08023078597866318,0.9921552436003304,0.029305447973147433,0.29647560030983733,0.8116740088105727,0.7190783348849581,0.8338281219878937,0.09791120175612114,00_testing_3-training,training
0.7666987721022418,0.7202566535628756,0.2797433464371244,0.18396711202466598,0.2826353437708505,2.7358921138891255,0.08018987022168358,0.9923159476282464,0.02931031885891585,0.2982693958700465,0.816032887975334,0.7173646562291496,0.8327314318650539,0.097878484924986,00_testing-validation,validation
0.7776426005660843,0.7300542215336948,0.2699457784663052,0.17180616740088106,0.2729086314669504,2.8639238514789174,0.08392857142857142,0.9929168180167091,0.029305447973147433,0.28918151303898787,0.8281938325991189,0.7270913685330496,0.8394625719769673,0.10476961017159536,01_otherSet_1-training,training
0.7691501646636157,0.737412858249419,0.26258714175058095,0.197136563876652,0.2645631067961165,2.8639098209585327,0.08392816025788626,0.9919723742039644,0.029305447973147433,0.2803382390911438,0.802863436123348,0.7354368932038835,0.8446557452170924,0.1044486077353842,01_otherSet_2-training,training
0.770174515310113,0.7342176607281178,0.2657823392718823,0.19162995594713655,0.26802101343263735,2.847815513920855,0.08345650938032974,0.9921582766235522,0.029305447973147433,0.283856183836819,0.8083700440528634,0.7319789865673627,0.8424375777288816,0.10367514449353035,01_otherSet_3-training,training
0.7676347850606817,0.7317488289428102,0.26825117105718976,0.19424460431654678,0.2704858255620898,2.8156062097690264,0.08252631578947368,0.9920241385858671,0.02931031885891585,0.2861747473378218,0.8057553956834532,0.7295141744379102,0.8407546494992847,0.10196584743637081,01_otherSet-validation,validation

最佳答案

IIUC您可以使用DataFrameGroupBy.describe

print (df.groupby(['modelName', 'typeOfRun']).describe())

                                             f1_R     kappa
modelName              typeOfRun
00_testing-validation  validation count  1.000000  1.000000
                                  mean   0.832731  0.097878
                                  std         NaN       NaN
                                  min    0.832731  0.097878
                                  25%    0.832731  0.097878
                                  50%    0.832731  0.097878
                                  75%    0.832731  0.097878
                                  max    0.832731  0.097878
00_testing_1-training  training   count  1.000000  1.000000
                                  mean   0.831213  0.099879
                                  std         NaN       NaN
                                  min    0.831213  0.099879
                                  25%    0.831213  0.099879
                                  50%    0.831213  0.099879
                                  75%    0.831213  0.099879
                                  max    0.831213  0.099879
00_testing_2-training  training   count  1.000000  1.000000
                                  mean   0.839183  0.101592
                                  std         NaN       NaN
...
...


您可以通过groupby创建的Series通过split并通过str[0]选择列表的第一项:

print (df.modelName.str.split('_').str[0])
0    00
1    00
2    00
3    00
4    01
5    01
6    01
7    01
Name: modelName, dtype: object

print (df.groupby([df.modelName.str.split('_').str[0]]).describe())
                    AUC_R  Accuracy  Error;rate  False;negative;rate  \
modelName
00        count  4.000000  4.000000    4.000000             4.000000
          mean   0.768913  0.722449    0.277551             0.181730
          std    0.004149  0.004924    0.004924             0.011270
          min    0.765376  0.718242    0.270494             0.165198
          25%    0.766368  0.719753    0.276280             0.179275
          50%    0.767757  0.721024    0.278976             0.186147
          75%    0.770302  0.723720    0.280247             0.188601
          max    0.774762  0.729506    0.281758             0.189427
01        count  4.000000  4.000000    4.000000             4.000000
          mean   0.771151  0.733358    0.266642             0.188704
          std    0.004452  0.003198    0.003198             0.011488
          min    0.767635  0.730054    0.262587             0.171806
          25%    0.768771  0.731325    0.264984             0.186674
          50%    0.769662  0.732983    0.267017             0.192937
          75%    0.772042  0.735016    0.268675             0.194968
          max    0.777643  0.737413    0.269946             0.197137
          ...
          ...

关于python - Pandas 模糊群汇总统计,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/39890417/

10-16 01:01