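I have a data frame loaded from a CSV and want to compute basic summary statistics (mean, variance, ...) for the training part of all models.
It would be nice to insert a model number and group by that, but that does not seem like a good solution.
How can I get summary statistics per model (training runs only)? Grouping by modelName does not work because of the counter embedded in the name.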
df.groupby(['modelName', 'typeOfRun'])['kappa'].mean()
or
df[df.typeOfRun != 'validation'].describe()
does not produce the expected result. The data:
AUC_R,Accuracy,Error rate,False negative rate,False positive rate,Lift value,Precision J,Precision N,Rate of negative predictions,Rate of positive predictions,Sensitivity (true positives rate),Specificity (true negatives rate),f1_R,kappa,modelName,typeOfRun
0.7747622323007851,0.7182416731216111,0.28175832687838887,0.16519823788546256,0.28527729751296715,2.769918376242967,0.08117369886485329,0.9930703132218424,0.029305447973147433,0.3013813581203202,0.8348017621145375,0.7147227024870328,0.8312130234716368,0.09987857210248623,00_testing_1-training,training
0.7688154033277225,0.7295055512522592,0.27049444874774076,0.1894273127753304,0.27294188056922464,2.807689674786938,0.08228060368921185,0.9921956531603068,0.029305447973147433,0.28869739220242707,0.8105726872246696,0.7270581194307754,0.8391825769931881,0.10159217699431862,00_testing_2-training,training
0.7653761718477654,0.7217918925897238,0.2782081074102763,0.1883259911894273,0.2809216651150419,2.737743031677203,0.08023078597866318,0.9921552436003304,0.029305447973147433,0.29647560030983733,0.8116740088105727,0.7190783348849581,0.8338281219878937,0.09791120175612114,00_testing_3-training,training
0.7666987721022418,0.7202566535628756,0.2797433464371244,0.18396711202466598,0.2826353437708505,2.7358921138891255,0.08018987022168358,0.9923159476282464,0.02931031885891585,0.2982693958700465,0.816032887975334,0.7173646562291496,0.8327314318650539,0.097878484924986,00_testing-validation,validation
0.7776426005660843,0.7300542215336948,0.2699457784663052,0.17180616740088106,0.2729086314669504,2.8639238514789174,0.08392857142857142,0.9929168180167091,0.029305447973147433,0.28918151303898787,0.8281938325991189,0.7270913685330496,0.8394625719769673,0.10476961017159536,01_otherSet_1-training,training
0.7691501646636157,0.737412858249419,0.26258714175058095,0.197136563876652,0.2645631067961165,2.8639098209585327,0.08392816025788626,0.9919723742039644,0.029305447973147433,0.2803382390911438,0.802863436123348,0.7354368932038835,0.8446557452170924,0.1044486077353842,01_otherSet_2-training,training
0.770174515310113,0.7342176607281178,0.2657823392718823,0.19162995594713655,0.26802101343263735,2.847815513920855,0.08345650938032974,0.9921582766235522,0.029305447973147433,0.283856183836819,0.8083700440528634,0.7319789865673627,0.8424375777288816,0.10367514449353035,01_otherSet_3-training,training
0.7676347850606817,0.7317488289428102,0.26825117105718976,0.19424460431654678,0.2704858255620898,2.8156062097690264,0.08252631578947368,0.9920241385858671,0.02931031885891585,0.2861747473378218,0.8057553956834532,0.7295141744379102,0.8407546494992847,0.10196584743637081,01_otherSet-validation,validation
Best answer
IIUC, you can use DataFrameGroupBy.describe:
print (df.groupby(['modelName', 'typeOfRun']).describe())
                                            f1_R     kappa
modelName             typeOfRun
00_testing-validation validation count  1.000000  1.000000
                                 mean   0.832731  0.097878
                                 std         NaN       NaN
                                 min    0.832731  0.097878
                                 25%    0.832731  0.097878
                                 50%    0.832731  0.097878
                                 75%    0.832731  0.097878
                                 max    0.832731  0.097878
00_testing_1-training training   count  1.000000  1.000000
                                 mean   0.831213  0.099879
                                 std         NaN       NaN
                                 min    0.831213  0.099879
                                 25%    0.831213  0.099879
                                 50%    0.831213  0.099879
                                 75%    0.831213  0.099879
                                 max    0.831213  0.099879
00_testing_2-training training   count  1.000000  1.000000
                                 mean   0.839183  0.101592
                                 std         NaN       NaN
...
...
You can also group by a Series built from modelName: split the name with str.split and select the first item of each resulting list with str[0]:
print (df.modelName.str.split('_').str[0])
0    00
1    00
2    00
3    00
4    01
5    01
6    01
7    01
Name: modelName, dtype: object
print (df.groupby([df.modelName.str.split('_').str[0]]).describe())
                    AUC_R  Accuracy  Error rate  False negative rate  \
modelName
00        count  4.000000  4.000000    4.000000             4.000000
          mean   0.768913  0.722449    0.277551             0.181730
          std    0.004149  0.004924    0.004924             0.011270
          min    0.765376  0.718242    0.270494             0.165198
          25%    0.766368  0.719753    0.276280             0.179275
          50%    0.767757  0.721024    0.278976             0.186147
          75%    0.770302  0.723720    0.280247             0.188601
          max    0.774762  0.729506    0.281758             0.189427
01        count  4.000000  4.000000    4.000000             4.000000
          mean   0.771151  0.733358    0.266642             0.188704
          std    0.004452  0.003198    0.003198             0.011488
          min    0.767635  0.730054    0.262587             0.171806
          25%    0.768771  0.731325    0.264984             0.186674
          50%    0.769662  0.732983    0.267017             0.192937
          75%    0.772042  0.735016    0.268675             0.194968
          max    0.777643  0.737413    0.269946             0.197137
...
...
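Note that the count of 4 per group above means the validation rows are included. Since the question asks for training runs only, here is a minimal sketch of one way to combine the filter with the derived grouping key. It is not part of the original answer: the regular expression, the `model` key name, and the reduced two-column CSV (f1_R and kappa copied from the data above) are assumptions made for illustration.

```python
import io

import pandas as pd

# Reduced copy of the question's data: only f1_R, kappa and the two label columns.
csv_text = """f1_R,kappa,modelName,typeOfRun
0.8312130234716368,0.09987857210248623,00_testing_1-training,training
0.8391825769931881,0.10159217699431862,00_testing_2-training,training
0.8338281219878937,0.09791120175612114,00_testing_3-training,training
0.8327314318650539,0.097878484924986,00_testing-validation,validation
0.8394625719769673,0.10476961017159536,01_otherSet_1-training,training
0.8446557452170924,0.1044486077353842,01_otherSet_2-training,training
0.8424375777288816,0.10367514449353035,01_otherSet_3-training,training
0.8407546494992847,0.10196584743637081,01_otherSet-validation,validation
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the training runs.
training = df[df["typeOfRun"] == "training"]

# Assumed naming scheme: "<model>[_<counter>]-<typeOfRun>".
# Strip the optional counter and the run-type suffix to recover the model name.
model = (
    training["modelName"]
    .str.replace(r"(_\d+)?-(training|validation)$", "", regex=True)
    .rename("model")
)

# Group by the derived key and compute the requested statistics.
summary = training.groupby(model)[["f1_R", "kappa"]].agg(["mean", "var", "count"])
print(summary)
```

Grouping by a derived Series leaves the original modelName column untouched. If the names follow a different pattern than assumed here, only the regular expression needs to change.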
Source: "python - Pandas fuzzy group summary statistics", a similar question found on Stack Overflow: https://stackoverflow.com/questions/39890417/