Quartiles and the quantile function in pandas

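1. The concept of quantiles

Statistics has the concept of a quantile, usually parameterized by a value p. In principle p can be any value between 0 and 1; the quartiles are the best-known special case of the p-quantile.

Quartiles: sort the values in ascending order and split them into four equal parts. The values at the three cut points are the quartiles.

To keep things general, the calculations below work with the p-quantile. Setting p = 0.25, 0.5 and 0.75 gives the three quartiles.

First quartile (Q1), also called the "lower quartile": the value at the 25% position after sorting all sample values in ascending order.
Second quartile (Q2), also called the "median": the value at the 50% position after sorting all sample values in ascending order.
Third quartile (Q3), also called the "upper quartile": the value at the 75% position after sorting all sample values in ascending order.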

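2. Calculation method

1) Determine the position of the p-quantile. Here n is the number of values and pos is a 1-based position in the sorted data. There are two common conventions:

Method 1: pos = (n+1)*p

Method 2: pos = 1+(n-1)*p (this is the one pandas uses)

2) Compute the quantile value. There are generally five ways to do this; in the pandas quantile function they are selected with the interpolation parameter (see below).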
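The two position formulas can be compared side by side with a minimal sketch. The function names and the sample size n = 6 are illustrative choices, not part of pandas.

```python
def position_method1(n: int, p: float) -> float:
    """1-based position of the p-quantile using pos = (n + 1) * p."""
    return (n + 1) * p


def position_method2(n: int, p: float) -> float:
    """1-based position of the p-quantile using pos = 1 + (n - 1) * p."""
    return 1 + (n - 1) * p


n = 6  # illustrative sample size
for p in (0.25, 0.5, 0.75):
    print(p, position_method1(n, p), position_method2(n, p))
```

Both methods agree on the median, but they place Q1 and Q3 differently. Method 2 always stays within 1..n, while method 1 can fall outside that range when p is close to 0 or 1.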

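3. The quantile function

The quantile function in pandas makes it easy to compute quantiles.

DataFrame.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear')

(This is the signature from older pandas releases; since pandas 2.0, numeric_only defaults to False.)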

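Commonly used parameters:

q : a number or a list-like of numbers, each between 0 and 1. Defaults to 0.5, i.e. the median (second quartile).

axis : the direction to compute along, one of {0, 1, 'index', 'columns'}. Defaults to 0.

interpolation : the interpolation method, one of {'linear', 'lower', 'higher', 'midpoint', 'nearest'}. Defaults to 'linear'.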

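The five interpolation methods work as follows when the selected quantile position falls between two data points i and j:

linear: i + (j - i) * fraction, where fraction is the fractional part of the computed pos (see the example below);
lower: i.
higher: j.
nearest: i or j, whichever is nearest.
midpoint: (i + j) / 2.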
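The five rules can be written out as a small pure-Python sketch. The function interpolate and the neighbours 4 and 5 with fraction 0.25 are hypothetical, chosen only for illustration. This is not how pandas implements the rules internally, and the handling of an exact tie in 'nearest' is an arbitrary choice of the sketch.

```python
def interpolate(i: float, j: float, fraction: float, method: str = "linear") -> float:
    """Combine neighbouring sorted values i <= j according to the named rule."""
    if method == "linear":
        return i + (j - i) * fraction
    if method == "lower":
        return i
    if method == "higher":
        return j
    if method == "midpoint":
        return (i + j) / 2
    if method == "nearest":
        # Tie handling (fraction == 0.5) is an arbitrary choice of this sketch.
        return i if fraction <= 0.5 else j
    raise ValueError(f"unknown interpolation method: {method!r}")


# Hypothetical neighbours: the position lies a quarter of the way from 4 to 5.
for name in ("linear", "lower", "higher", "midpoint", "nearest"):
    print(name, interpolate(4, 5, 0.25, name))
```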

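Example

import pandas as pd

# Load the sample data (path as used in the original post)
df=pd.read_csv('data/练习.csv')

# Display the rows ordered by Height; this returns a sorted copy and leaves df unchanged
df.sort_values("Height")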

index	(column header not shown)	Height
0	1101	2
3	1201	4
2	1103	5
1	1102	7
4	1203	8
5	1205	12
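The CSV file itself is not included in the post, so to follow along, a frame with the same values can be built in memory. This is a sketch under assumptions: "ID" is a placeholder name for the column whose header is not shown, and only "Height" is known from the post.

```python
import pandas as pd

# Values copied from the sorted table above, listed in original index order.
# "ID" is a placeholder column name; "Height" is the column used in the post.
df = pd.DataFrame(
    {
        "ID": [1101, 1102, 1103, 1201, 1203, 1205],
        "Height": [2, 7, 5, 4, 8, 12],
    }
)

print(df.sort_values("Height"))
print(df["Height"].quantile())  # median of the six heights
```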

The q parameter defaults to 0.5 (the median)

df['Height'].quantile()

6.0

The different methods of the interpolation parameter

df['Height'].quantile(q=0.5,interpolation="linear")

6.0

df['Height'].quantile(q=0.5,interpolation="lower")

df['Height'].quantile(q=0.5,interpolation="higher")

df['Height'].quantile(q=0.5,interpolation="midpoint")

6.0

df['Height'].quantile(q=0.5,interpolation="nearest")

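Explanation: df['Height'] contains 6 values, so the median sits at pos = 1+(6-1)*0.5 = 3.5. That position lies between the 3rd and 4th sorted values, 5 and 7, so i=5, j=7, fraction=0.5.

linear: i + (j - i) * fraction = 5+(7-5)*0.5 = 6
lower: i = 5
higher: j = 7
midpoint: (i+j)/2 = (5+7)/2 = 6
nearest: 5 (I haven't fully worked this one out. With fraction exactly 0.5 both neighbours are equally near, so some tie-breaking rule must apply; my guess is that the position is rounded to the nearest integer.)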
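One tie-breaking rule that is consistent with the 5 reported for 'nearest' is rounding the 0-based position half-to-even. The sketch below uses Python's built-in round, which rounds halves to the even integer, as a stand-in. This is a guess about the behaviour, not a statement of how pandas is implemented, and nearest_quantile is a hypothetical helper.

```python
def nearest_quantile(values, p):
    """'nearest' rule, assuming the 0-based position is rounded half-to-even."""
    data = sorted(values)
    position = (len(data) - 1) * p  # 0-based version of pos = 1 + (n - 1) * p
    index = round(position)         # Python's round() sends x.5 to the even integer
    return data[index]


print(nearest_quantile([2, 4, 5, 7, 8, 12], 0.5))  # position 2.5 rounds to index 2
print(nearest_quantile([1, 2, 3, 4], 0.5))         # position 1.5 rounds to index 2
```

Under this assumption a tie does not always resolve to the lower neighbour: in the second call the even index is the upper one.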

Passing a list as q to compute the quartiles

df['Height'].quantile([0.25,0.5,0.75])

0.25    4.25
0.50    6.00
0.75    7.75
Name: Height, dtype: float64
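The three values above can be reproduced by hand from pos = 1+(n-1)*p with linear interpolation. The sketch below is a pure-Python illustration (quantile_linear is a hypothetical helper, not a pandas function), with the standard library's statistics.quantiles as a cross-check for this sample.

```python
import math
from statistics import quantiles


def quantile_linear(values, p):
    """p-quantile via pos = 1 + (n - 1) * p and linear interpolation."""
    data = sorted(values)
    pos = 1 + (len(data) - 1) * p        # 1-based position
    lower = math.floor(pos)              # 1-based rank of i
    fraction = pos - lower
    i = data[lower - 1]
    j = data[min(lower, len(data) - 1)]  # clamp so p = 1 stays in range
    return i + (j - i) * fraction


heights = [2, 4, 5, 7, 8, 12]
print([quantile_linear(heights, p) for p in (0.25, 0.5, 0.75)])

# Cross-check with the standard library (Python 3.8+); the "inclusive"
# method gives the same cut points for this sample.
print(quantiles(heights, n=4, method="inclusive"))
```

For Q1, pos = 2.25 lies between the 2nd and 3rd sorted values (4 and 5), giving 4 + (5-4)*0.25 = 4.25. For Q3, pos = 4.75 lies between 7 and 8, giving 7 + (8-7)*0.75 = 7.75.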

05-27 10:10