I have some data; the details are provided below.

The output of df.info() is below,

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6662 entries, 0 to 6661
Data columns (total 2 columns):
value      6662 non-null float64
country    6478 non-null object
dtypes: float64(1), object(1)
memory usage: 156.1+ KB
None


list of the columns,
[u'value' 'country']


the df is below,

        value country
0     550.00     USA
1     118.65   CHINA
2     120.82   CHINA
3      86.82   CHINA
4     112.14   CHINA
5     113.59   CHINA
6     114.31   CHINA
7     111.42   CHINA
8     117.21   CHINA
9     111.42   CHINA

--------------------
--------------------
6655  500.00     USA
6656  500.00     USA
6657  390.00     USA
6658  450.00     USA
6659  420.00     USA
6660  420.00     USA
6661  450.00     USA


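I need to add another column, outlier, and set it to 1
if the value is an outlier for its country,
and to 0 otherwise. I stress that outliers must be computed for the respective country, not across all countries.

I found some formulas for computing outliers, for example,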

import numpy as np
import pandas as pd

# keep only the rows within +3 to -3 standard deviations of the column mean
def exclude_the_outliers(df):
    df = df[np.abs(df.col - df.col.mean()) <= (3 * df.col.std())]
    return df


def exclude_the_outliers_extra(df):

    LOWER_LIMIT = .35
    HIGHER_LIMIT = .70

    filt_df = df.loc[:, df.columns == 'value']

    # Then, computing percentiles.
    quant_df = filt_df.quantile([LOWER_LIMIT, HIGHER_LIMIT])

    # Next filtering values based on computed percentiles. To do that I use
    # an apply by columns and that's it !
    filt_df = filt_df.apply(lambda x: x[(x>quant_df.loc[LOWER_LIMIT,x.name]) &
                                        (x < quant_df.loc[HIGHER_LIMIT,x.name])], axis=0)

    filt_df = pd.concat([df.loc[:, df.columns !=  'value'], filt_df], axis=1)
    filt_df.dropna(inplace=True)
    return filt_df


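I could not get these formulas to work for this purpose; I include them only as suggestions.
Finally, I need to compute the percentage of outliers
for the USA and China as present in the data.

How can I achieve that?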

Note: filling the outlier column with all zeros is easy in
pandas; it should look like this,

df['outlier'] = 0


What I cannot work out is how to find the outliers and overwrite
those zeros with 1 for the respective country.

Best answer

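You can slice the dataframe by each country, compute the quantiles for the slice, and set the value of outlier at that country's row indices.

There may be a way to do this without iterating, but that is beyond me.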

# using True/False for the outlier, it is the same as 1/0
df['outlier'] = False

# set the quantile limits
low_q = 0.35
high_q = 0.7

# iterate over each country
for c in df.country.unique():
    # subset the dataframe where the country = c, get the quantiles
    q = df.value[df.country==c].quantile([low_q, high_q])
    # at the row index where the country column equals `c` and the column is `outlier`
    # set the value to be true or false based on if the `value` column is within
    # the quantiles
    df.loc[df.index[df.country==c], 'outlier'] = (df.value[df.country==c]
        .apply(lambda x: x<q[low_q] or x>q[high_q]))
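
The answer leaves open whether the loop can be avoided. One possible loop-free sketch uses groupby(...).transform to broadcast each country's quantiles back onto the rows. The function name flag_quantile_outliers and the small sample frame are hypothetical, made up for illustration; the 0.35 and 0.7 limits are the ones used above.

```python
import pandas as pd

def flag_quantile_outliers(df, low_q=0.35, high_q=0.7):
    """Return a boolean Series that is True where `value` lies outside
    the [low_q, high_q] quantile band of its own country."""
    grouped = df.groupby('country')['value']
    # transform broadcasts the per-country scalar back to every row of that country
    low = grouped.transform(lambda s: s.quantile(low_q))
    high = grouped.transform(lambda s: s.quantile(high_q))
    return (df['value'] < low) | (df['value'] > high)

# Hypothetical sample data, not the asker's real frame
sample = pd.DataFrame({
    'value': [550.0, 118.65, 120.82, 86.82, 112.14, 500.0, 390.0, 450.0],
    'country': ['USA', 'CHINA', 'CHINA', 'CHINA', 'CHINA', 'USA', 'USA', 'USA'],
})
sample['outlier'] = flag_quantile_outliers(sample)
print(sample)
```

With limits this narrow, only the values between the 35th and 70th percentile of each country are kept as non-outliers, so most rows end up flagged. Use .astype(int) on the result if 1/0 is preferred over True/False.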


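Edit: to get the percentage of outliers for each country, you can group by the country column and aggregate with the mean. Because True counts as 1 and False as 0, the mean of the outlier column is the fraction of outliers in each group.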

gb = df[['country','outlier']].groupby('country').mean()
for row in gb.itertuples():
    print('Percentage of outliers for {: <12}: {:.1f}%'.format(row[0], 100*row[1]))

# output:
# Percentage of outliers for China       : 54.0%
# Percentage of outliers for USA         : 56.0%
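
The question also mentions the three-standard-deviation rule. Applied per country rather than to the whole column, it could be sketched as follows. This is a swapped-in criterion, not the quantile rule the answer uses; the function name flag_sigma_outliers and the sample values are hypothetical.

```python
import pandas as pd

def flag_sigma_outliers(df, n_sigmas=3.0):
    """Return 1 where `value` is more than n_sigmas standard deviations
    away from the mean of its own country, else 0."""
    grouped = df.groupby('country')['value']
    mean = grouped.transform('mean')   # per-country mean, aligned to rows
    std = grouped.transform('std')     # per-country sample standard deviation
    return ((df['value'] - mean).abs() > n_sigmas * std).astype(int)

# Hypothetical sample: one extreme CHINA value among nineteen identical ones
sample = pd.DataFrame({
    'value': [100.0] * 19 + [1000.0] + [500.0, 450.0, 420.0, 390.0],
    'country': ['CHINA'] * 20 + ['USA'] * 4,
})
sample['outlier'] = flag_sigma_outliers(sample)

# Percentage of outliers per country: mean of the 0/1 flags times 100
pct = sample.groupby('country')['outlier'].mean() * 100
print(pct)
```

Note that with very small groups a single extreme value can never exceed three sample standard deviations, so this rule only becomes meaningful once each country has a reasonable number of rows.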

Regarding python - how to modify a Pandas DataFrame and insert a new column, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/42678806/

10-11 21:50