I have some data, described below.
The output of df.info() is:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6662 entries, 0 to 6661
Data columns (total 2 columns):
value 6662 non-null float64
country 6478 non-null object
dtypes: float64(1), object(1)
memory usage: 156.1+ KB
None
list of the columns,
[u'value' 'country']
the df is below,
value country
0 550.00 USA
1 118.65 CHINA
2 120.82 CHINA
3 86.82 CHINA
4 112.14 CHINA
5 113.59 CHINA
6 114.31 CHINA
7 111.42 CHINA
8 117.21 CHINA
9 111.42 CHINA
--------------------
--------------------
6655 500.00 USA
6656 500.00 USA
6657 390.00 USA
6658 450.00 USA
6659 420.00 USA
6660 420.00 USA
6661 450.00 USA
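I need to add another column, outlier, and set it to 1 if the value is an outlier for its country and to 0 otherwise. I emphasize that outlier must be computed per country, not across all countries together. I found some formulas for computing outliers, for example: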
import numpy as np
import pandas as pd

# keep only the rows that are within +3 to -3 standard deviations of the mean
def exclude_the_outliers(df):
    df = df[np.abs(df.col - df.col.mean()) <= (3 * df.col.std())]
    return df

def exclude_the_outliers_extra(df):
    LOWER_LIMIT = .35
    HIGHER_LIMIT = .70
    filt_df = df.loc[:, df.columns == 'value']
    # Then, computing percentiles.
    quant_df = filt_df.quantile([LOWER_LIMIT, HIGHER_LIMIT])
    # Next filtering values based on computed percentiles. To do that I use
    # an apply by columns and that's it !
    filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[LOWER_LIMIT, x.name]) &
                                        (x < quant_df.loc[HIGHER_LIMIT, x.name])], axis=0)
    filt_df = pd.concat([df.loc[:, df.columns != 'value'], filt_df], axis=1)
    filt_df.dropna(inplace=True)
    # return the filtered frame rather than the input
    return filt_df
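I could not apply these formulas correctly for this purpose; I include them only as suggestions.
Finally, I need to compute the percentage of outliers for the USA and CHINA values shown in the data.
How can I achieve this?
Note: filling the outlier column with zeros is easy in pandas; it would look like this: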
df['outlier'] = 0
However, the problem remains how to find the outliers and overwrite the zeros with 1 for each country.
Best answer
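You can slice the dataframe by country, compute the quantiles for each slice, and set the value of outlier at the row indices belonging to that country.
There may be a way to do this without iterating, but that is beyond me.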
# using True/False for the outlier, it is the same as 1/0
df['outlier'] = False
# set the quantile limits
low_q = 0.35
high_q = 0.7
# iterate over each country
for c in df.country.unique():
    # subset the dataframe where the country = c, get the quantiles
    q = df.value[df.country==c].quantile([low_q, high_q])
    # at the row index where the country column equals `c` and the column is `outlier`
    # set the value to be true or false based on if the `value` column is within
    # the quantiles
    df.loc[df.index[df.country==c], 'outlier'] = (df.value[df.country==c]
        .apply(lambda x: x<q[low_q] or x>q[high_q]))
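The loop above can probably be avoided with groupby and transform. What follows is only a minimal sketch, assuming the same 0.35/0.70 limits and a small made-up frame in place of the real data; the toy values and the names grouped, lower and upper are illustrative, not from the question. How rows with a missing country are treated may depend on the pandas version, so it is worth checking them separately.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's dataframe
df = pd.DataFrame({
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
    "country": ["USA"] * 5 + ["CHINA"] * 5,
})

low_q, high_q = 0.35, 0.70

# Per-country quantiles, broadcast back to one value per row
grouped = df.groupby("country")["value"]
lower = grouped.transform(lambda s: s.quantile(low_q))
upper = grouped.transform(lambda s: s.quantile(high_q))

# 1 where the value falls outside its own country's quantile band, else 0
df["outlier"] = ((df["value"] < lower) | (df["value"] > upper)).astype(int)

print(df)
```

Because transform returns a series aligned with the original index, the comparison is done row by row without slicing the frame for each country.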
Edit: to get the percentage of outliers per country, you can group by the country column and aggregate with the mean.
gb = df[['country','outlier']].groupby('country').mean()
for row in gb.itertuples():
    print('Percentage of outliers for {: <12}: {:.1f}%'.format(row[0], 100*row[1]))
# output:
# Percentage of outliers for China : 54.0%
# Percentage of outliers for USA : 56.0%
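The same group-and-mean step should work with other per-country rules as well. As a sketch only, here is the three-standard-deviation rule mentioned in the question applied per country, followed by the percentage calculation; the toy data and the names mean, std and pct are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: one extreme USA value among eleven identical ones
df = pd.DataFrame({
    "value": [100.0] * 11 + [1000.0] + [10.0, 11.0, 12.0, 13.0],
    "country": ["USA"] * 12 + ["CHINA"] * 4,
})

# Per-country mean and sample standard deviation, aligned to each row
grouped = df.groupby("country")["value"]
mean = grouped.transform("mean")
std = grouped.transform("std")

# 1 where the value is more than 3 standard deviations from its country's mean
df["outlier"] = ((df["value"] - mean).abs() > 3 * std).astype(int)

# The mean of a 0/1 column is the fraction of ones, so scale to a percentage
pct = df.groupby("country")["outlier"].mean() * 100
print(pct)
```

Note that with very small groups a single extreme value cannot exceed three sample standard deviations, so this rule only makes sense when each country has enough rows.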
Regarding "python - How to modify a Pandas DataFrame and insert a new column", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42678806/