Problem description
I need to calculate a standardized mean (z-score) from a monthly time series, but I also need to exclude the "incomplete" years (those with fewer than 12 months) from the calculation.
NumPy/SciPy "working" version:
import numpy as np
import scipy.stats as sts
url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
npdata = np.genfromtxt(url, skip_header=1)
unique_enso_year = [int(value) for value in set(npdata[:, 0])]
nin34 = np.zeros(len(unique_enso_year))
for ind, year in enumerate(unique_enso_year):
    indexes = np.flatnonzero(npdata[:, 0]==year)
    if len(indexes) == 12:
        nin34[ind] = np.mean(npdata[indexes, 9])
    else:
        nin34[ind] = np.nan
nin34x = (nin34 - sts.nanmean(nin34)) / sts.nanstd(nin34)
array([[ 1.02250000e+00, 5.15000000e-01, -6.73333333e-01,
-7.02500000e-01, 1.16666667e-01, 1.32916667e+00,
-1.10333333e+00, -8.11666667e-01, 1.51666667e-01,
6.42500000e-01, 6.49166667e-01, 3.71666667e-01,
4.05000000e-01, -1.98333333e-01, -4.79166667e-01,
1.24666667e+00, -1.44166667e-01, -1.18166667e+00,
-8.89166667e-01, -2.51666667e-01, 7.36666667e-01,
3.02500000e-01, 3.83333333e-01, 1.19166667e-01,
1.70833333e-01, -5.25000000e-01, -7.35000000e-01,
3.75000000e-01, -4.50833333e-01, -8.30000000e-01,
-1.41666667e-02, nan]])
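For readers on a recent stack: as far as I know, `scipy.stats.nanmean` and `scipy.stats.nanstd` have been removed from SciPy, and NumPy's `np.nanmean` and `np.nanstd` cover the same ground. The following is only a sketch of the same idea without the explicit loop. It runs on synthetic data rather than the NOAA file, and the two-column layout and the variable names are assumptions made for illustration. `ddof=1` is used on the assumption that the old `sts.nanstd` default was the unbiased estimate.

```python
import numpy as np

# Synthetic stand-in for the downloaded table (an assumption for illustration):
# column 0 is the year, column 1 is the monthly value.
# 2010 and 2011 are complete, 2012 has only 5 months.
years = np.repeat([2010, 2011, 2012], [12, 12, 5]).astype(float)
values = np.arange(years.size, dtype=float)
npdata = np.column_stack([years, values])

# Sorted unique years, the group index of every row, and the row count per year.
uniq, inverse, counts = np.unique(npdata[:, 0], return_inverse=True, return_counts=True)

# Per-year sums, turned into annual means; incomplete years become NaN.
sums = np.bincount(inverse, weights=npdata[:, 1])
annual = np.where(counts == 12, sums / counts, np.nan)

# NaN-aware standardization of the annual means.
z = (annual - np.nanmean(annual)) / np.nanstd(annual, ddof=1)
print(uniq.astype(int), z)
```

Unlike iterating over a `set`, `np.unique` returns the years in sorted order, so the position of each value in `z` matches the position of its year in `uniq`.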
pandas attempt:
import pandas as pd
from datetime import datetime
def parse(yr, mon):
    date = datetime(year=int(yr), day=2, month=int(mon))
    return date
url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
data = pd.read_table(url, sep=' ', header=0, skiprows=0, parse_dates = [['YR', 'MON']], skipinitialspace=True, index_col=0, date_parser=parse)
grouped = data.groupby(lambda x: x.year)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = grouped.transform(zscore)
print transformed['ANOM.3']
YR_MON
1982-01-02 -0.986922
1982-02-02 -1.179216
1982-03-02 -1.179216
1982-04-02 -0.885119
1982-05-02 -0.376105
1982-06-02 0.087664
1982-07-02 -0.161188
1982-08-02 0.098975
1982-09-02 0.415695
1982-10-02 1.049134
1982-11-02 1.286674
1982-12-02 1.829622
1983-01-02 1.715072
1983-02-02 1.428598
1983-03-02 0.976272
...
2012-03-02 -0.999284
2012-04-02 -0.663736
2012-05-02 -0.063283
2012-06-02 0.572491
2012-07-02 0.961020
2012-08-02 1.314227
2012-09-02 0.925699
2012-10-02 0.537170
2012-11-02 0.660793
2012-12-02 -0.169245
2013-01-02 -1.001483
2013-02-02 -0.924445
2013-03-02 0.462223
2013-04-02 1.386668
2013-05-02 0.077037
Name: ANOM.3, Length: 377, dtype: float64
This is not what I want, because it also counts 2013 (which has only 5 months).
To extract what I want, I need to do something like:
(grouped.mean()['ANOM.3'][:-1] - sts.nanmean(grouped.mean()['ANOM.3'][:-1])) / sts.nanstd(grouped.mean()['ANOM.3'][:-1])
But this assumes I already know that the last year is incomplete, and I then lose the np.nan where the 2013 value should be.
So I am now trying to write a query in pandas like:
grouped2 = data.groupby(lambda x: x.year).apply(lambda sdf: sdf if len(sdf) > 11 else None).reset_index(drop=True)
That gives me the "right values", but it produces a new DataFrame without the timestamp index. I'm sure there is a simple and elegant way to do it. Thanks for any help!
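One possible way to keep the timestamp index is `groupby(...).filter(...)`, which returns the rows of the groups that pass a test and leaves the original index untouched. This is a minimal sketch on synthetic data, not the NOAA file. The frame `data`, its values, and the month-start dates are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data (an assumption for illustration):
# 2010 and 2011 are complete, 2012 has only 5 months.
idx = pd.date_range("2010-01-01", periods=29, freq="MS")
data = pd.DataFrame({"ANOM.3": np.arange(29, dtype=float)}, index=idx)

# Keep only the rows that belong to years with all 12 months present.
# The DatetimeIndex of the surviving rows is preserved.
full_years = data.groupby(data.index.year).filter(lambda sdf: len(sdf) == 12)

print(full_years.index.min(), full_years.index.max())
```

The rows of the incomplete year are dropped rather than replaced by NaN, so this fits when only the complete years are needed downstream.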
Recommended answer
Here is what I came up with:
import pandas as pd
import scipy.stats as sts

# `parse` is the date-parsing helper defined in the question above.
url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
ts_raw = pd.read_table(url,
                       sep=' ',
                       header=0,
                       skiprows=0,
                       parse_dates = [['YR', 'MON']],
                       skipinitialspace=True,
                       index_col=0,
                       date_parser=parse)
# Keep only the years that have all 12 months.
ts_year_group = ts_raw.groupby(lambda x: x.year).apply(lambda sdf: sdf if len(sdf) > 11 else None)
# Rebuild a monthly DatetimeIndex covering the retained years.
ts_range = pd.date_range(ts_year_group.index[0][1],
                         ts_year_group.index[-1][1]+pd.DateOffset(months=1),
                         freq="M")
ts = pd.DataFrame(ts_year_group.values,
                  index=ts_range,
                  columns=ts_year_group.keys())
ts_fullyears_group = ts.groupby(lambda x: x.year)
# Standardize the annual means of the complete years only.
annual_mean = ts_fullyears_group.mean()['ANOM.3']
nin_anomalies = (annual_mean - sts.nanmean(annual_mean)) / sts.nanstd(annual_mean)
nin_anomalies
1982 1.527215
1983 0.779877
1984 -0.970047
1985 -1.012997
1986 0.193297
1987 1.978809
1988 -1.603259
1989 -1.173755
1990 0.244837
1991 0.967632
1992 0.977449
1993 0.568807
1994 0.617893
1995 -0.270568
1996 -0.684120
1997 1.857320
1998 -0.190803
1999 -1.718612
2000 -1.287880
2001 -0.349106
2002 1.106301
2003 0.466953
2004 0.585987
2005 0.196978
2006 0.273062
2007 -0.751613
2008 -1.060856
2009 0.573715
2010 -0.642396
2011 -1.200752
2012 0.000633
Name: ANOM.3, dtype: float64
I'm sure there is a better way to do the same thing :/
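A shorter route might be to compute the annual means for every year and blank out the incomplete ones by group size, which also keeps a NaN entry for the partial year as the NumPy version does. The sketch below is an assumption-laden illustration on synthetic data, not the NOAA file, and the series `ts` and its values are made up for the example. It relies on pandas' own `mean()` and `std()`, which skip NaN by default.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (an assumption for illustration):
# 2010 and 2011 are complete, 2012 has only 5 months.
idx = pd.date_range("2010-01-01", periods=29, freq="MS")
ts = pd.Series(np.arange(29, dtype=float), index=idx, name="ANOM.3")

by_year = ts.groupby(ts.index.year)

# Annual means, with NaN wherever the year has fewer than 12 rows.
annual = by_year.mean().where(by_year.size() == 12)

# Series.mean() and Series.std() ignore NaN, so the incomplete year
# does not affect the standardization and stays NaN in the result.
nin_anomalies = (annual - annual.mean()) / annual.std()
print(nin_anomalies)
```

No hard-coded `[:-1]` slice is needed here, so the code does not have to know in advance which year is incomplete.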