python - How to calculate week differences and add missing weeks with counts in Python pandas

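I have a DataFrame like this. I need to find the missing week values and count how many weeks are missing between the existing ones.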

year    Data    Id
20180406    57170   A
20180413    55150   A
20180420    51109   A
20180427    57170   A
20180504    55150   A
20180525    51109   A

The output should look like this.

Id Start year end-year count
A  20180420      20180420      1
A  20180518      20180525      2

Best answer

Use:

import pandas as pd

#convert to weekly periods ending on Thursday, so each week starts on Friday
df['year'] = pd.to_datetime(df['year'], format='%Y%m%d').dt.to_period('W-Thu')
#resample weekly per Id; asfreq leaves NaN for weeks that have no row
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .resample('W-Thu')
         .asfreq()
         .rename('val')
         .reset_index())
print (df1)
  Id                  year  val
0  A 2018-04-06/2018-04-12    A
1  A 2018-04-13/2018-04-19    A
2  A 2018-04-20/2018-04-26    A
3  A 2018-04-27/2018-05-03    A
4  A 2018-05-04/2018-05-10    A
5  A 2018-05-11/2018-05-17  NaN
6  A 2018-05-18/2018-05-24  NaN
7  A 2018-05-25/2018-05-31    A
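The `W-Thu` alias can be confusing: it names the day a weekly period ends on, not the day it starts. A minimal check, using the first date of the sample data, shows that a Friday date maps to a period running from that Friday through the following Thursday, which is why each sample date becomes the first day of its own period:

```python
import pandas as pd

# First date from the sample data; 'W-THU' means "weekly, ending on Thursday".
day = pd.Timestamp("2018-04-06")
week = day.to_period("W-THU")

print(day.day_name())          # weekday of the sample date
print(week.start_time.date())  # first day covered by the period
print(week.end_time.date())    # last day covered by the period
```

If the dates in your own data fall on a different weekday, the anchor would need to change accordingly (for example `W-SUN` for weeks starting on Monday).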

#convert the periods to datetimes, using the start date of each period
#http://pandas.pydata.org/pandas-docs/stable/timeseries.html#converting-between-representations
df1['year'] = df1['year'].dt.to_timestamp('D', how='s')
print (df1)
  Id       year  val
0  A 2018-04-06    A
1  A 2018-04-13    A
2  A 2018-04-20    A
3  A 2018-04-27    A
4  A 2018-05-04    A
5  A 2018-05-11  NaN
6  A 2018-05-18  NaN
7  A 2018-05-25    A

m = df1['val'].notnull().rename('g')
#cumulative sum of the not-null mask: consecutive NaN rows share one group label
df1.index = m.cumsum()
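The cumulative-sum trick is easier to see on a small standalone Series. The sketch below uses a hypothetical `val` Series (not the real `df1`) in which `None` marks a missing week. The running count only increases on rows that have a value, so every row within one run of missing values gets the same label:

```python
import pandas as pd

# Hypothetical stand-in for df1['val']: None marks a week with no data.
val = pd.Series(["A", "A", None, None, "A", None, "A"])

present = val.notna()
# The running count only grows on present rows, so all rows in one
# consecutive run of missing values carry the same label.
labels = present.cumsum()

# Keep only the missing rows and measure the length of each run.
gap_sizes = labels[~present].value_counts().sort_index()

print(labels.tolist())
print(gap_sizes.tolist())
```

Here the two gaps have lengths 2 and 1. Grouping the missing rows by this label is what lets the next step aggregate each gap separately.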

#keep only the NaN rows and aggregate first, last and size per gap
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
                     .agg(['first','last','size'])
                     .reset_index(level=1, drop=True)
                     .reset_index())

print (df2)
  Id      first       last  size
0  A 2018-05-11 2018-05-18     2
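For comparison, here is a self-contained sketch of an alternative that avoids period resampling. It is not the method used above. It assumes that all dates for one `Id` fall on the same weekday and are exactly 7 days apart when no week is missing. The helper name `missing_week_runs` is made up for this illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "year": [20180406, 20180413, 20180420, 20180427, 20180504, 20180525],
    "Data": [57170, 55150, 51109, 57170, 55150, 51109],
    "Id": list("AAAAAA"),
})
df["year"] = pd.to_datetime(df["year"].astype(str), format="%Y%m%d")


def missing_week_runs(dates):
    """Return one row per run of consecutive missing weeks (hypothetical helper)."""
    dates = pd.DatetimeIndex(dates).sort_values()
    # Assumption: observations are exactly 7 days apart when nothing is missing.
    expected = pd.date_range(dates[0], dates[-1], freq="7D")
    missing = pd.Series(expected.difference(dates))
    # A new run starts wherever the step from the previous missing week is not 7 days.
    run_id = (missing.diff() != pd.Timedelta(days=7)).cumsum()
    return missing.groupby(run_id).agg(["first", "last", "size"])


frames = []
for id_, group in df.groupby("Id"):
    runs = missing_week_runs(group["year"])
    runs.insert(0, "Id", id_)
    frames.append(runs)

result = pd.concat(frames, ignore_index=True)
print(result)
```

For the sample data this should produce a single gap of two weeks, from 2018-05-11 to 2018-05-18, which agrees with `df2` above.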