
Pandas Expanding Mean With Group By And Before Current Row Date

I have a Pandas dataframe as follows:

df = pd.DataFrame([['John', '1/1/2017','10'], ['John', '2/2/2017','15'], ['John', '2/2/2017','20'],

Solution 1:

Instead of grouping and computing an expanding mean, filter the dataframe on the following conditions and take the mean of DPD over the matching rows:

  • Customer == current row's Customer
  • Deposit_Date < current row's Deposit_Date

Use df.apply to perform this operation for every row in the dataframe (this assumes Deposit_Date has already been converted to datetime and DPD to a numeric type):

df['PreviousMean'] = df.apply(
    lambda x: df[(df.Customer == x.Customer) & (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
    axis=1)

outputs:

  Customer Deposit_Date  DPD  PreviousMean
0     John   2017-01-01   10           NaN
1     John   2017-02-02   15          10.0
2     John   2017-02-02   20          10.0
3     John   2017-03-03   30          15.0
4      Sue   2017-01-01   10           NaN
5      Sue   2017-02-02   15          10.0
6      Sue   2017-03-02   20          12.5
7      Sue   2017-03-03    7          15.0
8      Sue   2017-04-04   20          13.0
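The question's data is truncated, so here is a minimal, self-contained sketch of Solution 1 with the rows rebuilt from the output table above (an assumption about the original input). DPD is stored as integers and Deposit_Date is parsed with pd.to_datetime so that the < comparison works on dates. The helper name previous_mean is made up for illustration:

```python
import pandas as pd

# Rows rebuilt from the output table above (assumed to match the original input)
df = pd.DataFrame(
    [
        ["John", "2017-01-01", 10],
        ["John", "2017-02-02", 15],
        ["John", "2017-02-02", 20],
        ["John", "2017-03-03", 30],
        ["Sue", "2017-01-01", 10],
        ["Sue", "2017-02-02", 15],
        ["Sue", "2017-03-02", 20],
        ["Sue", "2017-03-03", 7],
        ["Sue", "2017-04-04", 20],
    ],
    columns=["Customer", "Deposit_Date", "DPD"],
)
df["Deposit_Date"] = pd.to_datetime(df["Deposit_Date"])


def previous_mean(row):
    # Rows for the same customer with a strictly earlier deposit date
    earlier = df[(df["Customer"] == row["Customer"])
                 & (df["Deposit_Date"] < row["Deposit_Date"])]
    # The mean of an empty selection is NaN, which marks "no earlier deposits"
    return earlier["DPD"].mean()


df["PreviousMean"] = df.apply(previous_mean, axis=1)
print(df)
```

Each row triggers a full scan of the dataframe, so the cost grows quadratically with the number of rows. That is fine for small tables but may be slow on large ones.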

Solution 2:

Here's one way to exclude repeated days from the mean calculation:

import numpy as np

# create helper series which is NaN for repeated days (every occurrence after the first), DPD otherwise
s = df.groupby(['Customer Name', 'Deposit_Date']).cumcount() > 0
df['DPD2'] = np.where(s, np.nan, df['DPD'])

# expanding mean per customer; pd.expanding_mean has been removed from pandas,
# so use the .expanding() method instead
df['CumMean'] = df.groupby('Customer Name')['DPD2'].transform(lambda x: x.expanding().mean())

# drop helper series
df = df.drop(columns='DPD2')

print(df)

  Customer Name Deposit_Date  DPD  CumMean
0          John   01/01/2017   10     10.0
1          John   01/01/2017   10     10.0
2          John   02/02/2017   20     15.0
3          John   03/03/2017   30     20.0
4           Sue   01/01/2017   10     10.0
5           Sue   01/01/2017   10     10.0
6           Sue   02/02/2017   20     15.0
7           Sue   03/03/2017   30     20.0

Solution 3:

Ok here is the best solution I've come up with thus far.

The trick is to first create an aggregated table at the customer & deposit date level containing a shifted mean. To calculate this mean you have to calculate the sum and the count first.
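To see why the sum and the count are needed rather than an average of per-date means, take John's first two deposit dates from the output table in Solution 1 (10 on the first date, then 15 and 20 on the same later date). A minimal sketch of the arithmetic, with illustrative variable names:

```python
# John's DPD values grouped by deposit date (taken from Solution 1's output table)
per_date = {"2017-01-01": [10], "2017-02-02": [15, 20]}

# Averaging the per-date means weights each date equally
mean_of_means = sum(sum(v) / len(v) for v in per_date.values()) / len(per_date)

# Dividing the running sum by the running count weights each row equally
total = sum(sum(v) for v in per_date.values())
count = sum(len(v) for v in per_date.values())
true_mean = total / count

print(mean_of_means)  # 13.75
print(true_mean)      # 15.0
```

Only the second figure matches the 15.0 that Solution 1 reports for John's 2017-03-03 row.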

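# aggregate to one row per customer and deposit date (assumes DPD is numeric
# and Deposit_Date is a datetime, so the groups sort chronologically)
s = df.groupby(['Customer Name', 'Deposit_Date'])[['DPD']].agg(['count', 'sum'])
s.columns = [' '.join(col) for col in s.columns]  # gives 'DPD count' and 'DPD sum'
s.reset_index(inplace=True)

# running sum and count per customer give the cumulative mean up to each date
s['DPD_CumSum'] = s.groupby('Customer Name')['DPD sum'].cumsum()
s['DPD_CumCount'] = s.groupby('Customer Name')['DPD count'].cumsum()
s['DPD_CumMean'] = s['DPD_CumSum'] / s['DPD_CumCount']
# shift by one date so each date only sees strictly earlier deposits
s['DPD_PrevMean'] = s.groupby('Customer Name')['DPD_CumMean'].shift(1)

# attach the previous mean to every original row
df = df.merge(s[['Customer Name', 'Deposit_Date', 'DPD_PrevMean']], how='left', on=['Customer Name', 'Deposit_Date'])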
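As a rough check, the aggregate, shift, and merge steps can be run end to end on the rows from Solution 1's output table. This is a sketch under stated assumptions: the data is rebuilt from that table with the customer column named 'Customer Name', DPD is numeric, and Deposit_Date is parsed to datetime. The name result is chosen for illustration:

```python
import pandas as pd

# Rows rebuilt from Solution 1's output table (assumed input)
df = pd.DataFrame(
    [
        ["John", "2017-01-01", 10],
        ["John", "2017-02-02", 15],
        ["John", "2017-02-02", 20],
        ["John", "2017-03-03", 30],
        ["Sue", "2017-01-01", 10],
        ["Sue", "2017-02-02", 15],
        ["Sue", "2017-03-02", 20],
        ["Sue", "2017-03-03", 7],
        ["Sue", "2017-04-04", 20],
    ],
    columns=["Customer Name", "Deposit_Date", "DPD"],
)
df["Deposit_Date"] = pd.to_datetime(df["Deposit_Date"])

# One row per customer and date, holding that date's row count and DPD total
s = (df.groupby(["Customer Name", "Deposit_Date"])["DPD"]
       .agg(["count", "sum"])
       .reset_index())

# Cumulative mean per customer, in date order (groupby sorts the keys)
cum_sum = s.groupby("Customer Name")["sum"].cumsum()
cum_count = s.groupby("Customer Name")["count"].cumsum()
s["DPD_CumMean"] = cum_sum / cum_count

# Shift within each customer so a date only sees strictly earlier dates
s["DPD_PrevMean"] = s.groupby("Customer Name")["DPD_CumMean"].shift(1)

# Broadcast the per-date value back onto every original row
result = df.merge(
    s[["Customer Name", "Deposit_Date", "DPD_PrevMean"]],
    how="left",
    on=["Customer Name", "Deposit_Date"],
)
print(result)
```

On this data the DPD_PrevMean column matches the PreviousMean values from Solution 1, without the row-by-row filtering.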
