In-group Time-to-Event Counter

I'm trying to work through the methodology for churn prediction I found here: Let's say today is 1/6/2017. I have a pandas dataframe, df, that I want to add two columns to. df = p

Solution 1:

To get the days until the next event, we can add a column that backfills the date of the next event:

df['next_event'] = df['date'].where(df['is_event'] == 1)
df['next_event'] = df.groupby('id')['next_event'].bfill()

Rows with no later event in their id are filled with the day after the last date in the frame. We can then subtract to get the days between the next event and each day:

df['next_event'] = df['next_event'].fillna(df['date'].iloc[-1] + pd.Timedelta(days=1))
df['time_to_next_event'] = (df['next_event']-df['date']).dt.days
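
As a self-contained check of the two steps above, here is a minimal sketch. The sample frame is an assumption, rebuilt from the output shown in this answer, and the fallback date uses df['date'].max() rather than iloc[-1] so it does not depend on row order:

```python
import pandas as pd

# Sample frame reconstructed from the output in this answer (an assumption).
df = pd.DataFrame({
    "id": list("aaaaabbbbb"),
    "date": list(pd.date_range("2017-01-01", periods=5)) * 2,
    "is_event": [0, 0, 0, 1, 1, 0, 1, 0, 0, 0],
})

# Date on event rows, NaT everywhere else.
event_date = df["date"].where(df["is_event"] == 1)

# Back-fill within each id so every row sees the date of its next event.
next_event = event_date.groupby(df["id"]).bfill()

# Rows with no later event fall back to the day after the last observed date.
horizon = df["date"].max() + pd.Timedelta(days=1)
df["time_to_next_event"] = (next_event.fillna(horizon) - df["date"]).dt.days

print(df)
```

Working on a separate Series instead of a temporary 'next_event' column also means there is nothing to drop afterwards.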

To get is_censored for each day and each id, we group by id and forward-fill the 1s in 'is_event' over the 0s that follow them. Only the filled-in values matter: by the definition above, 'is_censored' should be 0 on the day of the event itself. So we compare 'is_event' with its forward-filled version and set 'is_censored' to 1 wherever the filled column holds a value that was not in the original.

df['is_censored'] = (df.groupby('id')['is_event'].transform(lambda x: x.where(x == 1).ffill().fillna(0)) != df['is_event']).astype(int)
df = df.drop(columns='next_event')

In [343]: df
Out[343]:
  id       date  is_event  time_to_next_event  is_censored
0  a 2017-01-01         0                   3            0
1  a 2017-01-02         0                   2            0
2  a 2017-01-03         0                   1            0
3  a 2017-01-04         1                   0            0
4  a 2017-01-05         1                   0            0
5  b 2017-01-01         0                   1            0
6  b 2017-01-02         1                   0            0
7  b 2017-01-03         0                   3            1
8  b 2017-01-04         0                   2            1
9  b 2017-01-05         0                   1            1
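
The forward-fill comparison can also be tried in isolation. The sketch below swaps in a per-id running maximum (cummax), which for a 0/1 flag gives the same result as forward-filling the 1s. That substitution and the sample data (taken from the output above) are assumptions of this sketch, not part of the original answer:

```python
import pandas as pd

# Same ids and event flags as in the output above.
df = pd.DataFrame({
    "id": list("aaaaabbbbb"),
    "is_event": [0, 0, 0, 1, 1, 0, 1, 0, 0, 0],
})

# Running maximum per id: 0 until the first event, 1 from then on.
# For a 0/1 flag this matches forward-filling the 1s over later 0s.
seen_event = df.groupby("id")["is_event"].cummax()

# A row is censored where the fill changed the value,
# i.e. a non-event day that comes after an event in the same id.
df["is_censored"] = (seen_event != df["is_event"]).astype(int)

print(df)
```

Note that this flags every non-event day after the first event, including days that sit between two events of the same id.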

Solution 2:

To generalize the method for is_censored to include cases where an event happens more than once within each id, I wrote this:

df['is_censored2'] = 1

# Latest event date per id
max_dates = df[df['is_event'] == 1].groupby('id', as_index=False)['date'].max()
max_dates.columns = ['id', 'max_date']
df = pd.merge(df, max_dates, on='id', how='left')

# Use .loc rather than chained indexing so the assignment reaches df itself
df.loc[df['date'] <= df['max_date'], 'is_censored2'] = 0

It initializes the column to 1, finds the latest event date within each id, and sets is_censored2 to 0 on every row of that id whose date is on or before that latest event date. Ids with no event at all get no max_date from the left merge, so they stay at 1.
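
To see the rule on an id with repeated events, here is a minimal sketch that avoids the merge by broadcasting the last event date with a grouped transform. The ids "c" and "d" and their data are hypothetical, made up for illustration:

```python
import pandas as pd

# Hypothetical data: "c" has two events with a quiet day between them,
# "d" has no event at all.
df = pd.DataFrame({
    "id": ["c"] * 5 + ["d"] * 3,
    "date": list(pd.date_range("2017-01-01", periods=5))
            + list(pd.date_range("2017-01-01", periods=3)),
    "is_event": [0, 1, 0, 1, 0, 0, 0, 0],
})

# Last event date per id, repeated on every row of that id (NaT if no event).
last_event = (
    df["date"].where(df["is_event"] == 1)
    .groupby(df["id"])
    .transform("max")
)

# Comparing against NaT yields False, so ids without events stay censored.
df["is_censored2"] = (~(df["date"] <= last_event)).astype(int)

print(df)
```

Here 2017-01-03 for "c" is not censored because a later event follows it, whereas the forward-fill comparison from Solution 1 would mark that day as 1.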