
Split The Data Frame Based On Consecutive Row Values Differences

I have a data frame like this,

df

col1  col2  col3
   1     2     3
   2     5     6
   7     8     9
  10    11    12
  11    12    13
  13    14    15
  14    15   ...

Solution 1:

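You could define a custom grouper by taking the diff of col1, checking where it is greater than 1, and taking the cumsum of the resulting boolean series. Each jump larger than 1 bumps the running total, so consecutive rows share a group label. Then group by the result and build a dictionary from the groupby object: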

d = dict(tuple(df.groupby(df.col1.diff().gt(1).cumsum())))

print(d[0])
   col1  col2  col3
0     1     2     3
1     2     5     6

print(d[1])
   col1  col2  col3
2     7     8     9

A more detailed break-down:

df.assign(difference=(diff:=df.col1.diff()), 
          condition=(gt1:=diff.gt(1)), 
          grouper=gt1.cumsum())

   col1  col2  col3  difference  condition  grouper
0     1     2     3         NaN      False        0
1     2     5     6         1.0      False        0
2     7     8     9         5.0       True        1
3    10    11    12         3.0       True        2
4    11    12    13         1.0      False        2
5    13    14    15         2.0       True        3
6    14    15    16         1.0      False        3
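
For anyone who wants to run Solution 1 end to end, here is a minimal self-contained sketch. The data frame is rebuilt from the values printed in the break-down above, which is an assumption about the asker's exact input, and the names `grouper` and `pieces` are just illustrative.

```python
import pandas as pd

# Data frame rebuilt from the values shown in the answer's output
# (an assumption about the original input).
df = pd.DataFrame({
    "col1": [1, 2, 7, 10, 11, 13, 14],
    "col2": [2, 5, 8, 11, 12, 14, 15],
    "col3": [3, 6, 9, 12, 13, 15, 16],
})

# A new group starts wherever col1 jumps by more than 1 from the previous row.
# The first diff is NaN, and NaN > 1 is False, so the first row lands in group 0.
grouper = df["col1"].diff().gt(1).cumsum()

# Map each group label to its sub-frame (same idea as dict(tuple(...)) above).
pieces = {label: part for label, part in df.groupby(grouper)}

for label, part in pieces.items():
    print(label, part["col1"].tolist())
```

With this input the loop should report four groups, holding the col1 values [1, 2], [7], [10, 11] and [13, 14].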

Solution 2:

You can also peel off the target column and work with it as a Series, instead of grouping the whole data frame as in the answer above. That keeps the intermediate objects smaller. It runs faster on the example, but I don't know how the two approaches will scale up, which probably depends on how many splits there are.

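import numpy as np

row_bool = df['col1'].diff() > 1       # True on the first row of each new group
split_inds, = np.where(row_bool)       # integer positions of those rows
# add position 0 at the front and len(df) at the end so every slice has both bounds
split_inds = np.insert(arr=split_inds, obj=[0, len(split_inds)], values=[0, len(df)])

df_list = []                           # tuples have no append(), so collect in a list first
for n in range(len(split_inds) - 1):
    tempdf = df.iloc[split_inds[n]:split_inds[n+1], :]
    df_list.append(tempdf)
df_tup = tuple(df_list)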

(Just throwing it in a tuple of dataframes afterward, but the dictionary approach might be better?)
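
If you want to check that the two solutions really cut the frame in the same places, something like the sketch below should do it. It assumes the same example data as above, and it swaps in `np.flatnonzero` and `np.concatenate` for the `np.where` / `np.insert` pair purely for brevity. The names `by_group`, `by_slice` and `bounds` are mine, not from either answer.

```python
import numpy as np
import pandas as pd

# Example data rebuilt from the output shown in Solution 1 (assumed input).
df = pd.DataFrame({
    "col1": [1, 2, 7, 10, 11, 13, 14],
    "col2": [2, 5, 8, 11, 12, 14, 15],
    "col3": [3, 6, 9, 12, 13, 15, 16],
})

is_jump = df["col1"].diff().gt(1)

# Solution 1 style: group on the running count of jumps.
by_group = [part for _, part in df.groupby(is_jump.cumsum())]

# Solution 2 style: slice by position between consecutive split points.
jumps = np.flatnonzero(is_jump.to_numpy())
bounds = np.concatenate(([0], jumps, [len(df)]))
by_slice = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

# Both lists should hold identical sub-frames in the same order.
same = len(by_group) == len(by_slice) and all(
    a.equals(b) for a, b in zip(by_group, by_slice)
)
print(same)
```

Either container works once the pieces match. The dictionary is handy when you want to look a piece up by its group label, while the list or tuple is enough if you only iterate in order.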
