Skip to content Skip to sidebar Skip to footer

Adding A Column To Dask Dataframe, Computing It Through A Rolling Window

Suppose I have the following code, to generate a dummy dask dataframe: import pandas as pd import dask.dataframe as dd pandas_dataframe = pd.DataFrame({'A' : [0,500,1000], 'B': [-1

Solution 1:

I have tried to solve it through a closure. I will post benchmarks on some data, as soon as I have finalized the code. For now I have the following toy example, which seems to work: since dask dataframe's apply methods seems to be preserving the row order.

import numpy as np
import pandas as pd
import dask.dataframe as dd
number_of_components = 30


df = pd.DataFrame(np.random.randint(0,number_of_components,size=(number_of_components, 2)), columns=list('AB'))
my_data_frame = dd.from_pandas(df, npartitions = 1 )


def sumPrevious( previousState ) :

     def getValue(row):
        nonlocal previousState 
        something = row['A'] - previousState 
        previousState = row['A']
        return something

     return getValue


given_func = sumPrevious(1 )
out = my_data_frame.apply(given_func, axis = 1 , meta = float)
df['computed'] = out.compute()

Now the bad news, I have tried to abstract it out, passing the state around and using a rolling window of any width, through this new function:

def generalised_coupled_computation(previous_state , coupled_computation, previous_state_update) :

    def inner_function(actual_state):
        nonlocal previous_state
        actual_value = coupled_computation(actual_state , previous_state  )
        previous_state = previous_state_update(actual_state, previous_state)
        return actual_value

    return inner_function

Suppose we initialize the function with:

init_state = df.loc[0] 
coupled_computation  = lambda act,prev : act['A'] - prev['A']
new_update = lambda act, prev : act
given_func3 = generalised_coupled_computation(init_state , coupled_computation, new_update )
out3 = my_data_frame.apply(given_func3, axis = 1 , meta = float)

Try to run it and be ready for surprises: the first element is wrong, possibly some pointer's problems, given the odd result. Any insight?

Anyhow, if one passes primitive types, it seems to function.


Update:

the solution is in using copy:

import copy as copy

def new_update(act, previous):
    return copy.copy(act)

Now the functions behaves as expected; of course it is necessary to adapt the function updates and the coupled computation function if one needs a more coupled logic


Post a Comment for "Adding A Column To Dask Dataframe, Computing It Through A Rolling Window"