
Pandas: Apply Function Over Each Pair Of Columns Under Constraints

As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'm going to try to illustrate this. My df is of the form: Code |

Solution 1:

To apply the cosine metric to each pair drawn from two collections of inputs, you can use scipy.spatial.distance.cdist. It computes all the pairwise values in a single vectorized call, which is much faster than a double Python loop.

Let one collection be all the columns of df. Let the other collection be only those columns where the sum is greater than 5:

import pandas as pd
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
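
As a quick sanity check on the selection step, the sketch below prints the column sums and the columns that survive the mask. The name `col_sums` is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})

# Sum each column (axis=0 sums down the rows)
col_sums = df.sum(axis=0)
# Boolean Series indexed by column label: True where the sum exceeds 5
mask = col_sums > 5
# Keep only the columns flagged True
df2 = df.loc[:, mask]

print(col_sums.tolist())   # [2, 10, 12]
print(list(df2.columns))   # ['17', '19']
```

Column '14' sums to 2, so it is excluded from df2 but still appears in df as a comparison target.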

Then all the pairwise values can be computed with one call to cdist. Note that metric='cosine' returns the cosine distance, i.e. 1 minus the cosine similarity:

import scipy.spatial.distance as SSD
values = SSD.cdist(df2.T, df.T, metric='cosine')
# array([[  2.92893219e-01,   1.11022302e-16,   3.00000000e-01],
#        [  4.34314575e-01,   3.00000000e-01,   1.11022302e-16]])
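
If similarities rather than distances are wanted, subtracting from 1 should be enough. The following is a minimal sketch that also checks one entry by hand with NumPy; the names `distances`, `similarities` and `manual` are illustrative, and the inputs are converted to float arrays explicitly as a precaution.

```python
import numpy as np
import pandas as pd
import scipy.spatial.distance as SSD

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})
df2 = df.loc[:, df.sum(axis=0) > 5]

# Rows of each input are the vectors being compared, hence the transpose
distances = SSD.cdist(df2.T.to_numpy(dtype=float),
                      df.T.to_numpy(dtype=float),
                      metric='cosine')

# Cosine similarity is 1 minus cosine distance
similarities = 1.0 - distances

# Manual check for the pair (column '17', column '14')
a = df['17'].to_numpy(dtype=float)
b = df['14'].to_numpy(dtype=float)
manual = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.isclose(similarities[0, 0], manual))  # True
```

The near-zero entries (about 1e-16) in the distance matrix are the self-pairs, where the similarity is 1 up to floating-point error.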

The values can be wrapped in a new DataFrame and reshaped:

result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()

After stacking, the Series has a two-level index: the filtered column first, then the column it was compared with. To drop the pairs in which a column is compared with itself, filter on the two index levels. The complete script:

import pandas as pd
import scipy.spatial.distance as SSD

df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
# Keep only the columns whose sum exceeds 5
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
# Cosine distance between every column of df2 and every column of df
values = SSD.cdist(df2.T, df.T, metric='cosine')
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
# Drop self-pairs, i.e. entries where both index levels name the same column
mask = result.index.get_level_values(0) != result.index.get_level_values(1)
result = result.loc[mask]
print(result)

yields the Series of cosine distances, indexed by (filtered column, compared column):

17  14    0.292893
    19    0.300000
19  14    0.434315
    17    0.300000
