Calculating Average Distance Of Nearest Neighbours In Pandas Dataframe

I have a set of objects and their positions over time. I would like to get the distance between each car and their nearest neighbour, and calculate an average of this for each time

Solution 1:

It might be a bit overkill, but you could use NearestNeighbors from scikit-learn.

An example:

import numpy as np 
from sklearn.neighbors import NearestNeighbors
import pandas as pd

def nn(x):
    nbrs = NearestNeighbors(n_neighbors=2, algorithm='auto', metric='euclidean').fit(x)
    distances, indices = nbrs.kneighbors(x)
    return distances, indices

time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56] 
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})

# For each time step: the index of each car's nearest neighbour within the group, and the distance to it
nns = df.groupby('time')[['x', 'y']].apply(lambda g: nn(g.to_numpy()))

groups = df.groupby('time')
nn_rows = []
for t, (distances, indices) in nns.items():
    group = groups.get_group(t)
    # Column 0 of each result is the car itself (distance 0); column 1 is its nearest neighbour
    for j, (d, idx) in enumerate(zip(distances, indices)):
        nn_rows.append({'time': t,
                        'car': group.iloc[j]['car'],
                        'euclidean_distance': d[1],
                        'nearest_neighbour': group.iloc[idx[1]]['car']})

nn_df = pd.DataFrame(nn_rows).set_index('time')

Result:

      car  euclidean_distance  nearest_neighbour
time
0       1            1.414214                  3
0       2            1.000000                  3
0       3            1.000000                  2
1       1           10.049876                  3
1       3           10.049876                  1
2       4           53.037722                  5
2       5           53.037722                  4

(Note that at time 0, car 3's nearest neighbor is car 2. sqrt((217-216)**2 + 1) is about 1.4142135623730951 while sqrt((218-217)**2 + 0) = 1)
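
The arithmetic in that note can be checked directly. The following is only an illustrative sketch: the `positions` dictionary and the `distance` helper are hypothetical names introduced here, with the time-0 coordinates copied from the sample data.

```python
import math

# Positions at time 0, copied from the sample data (car id -> (x, y)).
positions = {1: (216, 13), 2: (218, 12), 3: (217, 12)}

def distance(a, b):
    """Euclidean distance between two cars, looked up by car id."""
    (xa, ya), (xb, yb) = positions[a], positions[b]
    return math.hypot(xa - xb, ya - yb)

print(distance(3, 1))  # about 1.414
print(distance(3, 2))  # 1.0
```

Car 2 is therefore closer to car 3 than car 1 is, which matches the table.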

Solution 2:

Use cdist from scipy.spatial.distance to get a matrix of the distances from each car to every other car. Since each car's distance to itself is 0, the diagonal elements are all 0.

Example (for time == 0):

from scipy.spatial.distance import cdist

X = df[df.time==0][['x','y']]
dist = cdist(X, X)
dist
array([[0.        , 2.23606798, 1.41421356],
       [2.23606798, 0.        , 1.        ],
       [1.41421356, 1.        , 0.        ]])

Use np.argsort to get the indexes that would sort the distance-matrix. The first column is just the row number because the diagonal elements are 0.

idx = np.argsort(dist)
idx
array([[0, 2, 1],
       [1, 2, 0],
       [2, 1, 0]], dtype=int64)

Then pick out the nearest cars and their distances using idx. Column 1 of idx holds the position of each car's nearest neighbour:

dist[idx[:,0], idx[:,1]]
array([1.41421356, 1.        , 1.        ])

df[df.time==0].car.values[idx[:,1]]
array([3, 3, 2], dtype=int64)
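
The argsort approach relies on each car being its own closest match (column 0), which might not hold if two cars share exactly the same position. One possible alternative, which is not part of the original answer, is to mask the diagonal with infinity and take the row-wise minimum. A minimal sketch, with the time-0 positions hard-coded and `cars`, `nearest_pos`, `nearest_dist` and `nearest_car` being names chosen here for illustration:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Cars 1, 2 and 3 at time 0, copied from the sample data.
cars = np.array([1, 2, 3])
X = np.array([[216, 13], [218, 12], [217, 12]], dtype=float)

dist = cdist(X, X)
np.fill_diagonal(dist, np.inf)     # a car should never count as its own neighbour

nearest_pos = dist.argmin(axis=1)  # row-wise position of the closest other car
nearest_dist = dist.min(axis=1)    # distance to that car
nearest_car = cars[nearest_pos]    # translate positions back to car ids

print(nearest_car)
print(nearest_dist)
```

Note that a time step containing a single car would yield an infinite distance with this variant, so such groups may need separate handling.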

Combine the above logic into a function that returns the required dataframe:

def closest(df):
    X = df[['x', 'y']]
    dist = cdist(X, X)
    # Column 0 of v is each car itself; column 1 is its nearest neighbour
    v = np.argsort(dist)
    return df.assign(euclidean_distance=dist[v[:, 0], v[:, 1]],
                     nearest_neighbour=df.car.values[v[:, 1]])

Then use it with groupby, resetting the index at the end because groupby-apply adds an extra index level:

df.groupby('time').apply(closest).reset_index(drop=True)

   time    x    y  car  euclidean_distance  nearest_neighbour
0     0  216   13    1            1.414214                  3
1     0  218   12    2            1.000000                  3
2     0  217   12    3            1.000000                  2
3     1  280  110    1           10.049876                  3
4     1  290  109    3           10.049876                  1
5     2  130    3    4           53.037722                  5
6     2  132   56    5           53.037722                  4
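
The question also asks for the average nearest-neighbour distance at each time, which neither answer computes explicitly. A minimal sketch of that last step, assuming the sample data from the question; `nearest_distance` and `avg_nn_distance` are hypothetical names, and the groups are looped over explicitly rather than passed through groupby-apply:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

df = pd.DataFrame({'time': [0, 0, 0, 1, 1, 2, 2],
                   'x': [216, 218, 217, 280, 290, 130, 132],
                   'y': [13, 12, 12, 110, 109, 3, 56],
                   'car': [1, 2, 3, 1, 3, 4, 5]})

def nearest_distance(group):
    """Distance from each car in one time step to its closest other car."""
    xy = group[['x', 'y']].to_numpy()
    dist = cdist(xy, xy)
    # After sorting each row, column 0 is the zero self-distance and
    # column 1 is the distance to the nearest neighbour.
    return np.sort(dist, axis=1)[:, 1]

# Mean nearest-neighbour distance per time step.
avg_nn_distance = pd.Series(
    {t: nearest_distance(g).mean() for t, g in df.groupby('time')},
    name='avg_nn_distance')
avg_nn_distance.index.name = 'time'
print(avg_nn_distance)
```

For the sample data this gives roughly 1.138 at time 0, 10.050 at time 1 and 53.038 at time 2. The sketch assumes every time step has at least two cars.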

By the way, your sample output is wrong for time 0. My answer and Bacon's answer both show the correct result.