Calculating Average Distance Of Nearest Neighbours In Pandas Dataframe
Solution 1:
It might be a bit overkill, but you could use NearestNeighbors from scikit-learn.
An example:
import numpy as np
from sklearn.neighbors import NearestNeighbors
import pandas as pd
def nn(x):
    # n_neighbors=2 because each point's closest match is itself (distance 0)
    nbrs = NearestNeighbors(n_neighbors=2, algorithm='auto', metric='euclidean').fit(x)
    distances, indices = nbrs.kneighbors(x)
    return distances, indices
time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56]
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
# For each time step: the distances to, and positions of, the two nearest points
# (column 0 is the point itself, column 1 is its nearest neighbour in the group)
nns = df.groupby('time')[['x', 'y']].apply(lambda g: nn(g.to_numpy()))
groups = df.groupby('time')
nn_rows = []
for t, (distances, indices) in nns.items():
    group = groups.get_group(t)
    for j, (d, ind) in enumerate(zip(distances, indices)):
        nn_rows.append({'time': t,
                        'car': group.iloc[j]['car'],
                        'nearest_neighbour': group.iloc[ind[1]]['car'],
                        'euclidean_distance': d[1]})
nn_df = pd.DataFrame(nn_rows).set_index('time')
Result:
      car  euclidean_distance  nearest_neighbour
time
0       1            1.414214                  3
0       2            1.000000                  3
0       3            1.000000                  2
1       1           10.049876                  3
1       3           10.049876                  1
2       4           53.037722                  5
2       5           53.037722                  4
(Note that at time 0, car 3's nearest neighbor is car 2: its distance to car 1 is sqrt((217-216)**2 + 1), which is about 1.4142135623730951, while its distance to car 2 is sqrt((218-217)**2 + 0) = 1.)
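The arithmetic in that note can be checked directly. The following is only a small sketch using the time-0 coordinates from the example data; the `positions` dictionary and the `distance` helper are names introduced here for illustration and are not part of the original answer.

```python
import math

# Positions at time 0, copied from the example data: car id -> (x, y)
positions = {1: (216, 13), 2: (218, 12), 3: (217, 12)}

def distance(a, b):
    """Euclidean distance between two cars, looked up by car id."""
    (xa, ya), (xb, yb) = positions[a], positions[b]
    return math.hypot(xa - xb, ya - yb)

d31 = distance(3, 1)  # sqrt(1**2 + 1**2)
d32 = distance(3, 2)  # sqrt(1**2 + 0**2)

# The nearest neighbour of car 3 is whichever other car is closest.
nearest_to_3 = min((1, 2), key=lambda other: distance(3, other))
print(d31, d32, nearest_to_3)
```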
Solution 2:
Use cdist from scipy.spatial.distance to get a matrix of the distances from each car to every other car. Since each car's distance to itself is 0, the diagonal elements are all 0.
Example (for time == 0):
from scipy.spatial.distance import cdist
X = df[df.time==0][['x','y']]
dist = cdist(X, X)
dist
array([[0.        , 2.23606798, 1.41421356],
       [2.23606798, 0.        , 1.        ],
       [1.41421356, 1.        , 0.        ]])
Use np.argsort to get the indexes that would sort the distance-matrix. The first column is just the row number because the diagonal elements are 0.
idx = np.argsort(dist)
idx
array([[0, 2, 1],
       [1, 2, 0],
       [2, 1, 0]], dtype=int64)
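Then pick out the nearest cars and their distances using idx: column 0 holds each row's own position and column 1 holds the position of its nearest neighbour.
dist[idx[:,0], idx[:,1]]
array([1.41421356, 1.        , 1.        ])
df[df.time==0].car.values[idx[:,1]]
array([3, 3, 2], dtype=int64)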
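The argsort step assumes that each row's own zero sorts into the first column. That may not hold if two cars ever share the same coordinates, since the tie at distance 0 can be ordered either way. A possible variation, shown here only as a sketch and not as part of the original answer, is to overwrite the diagonal with infinity and take the row-wise argmin. The `cars` and `X` arrays below simply restate the time-0 data.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Time-0 data from the example, one row per car (cars 1, 2, 3).
cars = np.array([1, 2, 3])
X = np.array([[216, 13], [218, 12], [217, 12]], dtype=float)

dist = cdist(X, X)
np.fill_diagonal(dist, np.inf)  # a car can never be its own neighbour

nearest = dist.argmin(axis=1)                    # position of the closest other car
nearest_dist = dist[np.arange(len(X)), nearest]  # distance to that car

print(cars[nearest])
print(nearest_dist)
```

This avoids sorting each full row just to read off one column, although for groups this small the difference is unlikely to matter.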
Combine the above logic into a function that returns the required dataframe:
def closest(df):
    X = df[['x', 'y']]
    dist = cdist(X, X)
    v = np.argsort(dist)
    return df.assign(euclidean_distance=dist[v[:, 0], v[:, 1]],
                     nearest_neighbour=df.car.values[v[:, 1]])
Then use it with groupby, dropping the index at the end because groupby-apply adds an extra index level:
df.groupby('time').apply(closest).reset_index(drop=True)
   time    x    y  car  euclidean_distance  nearest_neighbour
0     0  216   13    1            1.414214                  3
1     0  218   12    2            1.000000                  3
2     0  217   12    3            1.000000                  2
3     1  280  110    1           10.049876                  3
4     1  290  109    3           10.049876                  1
5     2  130    3    4           53.037722                  5
6     2  132   56    5           53.037722                  4
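Neither answer goes on to compute the average that the title mentions. Assuming "average distance" means the mean nearest-neighbour distance per time step (an interpretation, since the original question text is not shown here), one way to get it might look like the sketch below. The helper name `nearest_distance` is hypothetical, and the data frame is rebuilt so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Example data from the answers above.
df = pd.DataFrame({'time': [0, 0, 0, 1, 1, 2, 2],
                   'x': [216, 218, 217, 280, 290, 130, 132],
                   'y': [13, 12, 12, 110, 109, 3, 56],
                   'car': [1, 2, 3, 1, 3, 4, 5]})

def nearest_distance(group):
    """Distance from each car in one time step to its closest other car."""
    xy = group[['x', 'y']].to_numpy(dtype=float)
    dist = cdist(xy, xy)
    np.fill_diagonal(dist, np.inf)  # ignore each car's distance to itself
    return pd.Series(dist.min(axis=1), index=group.index)

# Compute per-group distances, then align them back onto df by index.
parts = [nearest_distance(g) for _, g in df.groupby('time')]
df['euclidean_distance'] = pd.concat(parts)

# Mean nearest-neighbour distance for every time step.
avg = df.groupby('time')['euclidean_distance'].mean()
print(avg)
```

Iterating over the groups instead of calling groupby-apply is a deliberate choice in this sketch: it keeps the original row index, so no reset_index step is needed afterwards.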
By the way, your sample output is wrong for time 0. My answer and Bacon's answer both show the correct result.