Is My Python Implementation Of The Davies-Bouldin Index Correct?
Solution 1:
Here is a shorter, faster, corrected version of the naive Davies-Bouldin index implementation above.
import numpy as np
from scipy.spatial.distance import euclidean

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    # centroid of each cluster
    centroids = [np.mean(k, axis=0) for k in cluster_k]
    # dispersion: mean distance of each cluster's points to its centroid
    variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    db = []
    for i in range(n_cluster):
        for j in range(n_cluster):
            if j != i:
                db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))
    return np.max(db) / n_cluster
Answering my own questions:
- the counter in the first draft (step 4) was correct but irrelevant
- there is no need to normalise intra- and inter-cluster distances
- there was a mistake in the calculation of the Euclidean distances
Note that you can find newer approaches that try to improve this index, notably the "New Version of Davies-Bouldin Index", which replaces the Euclidean distance with a cylindrical distance.
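A minimal usage sketch for the function above (make_blobs and KMeans are illustrative choices on my part, not part of the original post; any data and labelling will do):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# toy data: three well-separated Gaussian blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

# lower values indicate tighter, better-separated clusters
print(DaviesBouldin(X, labels))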
Solution 2:
Thank you for your implementation. I just have one question: isn't a division missing in the last line? In the last step, the value of max(db) should be divided by the number of clusters.
import numpy as np
from scipy.spatial.distance import euclidean

def DaviesBouldin(Daten, DatenLabels):
    n_cluster = len(np.bincount(DatenLabels))
    cluster_k = [Daten[DatenLabels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis=0) for k in cluster_k]
    # mean distance to the respective cluster centroid
    variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    db = []
    for i in range(n_cluster):
        for j in range(n_cluster):
            if j != i:
                db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]) / n_cluster)
    return np.max(db)
Maybe I overlooked that division because I'm new to Python. But in my plots (I iterate over a range of cluster counts) the max(db) value is very low at the beginning and increases afterwards. After scaling by the number of clusters, the graph looks better (a high max(db) value at the beginning that falls steadily as the number of clusters increases). Note that dividing each term by n_cluster inside the loop, as above, is equivalent to dividing max(db) once at the end: division by a positive constant does not change which element is largest (see the snippet below).
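A quick numerical check of that equivalence (the values are purely illustrative):

import numpy as np

db = np.array([0.8, 1.3, 2.1])
n_cluster = 3
# same result whether you divide the max or divide every term first
assert np.isclose(np.max(db) / n_cluster, np.max(db / n_cluster))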
Best regards
Solution 3:
Thanks for the code and revision - it really helped me to get started. The shorter, faster version is not entirely correct. I amended it to correctly average the dispersion scores of the most similar cluster for each cluster.
See https://www.researchgate.net/publication/224377470_A_Cluster_Separation_Measure for the original algorithm and explanation:
The DBI is the average of the similarity measures of each cluster with its most similar cluster.
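In symbols (my paraphrase of the paper's definition): with $k$ clusters, $S_i$ the mean distance of the points in cluster $i$ to its centroid $c_i$, and $d(c_i, c_j)$ the distance between centroids,

$$\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{S_i + S_j}{d(c_i, c_j)}$$

which is what the code below computes.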
import numpy as np
from scipy.spatial.distance import euclidean

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis=0) for k in cluster_k]
    # calculate cluster dispersion
    S = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    Ri = []
    for i in range(n_cluster):
        Rij = []
        # establish similarity between each cluster and all other clusters
        for j in range(n_cluster):
            if j != i:
                r = (S[i] + S[j]) / euclidean(centroids[i], centroids[j])
                Rij.append(r)
        # select Ri value of most similar cluster
        Ri.append(max(Rij))
    # get mean of all Ri values
    dbi = np.mean(Ri)
    return dbi
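As a sanity check, this corrected version should agree with scikit-learn's built-in sklearn.metrics.davies_bouldin_score (available since scikit-learn 0.20), which implements the same formula; the toy data here is an illustrative assumption:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)
labels = KMeans(n_clusters=4, random_state=1).fit_predict(X)

# the two values should match up to floating-point error
print(DaviesBouldin(X, labels))
print(davies_bouldin_score(X, labels))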
Solution 4:
This one is about 20x faster than the loop-based version further below; all computation is done in numpy.
import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform
def db_index(X, y):
    """
    Davies-Bouldin index is an internal evaluation method for
    clustering algorithms. Lower values indicate tighter clusters that
    are better separated.
    """
    # get unique labels
    if y.ndim == 2:
        y = np.argmax(y, axis=1)
    uniqlbls = np.unique(y)
    n = len(uniqlbls)
    # pre-calculate centroid and sigma
    centroid_arr = np.empty((n, X.shape[1]))
    sigma_arr = np.empty((n, 1))
    mask_arr = np.invert(np.eye(n, dtype='bool'))
    for i, k in enumerate(uniqlbls):
        Xk = X[np.where(y == k)[0], ...]
        Ak = np.mean(Xk, axis=0)
        centroid_arr[i, ...] = Ak
        # sigma: mean distance of the cluster's points to its centroid
        sigma_arr[i, ...] = np.mean(cdist(Xk, Ak.reshape(1, -1)))
    # compute pairwise centroid distances, make diagonal elements non-zero
    centroid_pdist_arr = squareform(pdist(centroid_arr)) + np.eye(n)
    # compute pairwise sigma sums
    sigma_psum_arr = squareform(pdist(sigma_arr, lambda u, v: u + v))
    # divide
    dbi_arr = np.divide(sigma_psum_arr, centroid_pdist_arr)
    # get mean of max of off-diagonal elements
    dbi_arr = np.where(mask_arr, dbi_arr, 0)
    dbi = np.mean(np.max(dbi_arr, axis=1))
    return dbi
Here's an implementation using numpy 1.14, scipy 1.1.0, and Python 3. Not much computational speed improvement, but it should have a slightly smaller memory footprint.
import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform
def db_index(X, y):
    """
    Davies-Bouldin index is an internal evaluation method for
    clustering algorithms. Lower values indicate tighter clusters that
    are better separated.

    Arguments
    ----------
    X : 2D array (n_samples, embed_dim)
        Vector for each example.
    y : 1D array (n_samples,) or 2D binary array (n_samples, n_classes)
        True labels for each example.

    Returns
    ----------
    dbi : float
        Calculated Davies-Bouldin index.
    """
    # get unique labels
    if y.ndim == 2:
        y = np.argmax(y, axis=1)
    uniqlbls = np.unique(y)
    n = len(uniqlbls)
    # pre-calculate centroid and sigma
    centroid_arr = np.empty((n, X.shape[1]))
    sigma_arr = np.empty(n)
    for i, k in enumerate(uniqlbls):
        Xk = X[np.where(y == k)[0], ...]
        Ak = np.mean(Xk, axis=0)
        centroid_arr[i, ...] = Ak
        sigma_arr[i] = np.mean(cdist(Xk, Ak.reshape(1, -1)))
    # loop over non-duplicate cluster pairs
    dbi = 0
    for i in range(n):
        max_Rij = 0
        for j in range(n):
            if j != i:
                Rij = np.divide(sigma_arr[i] + sigma_arr[j],
                                euclidean(centroid_arr[i, ...], centroid_arr[j, ...]))
                if Rij > max_Rij:
                    max_Rij = Rij
        dbi += max_Rij
    return dbi / n
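To check the speed claim, here is a rough timing sketch; db_index_vectorized and db_index_loop are hypothetical names for the two versions above (renamed so both can coexist), and the data sizes are arbitrary:

import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 16))     # 5000 samples, 16 dimensions
y = rng.integers(0, 10, size=5000)  # 10 arbitrary cluster labels

for fn in (db_index_vectorized, db_index_loop):
    start = time.perf_counter()
    val = fn(X, y)
    print(fn.__name__, val, round(time.perf_counter() - start, 4), 'seconds')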