
Is My Python Implementation Of The Davies-Bouldin Index Correct?

I'm trying to calculate the Davies-Bouldin Index in Python. Here are the 5 steps the code below tries to reproduce: for each cluster, compute the Euclidean distances between each point and the centroid; average those distances to get each cluster's dispersion; compute the Euclidean distance between each pair of centroids; for each pair of clusters, sum the two dispersions and divide by the centroid distance; then combine the resulting ratios into the final index.

Solution 1:

Here is a shorter, faster, corrected version of the naive Davies-Bouldin index implementation above.

import numpy as np
from scipy.spatial.distance import euclidean

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    # centroid of each cluster
    centroids = [np.mean(k, axis=0) for k in cluster_k]
    # mean distance of each point to its cluster centroid (dispersion)
    variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    db = []

    for i in range(n_cluster):
        for j in range(n_cluster):
            if j != i:
                # similarity ratio between clusters i and j
                db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))

    return np.max(db) / n_cluster
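
For example, a quick smoke test on toy data (labels are assumed to be integer cluster assignments starting at 0):

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(DaviesBouldin(X, labels))  # low value: two tight, well-separated clusters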

Answering my own questions:

  • the counter on the first draft (step 4) was correct BUT irrelevant
  • there's no need to normalise intra and inter cluster distances
  • there was a mistake when calculating Euclidean distances

Note that you can find newer approaches that try to improve this index, notably the "New Version of Davies-Bouldin Index" that replaces Euclidean distance with cylindrical distance.

Solution 2:

Thank you for your implementation. I just have one question: is there a division missing in the last line? In the last step, the value of max(db) should be divided by the number of clusters.

def DaviesBouldin(Daten, DatenLabels):
    n_cluster = len(np.bincount(DatenLabels))
    cluster_k = [Daten[DatenLabels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis=0) for k in cluster_k]
    # mean distance of each point to its cluster centroid
    variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    db = []

    for i in range(n_cluster):
        for j in range(n_cluster):
            if j != i:
                # divide by n_cluster here instead of at the end
                db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]) / n_cluster)
    return np.max(db)

Maybe I overlooked that division because I'm new to Python. But in my graphs (I'm iterating over a range of cluster counts, as in the sketch below) the value of DB.max is very low at the beginning and increases afterwards. After scaling by the number of clusters, the graph looks better (a high DB.max value at the beginning, falling steadily as the number of clusters increases).
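
A minimal sketch of that kind of sweep, assuming scikit-learn's KMeans for the clustering (X stands in for your data array):

from sklearn.cluster import KMeans

scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    scores.append(DaviesBouldin(X, km.labels_))
# plot scores against range(2, 11) to see how the index changes with k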

Best regards

Solution 3:

Thanks for the code and revision, which really helped me get started. The shorter, faster version is not entirely correct. I amended it so that, for each cluster, it takes the similarity score of the most similar cluster and then averages those scores.

See https://www.researchgate.net/publication/224377470_A_Cluster_Separation_Measure for original algorithm and explanation:

The DBI is the average of the similarity measures of each cluster with its most similar cluster.
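
In symbols, with S_i the dispersion of cluster i and M_ij the distance between the centroids of clusters i and j:

R_ij = (S_i + S_j) / M_ij
DBI  = (1/N) * sum over i of max_{j != i} R_ij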

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis=0) for k in cluster_k]

    # calculate cluster dispersion
    S = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    Ri = []

    for i in range(n_cluster):
        Rij = []
        # establish similarity between each cluster and all other clusters
        for j in range(n_cluster):
            if j != i:
                r = (S[i] + S[j]) / euclidean(centroids[i], centroids[j])
                Rij.append(r)
        # select Ri value of the most similar cluster
        Ri.append(max(Rij))

    # get mean of all Ri values
    dbi = np.mean(Ri)

    return dbi
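
If you have scikit-learn 0.20 or newer, you can sanity-check this version against the built-in implementation (with X and labels as in your own data; the two should agree closely):

from sklearn.metrics import davies_bouldin_score

print(DaviesBouldin(X, labels))
print(davies_bouldin_score(X, labels))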

Solution 4:

This one is about 20x faster than the loop-based code below; all computation is done in numpy.

import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform

def db_index(X, y):
    """
    Davies-Bouldin index is an internal evaluation method for
    clustering algorithms. Lower values indicate tighter clusters that
    are better separated.
    """
    # get unique labels
    if y.ndim == 2:
        y = np.argmax(y, axis=1)
    uniqlbls = np.unique(y)
    n = len(uniqlbls)
    # pre-calculate centroid and sigma
    centroid_arr = np.empty((n, X.shape[1]))
    sigma_arr = np.empty((n,1))
    dbi_arr = np.empty((n,n))
    mask_arr = np.invert(np.eye(n, dtype='bool'))
    for i,k in enumerate(uniqlbls):
        Xk = X[np.where(y==k)[0],...]
        Ak = np.mean(Xk, axis=0)
        centroid_arr[i,...] = Ak
        sigma_arr[i,...] = np.mean(cdist(Xk, Ak.reshape(1,-1)))
    # compute pairwise centroid distances, make diagonal elements non-zero
    centroid_pdist_arr = squareform(pdist(centroid_arr)) + np.eye(n)
    # compute pairwise sigma sums
    sigma_psum_arr = squareform(pdist(sigma_arr, lambda u,v: u+v))
    # divide 
    dbi_arr = np.divide(sigma_psum_arr, centroid_pdist_arr)
    # get mean of max of off-diagonal elements
    dbi_arr = np.where(mask_arr, dbi_arr, 0)
    dbi = np.mean(np.max(dbi_arr, axis=1))
    return dbi
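
A quick toy benchmark for the speed claim, assuming the loop-based DaviesBouldin from Solution 3 is defined in the same session (exact numbers will vary with data size and hardware):

import time

np.random.seed(0)
X = np.random.randn(2000, 8)
y = np.random.randint(0, 5, size=2000)

t0 = time.perf_counter()
db_index(X, y)
t1 = time.perf_counter()
DaviesBouldin(X, y)
t2 = time.perf_counter()
print("vectorized: %.3fs, looped: %.3fs" % (t1 - t0, t2 - t1))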

Here's an implementation using numpy 1.14, scipy 1.1.0, and python 3. There is not much computational speed improvement, but it should have a slightly smaller memory footprint.

import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform

def db_index(X, y):
    """
    Davies-Bouldin index is an internal evaluation method for
    clustering algorithms. Lower values indicate tighter clusters that 
    are better separated.

    Arguments
    ----------
    X : 2D array (n_samples, embed_dim)
    Vector for each example.

    y : 1D array (n_samples,) or 2D binary array (n_samples, n_classes)
    True labels for each example.

    Returns
    ----------
    dbi : float
        Calculated Davies-Bouldin index.
    """# get unique labelsif y.ndim == 2:
        y = np.argmax(axis=1)
    uniqlbls = np.unique(y)
    n = len(uniqlbls)
    # pre-calculate centroid and sigma
    centroid_arr = np.empty((n, X.shape[1]))
    sigma_arr = np.empty(n)
    for i,k in enumerate(uniqlbls):
        Xk = X[np.where(y==k)[0],...]
        Ak = np.mean(Xk, axis=0)
        centroid_arr[i,...] = Ak
        sigma_arr[i,...] = np.mean(cdist(Xk, Ak.reshape(1,-1)))
    # loop over pairs of distinct clusters, keeping the worst-case ratio for each
    dbi = 0
    for i in range(n):
        max_Rij = 0
        for j in range(n):
            if j != i:
                Rij = np.divide(sigma_arr[i] + sigma_arr[j], 
                                euclidean(centroid_arr[i,...], centroid_arr[j,...]))
                if Rij > max_Rij:
                    max_Rij = Rij
        dbi += max_Rij
    return dbi/n
