Skip to content Skip to sidebar Skip to footer

How To Determine If Two Partitions (clusterings) Of Data Points Are Identical?

I have n data points in some arbitrary space and I cluster them. The result of my clustering algorithm is a partition represented by an int vector l of length n assigning each poin

Solution 1:

When are two partition identical?

Probably if they have the exact same members.

So if you just want to test for identity, you can do the following:

Substitute each partition ID with the smallest object ID in the partition.

Then two partitionings are identical if and only if this representation is identical.

In your example above, lets assume the vector index 1 .. 7 is your object ID. Then I would get the canonical form

[ 1114467 ]
  ^ first occurrence at pos1 of 1 in l_1 / 2 in l_2
        ^ first occurrence at pos4

for l_1 and l_2, whereas l_3 canonicalizes to

[ 1 1 1 4 4 6 6 ]

To make it more clear, here is another example:

l_4 = [ A B 0 D 0 B A ]

canonicalizes to

[ 1 2 3 4 3 2 1 ]

since the first occurence of cluster "A" is at position 1, "B" at position 2 etc.

If you want to measure how similar two clusterings are, a good approach is to look at precision/recall/f1 of the object pairs, where the pair (a,b) exists if and only if a and b belong to the same cluster.

Update: Since it was claimed that this is quadratic, I will further clarify.

To produce the canonical form, use the following approach (actual python code):

defcanonical_form(li):
  """ Note, this implementation overwrites li """
  first = dict()
  for i inrange(len(li)):
    v = first.get(li[i])
    if v isNone:
      first[li[i]] = i
      v = i
    li[i] = v
  return li

print canonical_form([ 1, 1, 1, 0, 0, 2, 6 ])
# [0, 0, 0, 3, 3, 5, 6]print canonical_form([ 2, 2, 2, 9, 9, 3, 1 ])
# [0, 0, 0, 3, 3, 5, 6]print canonical_form([ 2, 2, 2, 9, 9, 3, 3 ])
# [0, 0, 0, 3, 3, 5, 5]print canonical_form(['A','B',0,'D',0,'B','A'])
# [0, 1, 2, 3, 2, 1, 0]print canonical_form([1,1,1,0,0,2,6]) == canonical_form([2,2,2,9,9,3,1])
# Trueprint canonical_form([1,1,1,0,0,2,6]) == canonical_form([2,2,2,9,9,3,3])
# False

Solution 2:

If you are going to relabel your partitions, as has been previously suggested, you will potentially need to search through n labels for each of the n items. I.e. the solutions are O(n^2).

Here is my idea: Scan through both lists simultaneously, maintaining a counter for each partition label in each list. You will need to be able to map partition labels to counter numbers. If the counters for each list do not match, then the partitions do not match. This would be O(n).

Here is a proof of concept in Python:

l_1 = [ 1, 1, 1, 0, 0, 2, 6 ]

l_2 = [ 2, 2, 2, 9, 9, 3, 1 ]

l_3 = [ 2, 2, 2, 9, 9, 3, 3 ]

d1 = dict()
d2 = dict()
c1 = []
c2 = []

# assume lists same length

match = Truefor i inrange(len(l_1)):
    if l_1[i] notin d1:
        x1 = len(c1)
        d1[l_1[i]] = x1
        c1.append(1)
    else:
        x1 = d1[l_1[i]]
        c1[x1] += 1if l_2[i] notin d2:
        x2 = len(c2)
        d2[l_2[i]] = x2
        c2.append(1)
    else:
        x2 = d2[l_2[i]]
        c2[x2] += 1if x1 != x2 or  c1[x1] != c2[x2]:
        match = Falseprint"match = {}".format(match)

Solution 3:

In matlab:

function tf = isIdenticalClust( l_1, l_2 )%
% checks if partitions l_1 and l_2 are identical or not
%
tf =all( accumarray({l_1}, l_2 ,[],@(x)all( x == x(1)))==1)&&...
     all( accumarray({l_2}, l_1 ,[],@(x)all( x == x(1)))==1);

What this does: groups all elements of l_1 according to the partition of l_2 and checks if all elements of l_1 at each cluster are all identical. Repeating the same for partitioning l_2 according to l_1. If both grouping yields the homogenous clusters - they are identical.

Post a Comment for "How To Determine If Two Partitions (clusterings) Of Data Points Are Identical?"