-1

Say that, given a dataset X = (x_1, ..., x_n) with n instances of dimensionality d, I cluster all instances in X using two different clustering algorithms. This will result in two separate clusterings of the same dataset, C' and C''.

Is there a way to find the intersection between those two clusterings? That is, a third clustering C which considers (x_i, x_j) to be in the same cluster iff (x_i, x_j) belong to the same cluster both according to C' and to C''. (And if so, what is its complexity?)

In other words: C(x_i) = C(x_j) iff [C'(x_i) = C'(x_j) and C''(x_i) = C''(x_j)]

Moreover, if such a method exists, does it also extend to the case where the number of clusterings to compare is greater than two?

Procope
  • 91
  • 1
  • 7
  • 1
    How do find the mapping between clusters in C' and C'' so that you can decide whether an instance belongs to the "same" cluster in both clusterings? – Pibben Oct 14 '19 at 09:26
  • 1
    Are you looking for an algorithm to combine the results of two unsupervised clusterings? Without any further knowledge of the data, like shape? – Mobold Oct 14 '19 at 10:09
  • @Pibben What I mean is C(x_i) = C(x_j) iff C'(x_i) = C'(x_j) and C''(x_i) = C''(x_j). There is no constraint such as C'(x_i) = C''(x_i). – Procope Oct 14 '19 at 11:57
  • @Mobold exactly – Procope Oct 14 '19 at 11:58

2 Answers2

1

Let C' have m clusters and C'' have k clusters.

Then a point x in clusters i=C'(x) and j=C''(x) is now put into cluster C(x)=i*k+j=C'(x)*k+C''(x).

Think of this as making a m*k matrix of clusters, and each clustering determining he row or column. Obviously this could be extended to tensors.

In fact many of the evaluation measures such as ARI and NMI work this way, except that they only count the intersection sizes.

You get up to m*k clusters, but some may be empty.

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
0

What you can do is to create a graph where the instances are the nodes and there are edges between the nodes if C'(x_i) = C'(x_j) and C''(x_i) = C''(x_j). This can be done in O(n^2).

Then you can find the connected components of the graph. This is also O(n^2).

The connected components of the graph are your final clusters.

Pibben
  • 1,876
  • 14
  • 28