How to find intersection between two (or more) clusterings of the same dataset

Question

Say that, given a dataset X = (x_1, ..., x_n) with n instances of dimensionality d, I cluster all instances in X using two different clustering algorithms. This will result in two separate clusterings of the same dataset, C' and C''.

Is there a way to find the intersection between those two clusterings? That is, a third clustering C which considers (x_i, x_j) to be in the same cluster iff (x_i, x_j) belong to the same cluster both according to C' and to C''. (And if so, what is its complexity?)

In other words: C(x_i) = C(x_j) iff [C'(x_i) = C'(x_j) and C''(x_i) = C''(x_j)]

Moreover, if such a method exists, does it also extend to the case where the number of clusterings to compare is greater than two?

How do find the mapping between clusters in C' and C'' so that you can decide whether an instance belongs to the "same" cluster in both clusterings? — Pibben, Oct 14 '19 at 09:26
Are you looking for an algorithm to combine the results of two unsupervised clusterings? Without any further knowledge of the data, like shape? — Mobold, Oct 14 '19 at 10:09
@Pibben What I mean is C(x_i) = C(x_j) iff C'(x_i) = C'(x_j) and C''(x_i) = C''(x_j). There is no constraint such as C'(x_i) = C''(x_i). — Procope, Oct 14 '19 at 11:57

score 1 · Answer 1 · answered Oct 27 '19 at 08:06

Let C' have m clusters and C'' have k clusters.

Then a point x in clusters i=C'(x) and j=C''(x) is now put into cluster C(x)=i*k+j=C'(x)*k+C''(x).

Think of this as making a m*k matrix of clusters, and each clustering determining he row or column. Obviously this could be extended to tensors.

In fact many of the evaluation measures such as ARI and NMI work this way, except that they only count the intersection sizes.

You get up to m*k clusters, but some may be empty.

score 0 · Accepted Answer · answered Oct 14 '19 at 12:24

0

What you can do is to create a graph where the instances are the nodes and there are edges between the nodes if C'(x_i) = C'(x_j) and C''(x_i) = C''(x_j). This can be done in O(n^2).

Then you can find the connected components of the graph. This is also O(n^2).

The connected components of the graph are your final clusters.

answered Oct 14 '19 at 12:24

Pibben

1,876
14
28

But it can also be done in O(n). Also, connected components will give false results. You meant cliques. – Has QUIT--Anony-Mousse Oct 27 '19 at 08:03
Could you briefly elaborate on why connected components will give false results? – Procope Oct 28 '19 at 09:21

How to find intersection between two (or more) clusterings of the same dataset

2 Answers2