When doing k-means cluster analysis, how do I manually define the cluster centers? For example, I want to say my cluster centers are [1,2,3] and [3,4,5], and then assign my vectors to these predefined centers.

Something like kmeans.cluster_centers_ = [[1,2,3],[3,4,5]]?

To work around my problem, this is what I do at the moment:

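from sklearn.cluster import KMeans

# one cluster per vector (vec holds the sentence vectors)
number_of_clusters = len(vec)
kmeans = KMeans(n_clusters=number_of_clusters, init='k-means++', n_init=100)
kmeans.fit(vec)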

This basically defines one cluster per vector, but it takes ages to compute because I have thousands of vectors/sentences. There must be an option to set the vector coordinates directly as cluster centers without computing them with the k-means algorithm (the resulting centers are basically the vector coordinates anyway once the algorithm has run).

Edit to be more specific about my task:

I have tons of vectors (generated from sentences) and I want to cluster them. Imagine I have two columns of sentences, and I always want to match a column B sentence to a column A sentence, never column A sentences to each other. That's why I want to set the column A vectors as cluster centers and then find the closest B vectors to these centers. Hope that makes sense.

I am using sklearn's KMeans at the moment.

Adrian_G
  • Your question is very vague. Are you implementing the clustering algorithm yourself, or are you using some kind of library? Please post a simple example of your code so we know what's going on – Eran Moshe Feb 13 '20 at 10:03
  • Generally, in clustering algorithms we do not initialize clusters at specific points but randomize them. I don't know if there's an option to do so; I've never tried. You can look at the documentation of the library you're working with to see if there is such an option – Eran Moshe Feb 13 '20 at 10:05
  • I think you need a different algorithm here. If you manually define where the clusters are, you're not exactly analyzing those clusters – Hymns For Disco Feb 13 '20 at 10:06
  • I have tons of vectors (generated from sentences) and I want to cluster them. Imagine I have two columns of sentences, and I always want to match a column B sentence to a column A sentence, never column A sentences to each other. That's why I want to set the column A vectors as cluster centers and then find the closest B vectors to these centers. Hope that makes sense – Adrian_G Feb 13 '20 at 10:10
  • Isn't that just a minimum distance problem? Simply cluster column A into n clusters. Then, for each B sentence, find the distance to the n A cluster centers and select the shortest distance. Not really a clustering problem. – Jason Chia Feb 13 '20 at 10:18
  • @HymnsForDisco I want to define clusterCENTERS and then cluster the new vector set to those centers. Hope that makes sense – Adrian_G Feb 13 '20 at 10:31
  • @Adrian_G you are using "cluster" as a verb here, what exactly do you mean by that? Do you want to assign every data point to one of the pre-existing clusters that you have defined? – Hymns For Disco Feb 13 '20 at 10:33
  • @HymnsForDisco exactly! Sorry for the confusion – Adrian_G Feb 13 '20 at 10:35
  • @Adrian_G the way k-means works, it simply assigns every data point to whichever cluster center it is closest to. No need to invoke the whole algorithm here then; just use for loops to check which cluster center is closest to each of your points. – Hymns For Disco Feb 13 '20 at 10:37
  • Thanks, I'll give it a try :-) – Adrian_G Feb 13 '20 at 10:40
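
The nearest-center assignment suggested in the comments above could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name assign_to_centers and the sample points are made up, and NumPy broadcasting is used in place of explicit for loops.

```python
import numpy as np

def assign_to_centers(points, centers):
    """Return, for each point, the index of its nearest center (Euclidean).

    Hypothetical helper for illustration; not part of sklearn.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Broadcasting yields an array of shape (n_points, n_centers, n_features);
    # the norm over the last axis gives the pairwise distance matrix.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    # Index of the smallest distance in each row = nearest center.
    return dists.argmin(axis=1)

centers = [[1, 2, 3], [3, 4, 5]]            # the predefined centers from the question
points = [[1, 2, 2], [4, 4, 4], [0, 0, 0]]  # made-up vectors to assign
labels = assign_to_centers(points, centers)
print(labels)  # [0 1 0]
```

The intermediate array grows with n_points * n_centers * n_features, so for very large inputs it may be worth processing the points in chunks.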

1 Answer

I think I know what you want to do. You want to manually select the centroids for k-means from some known examples and then perform the clustering so that the closest data points are assigned to your predefined centroids.

The parameter you are looking for is init, which controls how k-means initializes its centroids and also accepts an array of shape (n_clusters, n_features); see the documentation.

I have prepared a small example that would do exactly this.

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial import distance_matrix

# 5 datapoints with 3 features
data = [[1, 0, 0],
        [1, 0.2, 0],
        [0, 0, 1],
        [0, 0, 0.9],
        [1, 0, 0.1]]

X = np.array(data)

distance_matrix(X,X)

The pairwise distance matrix shows which examples are closest to each other.

> array([[0.        , 0.2       , 1.41421356, 1.3453624 , 0.1       ],
>       [0.2       , 0.        , 1.42828569, 1.36014705, 0.2236068 ],
>       [1.41421356, 1.42828569, 0.        , 0.1       , 1.3453624 ],
>       [1.3453624 , 1.36014705, 0.1       , 0.        , 1.28062485],
>       [0.1       , 0.2236068 , 1.3453624 , 1.28062485, 0.        ]])

You can select certain data points to be used as your initial centroids:

centroid_idx = [0,2] # let data point 0 and 2 be our centroids
centroids = X[centroid_idx,:]
print(centroids) # [[1. 0. 0.]
                 # [0. 0. 1.]]

# limit k-means to a single iteration so the centroids stay close to the chosen points;
# n_init=1 because an explicit init only ever runs once
kmeans = KMeans(n_clusters=2, init=centroids, n_init=1, max_iter=1)

kmeans.fit(X)
kmeans.labels_

>>> array([0, 0, 1, 1, 0], dtype=int32)

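As you can see, k-means labels the data points as expected. Note that a single iteration still performs one centroid update, so kmeans.cluster_centers_ can differ slightly from the points you passed in. Omit the max_iter parameter if you want the centroids to be refined until convergence.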
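
For the column A / column B task described in the question, one possible sketch is to fit KMeans on the A vectors only, using those same vectors as init, and then call predict on the B vectors. The arrays A and B below are made-up stand-ins for the sentence vectors. Because every A vector is its own cluster, the centers should not move during fitting.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up stand-ins: A holds the vectors to use as fixed centers,
# B holds the vectors that should be assigned to them.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
B = np.array([[1.0, 0.2, 0.0],
              [0.0, 0.0, 0.9],
              [1.0, 0.0, 0.1]])

# One cluster per A vector, initialized exactly at the A vectors.
# n_init=1 avoids the "performing only one init" warning.
kmeans = KMeans(n_clusters=len(A), init=A, n_init=1)
kmeans.fit(A)

# predict() assigns each B vector to its nearest center, i.e. nearest A vector.
b_labels = kmeans.predict(B)
print(b_labels)  # [0 1 0]
```

This assumes the rows of A are distinct; with duplicate rows, some clusters would end up sharing the same center.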

53RT
  • That is very precisely what I want to achieve! Thanks a lot for your time to answer. – Adrian_G Feb 18 '20 at 08:04
  • Would you happen to know why running "kmeans.fit(X)" and "kmeans.labels_" results in: "C:\Users\ga2943\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\cluster\k_means_.py:972: RuntimeWarning: Explicit initial center position passed: performing only one init in k-means instead of n_init=10 return_n_iter=True)" – Adrian_G Feb 18 '20 at 08:06
  • I saw your problem; the warning you ran into is already answered here: https://stackoverflow.com/questions/28862334/k-means-with-selected-initial-centers – 53RT Feb 18 '20 at 08:43