10

I have a numpy text file array at: https://github.com/alvations/anythingyouwant/blob/master/WN_food.matrix

It's a distance matrix between terms and each other, my list of terms are as such: http://pastebin.com/2xGt7Xjh

I used the following code to generate a hierarchical clustering:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

matrix = np.loadtxt('WN_food.matrix')
n_clusters = 518
model = AgglomerativeClustering(n_clusters=n_clusters,
                                linkage="average", affinity="cosine")
model.fit(matrix)

To get the clusters for each term, I could have done:

for term, clusterid in enumerate(model.labels_):
    print(term, clusterid)

But how do I traverse the tree that the AgglomerativeClustering outputs?

Is it possible to convert it into a scipy dendrogram (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html)? And after that how do I traverse the dendrogram?

alvas
  • The [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) would suggest looking at the `children_` attribute of `model`. – jme Dec 09 '14 at 19:41
  • I used `children_` and it's giving me lists of two nodes; it's not traversing but returning children, and I've no idea what those are. The node numbers from `children_` also go beyond my number of nodes... – alvas Dec 10 '14 at 10:29
  • A full hierarchical clustering of `n` objects produces a tree with `2n - 1` nodes. As the documentation says: "Values less than n_samples refer to leaves of the tree. A greater value i indicates a node with children children_[i - n_samples]". That should be sufficient information to traverse the tree. – jme Dec 10 '14 at 16:30
  • @jme, pardon the noobie-ness, but what does the documentation mean? – alvas Dec 10 '14 at 16:38
  • Each node in the tree is assigned an ID, `i`. If the ID is less than the number of input objects `n_samples`, then the node is a leaf. Otherwise it's an internal node, and it joins two other nodes. The two nodes joined by node `i` are found in `children_[i - n_samples]`. As an aside, if your goal is to convert this to a scipy dendrogram, why not just use [`scipy.cluster.hierarchy.linkage`](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage) rather than `sklearn`? – jme Dec 10 '14 at 17:14
  • I'll answer this last question since I found this page because of the problem. scipy.cluster.hierarchy.linkage has a problem with throwing seg faults with large distance matrices. Apparently it is a known problem since 2012 that no one is interested in fixing. https://github.com/scipy/scipy/issues/2089 – Dave Kincaid May 02 '15 at 15:15
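jme's scipy suggestion above can be sketched as follows; it sidesteps sklearn entirely and produces a linkage matrix that feeds straight into `dendrogram` (subject to Dave's large-matrix caveat). `X` here is random toy data standing in for the WN_food matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data: 20 points in 10 dimensions (an assumption, not the WN_food matrix).
X = np.random.randn(20, 10)

# Average-linkage clustering with cosine distance, matching the question's settings.
Z = linkage(X, method='average', metric='cosine')

# Each row of Z is one merge: [left_id, right_id, distance, new_cluster_size].
info = dendrogram(Z, no_plot=True)
print(info['leaves'])  # leaf order along the dendrogram's x-axis
```

With `no_plot=True` you get the layout dictionary without drawing anything; drop it (and call `matplotlib.pyplot.show()`) to see the actual plot.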

2 Answers

23

I've answered a similar question for sklearn.cluster.ward_tree: How do you visualize a ward tree from sklearn.cluster.ward_tree?

AgglomerativeClustering outputs the tree in the same way, in the children_ attribute. Here's an adaptation of the code in the ward tree question for AgglomerativeClustering. It outputs the structure of the tree in the form (node_id, left_child, right_child) for each node of the tree.

import itertools

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])
model = AgglomerativeClustering(linkage="average", affinity="cosine")
model.fit(X)

# Internal nodes are numbered starting at n_samples, in merge order;
# children_ gives the pair of nodes joined at each step.
ii = itertools.count(X.shape[0])
print([{'node_id': next(ii), 'left': x[0], 'right': x[1]} for x in model.children_])

https://stackoverflow.com/a/26152118

A.P.
  • Is there a way to know which item is in the node? – alvas Dec 16 '14 at 14:11
  • How does it relate to the cluster labels `model.labels_`? – alvas Dec 16 '14 at 14:20
  • The node number is also an index into the data for each leaf of the tree. For example, given {'left': 1, 'right': 2, 'node_id': 10}, node 10 has leaves 1 and 2 as children, and X[1] is the data vector for leaf 1. – A.P. Dec 17 '14 at 01:00
  • You can also do `dict(enumerate(model.children_, model.n_leaves_))`, which will give you a dictionary where each key is the ID of a node and the value is the pair of IDs of its children. – user76284 Oct 02 '16 at 18:04
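The numbering rule jme describes in the comments can be turned into a short recursive traversal that answers "which items are in a node". A sketch on toy data with default clustering parameters (an assumption, not the asker's setup):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.randn(6, 4)  # toy data: 6 samples
model = AgglomerativeClustering(n_clusters=2).fit(X)
n_samples = X.shape[0]

def leaves(node_id):
    """Collect the sample indices under a node, using the rule that IDs
    below n_samples are leaves, and that node i >= n_samples stores its
    two children in children_[i - n_samples]."""
    if node_id < n_samples:
        return [node_id]
    left, right = model.children_[node_id - n_samples]
    return leaves(left) + leaves(right)

# A full tree over n samples has 2n - 1 nodes; the root is the last one.
print(sorted(leaves(2 * n_samples - 2)))  # every sample index, 0..5
```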
4

Adding to A.P.'s answer, here is code that will give you a dictionary of membership: members[node_id] gives all the data point indices (0 to n_points - 1) that fall under that node.

on_split is a simple reformat of A.P.'s clusters that gives the two clusters formed when node_id is split.

up_merge tells which node a given node_id merges into, and which other node_id it must be combined with to form it.

import copy
import itertools

# data_x and fit_cluster are the data array and the fitted model
# (X and model in A.P.'s answer above).
ii = itertools.count(data_x.shape[0])
clusters = [{'node_id': next(ii), 'left': x[0], 'right': x[1]}
            for x in fit_cluster.children_]

n_points = data_x.shape[0]
members = {i: [i] for i in range(n_points)}
for cluster in clusters:
    node_id = cluster["node_id"]
    members[node_id] = copy.deepcopy(members[cluster["left"]])
    members[node_id].extend(copy.deepcopy(members[cluster["right"]]))

on_split = {c["node_id"]: [c["left"], c["right"]] for c in clusters}
up_merge = {c["left"]: {"into": c["node_id"], "with": c["right"]} for c in clusters}
up_merge.update({c["right"]: {"into": c["node_id"], "with": c["left"]} for c in clusters})
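A self-contained usage sketch of the membership dictionary above, with toy data standing in for data_x and fit_cluster (both names assumed from the snippet):

```python
import copy
import itertools
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy stand-ins for the snippet's data_x and fit_cluster.
data_x = np.random.randn(5, 3)
fit_cluster = AgglomerativeClustering(n_clusters=2).fit(data_x)

ii = itertools.count(data_x.shape[0])
clusters = [{'node_id': next(ii), 'left': x[0], 'right': x[1]}
            for x in fit_cluster.children_]

n_points = data_x.shape[0]
members = {i: [i] for i in range(n_points)}
for cluster in clusters:
    node_id = cluster["node_id"]
    members[node_id] = copy.deepcopy(members[cluster["left"]])
    members[node_id].extend(copy.deepcopy(members[cluster["right"]]))

# The root node (the last merge, ID 2n - 2) contains every data point.
root = 2 * n_points - 2
print(sorted(members[root]))  # [0, 1, 2, 3, 4]
```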
David Bernat