I am completely redrafting this question following the advice of @MrFlick.

Assume I have a data.frame like the following:

set.seed(1)

group <- rep(1:10, sample(50:200, 10, replace = TRUE))
n <- length(group)  # 1328 in my run; computed here so all columns always have the same length
gender <- factor(sample(0:1, n, replace = TRUE, prob = c(0.55, 0.45)))
country <- factor(sample(6030:6098, n, replace = TRUE))
ethnicity <- factor(sample(7040:7101, n, replace = TRUE))
yearbirth <- sample(1950:1986, n, replace = TRUE)
df <- data.frame(group, gender, country, ethnicity, yearbirth)

For each group, I would like to calculate the Silhouette Width (SW) corresponding to the 'optimal' number of clusters. To do so, I prepared the following function, which I would like to apply to every group:

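library(cluster)  # daisy() and pam()
library(fpc)

ASW <- function(x) {
  x <- as.data.frame(x)
  id <- as.integer(x[1, 1])  # the group id is in the first column
  people <- nrow(x)
  if (people < 3) {
    # pam() needs k < n, so k = 2 requires at least 3 people;
    # smaller groups get a Silhouette Width of 0
    p <- 0
  } else {
    x <- x[, -1]  # drop the group column before computing dissimilarities
    diss <- daisy(x, metric = "gower")
    # cap the number of clusters at one-third of the group size, but never below 2
    maxclus <- max(2, round(people / 3))
    asw <- numeric(maxclus)
    for (k in 2:maxclus) asw[k] <- pam(diss, k, diss = TRUE)$silinfo$avg.width
    k.best <- which.max(asw)
    p <- asw[k.best]
  }
  # return the group id and the best average Silhouette Width
  c(id, p)
}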

As a final output, I would like to obtain a data.frame with the group number (id) in the first column and the Silhouette Width value corresponding to the optimal number of clusters in the second. If the group contains only one individual, I would like the Silhouette Width to be 0, since SW is not defined for fewer than 2 clusters.

Using all variables except group, I would like to compute a dissimilarity matrix with daisy from the cluster package. To my knowledge, daisy is the only function capable of computing a dissimilarity matrix from mixed variables. Next, I would pass that dissimilarity matrix to pam and calculate the Silhouette Width for various cluster configurations. To shorten the computing time, especially with large groups, I am imposing a maximum number of clusters equal to one-third of the number of individuals in the group. At this point, I would like the function to take the SW value corresponding to the optimal number of clusters (determined by the maximum Silhouette Width value) and store it, together with the corresponding group id, in a data.frame, here called aswout.
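The steps described above can be walked through on a single group. This is only a minimal sketch on made-up toy data: the data.frame one_group and the names max_k, widths, best_k and best_width are assumptions for illustration, not part of the original code. Unlike ASW, it also keeps track of which k gave the best value.

```r
library(cluster)  # daisy() and pam()

set.seed(42)
# Hypothetical toy data standing in for the rows of one group (group column already dropped)
one_group <- data.frame(
  gender    = factor(sample(0:1, 30, replace = TRUE)),
  country   = factor(sample(c("A", "B", "C"), 30, replace = TRUE)),
  yearbirth = sample(1950:1986, 30, replace = TRUE)
)

# Gower dissimilarity handles the mix of factors and numeric columns
diss <- daisy(one_group, metric = "gower")

# Try k = 2 .. one-third of the group size (never fewer than 2)
max_k <- max(2, round(nrow(one_group) / 3))
ks <- 2:max_k
widths <- vapply(
  ks,
  function(k) pam(diss, k, diss = TRUE)$silinfo$avg.width,
  numeric(1)
)

# The 'optimal' k is the one with the largest average Silhouette Width
best_k <- ks[which.max(widths)]
best_width <- max(widths)
```

Indexing ks by which.max(widths) avoids relying on the position in the vector being equal to k, which only holds when the vector starts at k = 1.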

Unfortunately, the function does not seem to work properly (I tried it on the first group only), and it is not clear to me how to make it 'cycle' over all the groups.

I hope the question is clear. Let me know if there is something you don't understand and I will add more information. I am really thankful for any help on this!

All the best, Riccardo

EDIT:

The ASW function now works. I am trying to make it cycle over all groups in a data frame. I learned from another post that it is bad practice to grow a data.frame inside a function as the function executes. This, however, was the purpose of my aswout data.frame. I am now looking for a way to achieve the same result, with the function applied to each group and the results collected in an output data.frame, without growing that data.frame inside the function.
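One common base-R pattern for this is split, then lapply, then a single rbind at the end, so nothing is grown row by row. The following is a minimal sketch, not a definitive solution: apply_by_group, toy and toy_fun are hypothetical names introduced here, and toy_fun is only a stand-in with the same shape as ASW (it takes one group's rows and returns the id and one value).

```r
# Hypothetical helper: apply `fun` to each group's rows and bind the results once.
# `fun` is assumed to return a numeric vector c(id, value), as ASW does.
apply_by_group <- function(data, fun, group_col = "group") {
  pieces <- split(data, data[[group_col]])  # list with one data.frame per group
  rows <- lapply(pieces, fun)               # one c(id, value) vector per group
  out <- as.data.frame(do.call(rbind, rows))
  names(out) <- c("group", "asw")
  rownames(out) <- NULL
  out
}

# Toy data and a stand-in function, only to show the mechanics
toy <- data.frame(group = c(1, 1, 2, 2, 2), value = c(10, 20, 1, 2, 3))
toy_fun <- function(x) c(x[1, 1], mean(x[, 2]))

aswout <- apply_by_group(toy, toy_fun)
```

With the data and function from the question, the equivalent call would be apply_by_group(df, ASW), assuming group is the first column of df as in the example data.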

  • Do you know exactly what code you would write to implement the "hierarchical clustering with Ward method and to select the optimum number of clusters I would like to use the Calinski–Harabasz maximum Pseudo-F" in R? If your only problem is that you're not sure how to loop over the groups, then it would be helpful if you shared code that worked for just one group, then someone can help translate that to something that would work for all groups. – MrFlick Jul 18 '14 at 21:25
  • @MrFlick Not at the moment, but I am working on that to edit my question and add some elements. In this way I hope to get to the point where the help I need is not as substantive as it probably is now. Thanks for your suggestion. – Riccardo Jul 19 '14 at 13:34

0 Answers