I am completely redrafting this question following the advice of @MrFlick.
Assume I have a data.frame like the following:
set.seed(1)
group <- rep(1:10, sample(50:200, 10, replace=TRUE))
n <- length(group)  # total number of rows (1328 in the original run)
gender <- factor(sample(0:1, n, replace=TRUE, prob=c(0.55, 0.45)))
country <- factor(sample(6030:6098, n, replace=TRUE))
ethnicity <- factor(sample(7040:7101, n, replace=TRUE))
yearbirth <- sample(1950:1986, n, replace=TRUE)
df <- data.frame(group, gender, country, ethnicity, yearbirth)
For each group, I would like to calculate the Silhouette Width (SW) corresponding to the 'optimal' number of clusters. To do so, I prepared the following function, which I would like to apply to every group:
library(cluster)
library(fpc)
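ASW <- function(x) {
  x <- as.data.frame(x)
  id <- as.integer(x[1, 1])   # group id is in the first column
  people <- nrow(x)           # number of individuals in the group
  if (people == 1) {
    p <- 0                    # SW is not defined for a single individual
  } else {
    x <- x[, -1]              # drop the group column before clustering
    diss <- daisy(x, metric = "gower")
    # cap the number of clusters at one-third of the group size (at least 2)
    if (people / 3 < 2) {
      maxclus <- 2
    } else {
      maxclus <- round(people / 3)
    }
    asw <- numeric(maxclus)   # asw[1] stays 0; asw[k] holds the width for k clusters
    for (k in 2:maxclus) asw[[k]] <- pam(diss, k, diss = TRUE)$silinfo$avg.width
    k.best <- which.max(asw)
    p <- asw[k.best]
  }
  # return the group id and the best average silhouette width
  swg <- numeric(2)
  swg[1] <- id
  swg[2] <- p
  swg
}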
As a final output, I would like ASW to produce a data.frame with the group number (id) in the first column and the Silhouette Width corresponding to the optimal number of clusters in the second. If the group contains only one individual, I would like the Silhouette Width to be 0, since SW is not defined for fewer than 2 clusters.
Using all variables except for group, I would like to compute a dissimilarity matrix using daisy from the cluster package. To my knowledge, daisy is the only function capable of computing a dissimilarity matrix from mixed variables. Next, I would pass that dissimilarity matrix to pam and calculate the Silhouette Width for various cluster configurations. To shorten the computing time, especially with large groups, I am imposing a maximum number of clusters equal to one-third of the number of individuals in the group.
At this point, I would like the function to take the SW value corresponding to the optimal number of clusters (determined by looking at the maximum Silhouette Width value) and store it, together with the corresponding group id, in a data.frame, here called aswout.
Unfortunately, the function does not seem to work properly (I tried it on the first group only), and it is not clear to me how to get it to 'cycle' over all the groups.
I hope the question is clear. Let me know if there is something you don't understand and I will add more information. I am really thankful for any help on this!
All the best, Riccardo
EDIT:
The ASW function now works. I am trying to make it cycle over all groups in a data frame. I learned from another post that it is bad practice to grow a data.frame inside a function as the function executes. This, however, was the purpose of my aswout data.frame. I am now looking for a way to achieve the same result, with the function looping over the groups and giving me an output data.frame, without including the data.frame within the function.
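For illustration, one possible shape of such a loop is sketched below. This is an assumption-laden sketch, not a tested solution for the data above: `toy` and the helper `best_asw` are hypothetical stand-ins, and the helper returns 0 for groups with fewer than 3 individuals because pam needs fewer clusters than observations. The idea is to split the rows by group, compute one value per piece, and build the output data.frame once at the end.

```r
library(cluster)  # provides daisy() and pam()

# Hypothetical toy data: three groups, one of which has a single individual
set.seed(1)
toy <- data.frame(
  group     = rep(1:3, times = c(12, 1, 9)),
  gender    = factor(sample(0:1, 22, replace = TRUE)),
  yearbirth = sample(1950:1986, 22, replace = TRUE)
)

# Hypothetical helper: best average silhouette width for one group's rows
best_asw <- function(x) {
  n <- nrow(x)
  if (n < 3) return(0)                  # k = 2 clusters requires at least 3 rows
  diss <- daisy(x, metric = "gower")
  ks <- 2:max(2, round(n / 3))          # cap at one-third of the group size
  max(sapply(ks, function(k) pam(diss, k, diss = TRUE)$silinfo$avg.width))
}

# One data.frame per group, without the group column itself
pieces <- split(toy[, -1], toy$group)

# Assemble the result in one step instead of growing it inside a loop
aswout <- data.frame(
  group = as.integer(names(pieces)),
  sw    = vapply(pieces, best_asw, numeric(1), USE.NAMES = FALSE)
)
aswout
```

Because the helper returns a single number, the output data.frame is created only once, outside the function.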