
This question is a continuation of the same thread here. Below is a minimal working example taken from this book:

Wehrens, R. Chemometrics with R: Multivariate Data Analysis in the Natural Sciences and Life Sciences. 1st edition. Heidelberg; New York: Springer, 2011 (page 250).

The example comes from that book and its companion package ChemometricsWithR, and it highlights some pitfalls of modeling with cross-validation techniques.

The Aim:
A cross-validated methodology, using the same set of repeated CV folds, for the well-known strategy of PLS followed by a classifier, typically LDA or cousins such as logistic regression, SVM, C5.0, or CART, in the spirit of the caret package. So PLS has to be refit in every resample before the downstream classifier is called, so that the classifier works on the PLS score space rather than on the observations themselves. The nearest built-in approach in the caret package is PCA as a pre-processing step before modeling with any classifier. Below is a PLS-LDA procedure with only a single train/test split to assess the classifier; there is no 10-fold CV or any repetition yet. The code was taken from the book mentioned above, with some corrections (otherwise it throws an error):

library(ChemometricsWithR)
library(pls)   # plsr(); loaded explicitly in case ChemometricsWithR does not attach it
library(MASS)  # lda()
data(prostate)
prostate.clmat <- classvec2classmat(prostate.type) # convert Y to a dummy var

odd <- seq(1, length(prostate.type), by = 2) # training
even <- seq(2, length(prostate.type), by = 2) # holdout test

prostate.pls <- plsr(prostate.clmat ~ prostate, ncomp = 16, validation = "CV", subset=odd)

Xtst <- scale(prostate[even,], center = colMeans(prostate[odd,]), scale = apply(prostate[odd,],2,sd))

tst.scores <- Xtst %*% prostate.pls$projection # test-set scores for the trained LDA

prostate.ldapls <- lda(scores(prostate.pls)[,1:16],prostate.type[odd]) # LDA for scores
table(predict(prostate.ldapls, new = tst.scores[,1:16])$class, prostate.type[even])

predictionTest <- predict(prostate.ldapls, new = tst.scores[,1:16])$class

library(caret)    
confusionMatrix(data = predictionTest, reference= prostate.type[even]) # from caret

Output:

Confusion Matrix and Statistics

          Reference
Prediction bph control pca
   bph       4       1   9
   control   1      35   7
   pca      34       4  68

Overall Statistics

               Accuracy : 0.6564          
                 95% CI : (0.5781, 0.7289)
    No Information Rate : 0.5153          
    P-Value [Acc > NIR] : 0.0001874       

                  Kappa : 0.4072          
 Mcnemar's Test P-Value : 0.0015385       

Statistics by Class:

                     Class: bph Class: control Class: pca
Sensitivity             0.10256         0.8750     0.8095
Specificity             0.91935         0.9350     0.5190
Pos Pred Value          0.28571         0.8140     0.6415
Neg Pred Value          0.76510         0.9583     0.7193
Prevalence              0.23926         0.2454     0.5153
Detection Rate          0.02454         0.2147     0.4172
Detection Prevalence    0.08589         0.2638     0.6503
Balanced Accuracy       0.51096         0.9050     0.6643

However, the confusion matrix did not match the one in the book; in any case, the code as printed in the book breaks, but the version above worked for me.

Notes:
Although this is only a single split, the intention is to agree on the methodology first: the mean and sd of the training set are applied to the test set, which is then transformed into PLS scores based on a specific number of components (ncomp). I want this to happen in every round of the CV in caret. If the methodology as coded here is correct, then it can serve, maybe, as a good starting point for a minimal working example while modifying the code of the caret package.
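For orientation (my addition, not part of the original post): the nearest built-in route mentioned in the aim above, unsupervised PCA pre-processing re-estimated inside every resample, is already available in caret. The sketch below only assumes the standard train()/preProcess interface and the prostate data used above; the open question is how to get the same per-resample behaviour with supervised PLS scores instead:

library(caret)
library(ChemometricsWithR)
data(prostate)

## PCA is estimated on each resample's training portion and applied to its
## hold-out; by default enough components are kept to explain 95% of the
## variance. (This may be slow on these data.)
set.seed(10)
pcaLdaFit <- train(prostate, prostate.type,
                   method = "lda",
                   preProc = c("center", "scale", "pca"),
                   trControl = trainControl(method = "repeatedcv",
                                            number = 10, repeats = 3))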

Side Notes:
It can get very messy with scaling and centering. I think some of the PLS functions in R scale internally, with or without centering, I am not sure, so building a custom model in caret should be handled with care to avoid either missing or duplicated scaling/centering (I am on my guard with these things).
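A quick way to settle the "I am not sure" part is to look at what pls actually stores. This is a small check I am adding (it assumes the pls package with its documented mvr component Xmeans):

library(pls)

## plsr()/mvr() center X by default and only scale when scale = TRUE; the
## centering vector is kept in $Xmeans. As noted in the answers below, lda()
## also centers in its own way. So if caret's preProc centers/scales as well,
## the same operation can silently happen twice.
X <- as.matrix(iris[, 1:4])
Y <- model.matrix(~ Species - 1, data = iris)    # dummy-coded outcome
fit <- plsr(Y ~ X, ncomp = 2)                    # defaults: centered, not scaled
all.equal(unname(fit$Xmeans), unname(colMeans(X)))  # TRUE: centered with the column means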

Perils of multiple centering/scaling
The code below just shows how multiple centering/scaling can change the data; only centering is shown here, but the same problem applies to scaling too.

set.seed(1)
x <- rnorm(200, 2, 1)
xCentered1 <- scale(x, center=TRUE, scale=FALSE)
xCentered2 <- scale(xCentered1, center=TRUE, scale=FALSE)
xCentered3 <- scale(xCentered2, center=TRUE, scale=FALSE)
sapply(list(xNotCentered = x, xCentered1 = xCentered1, xCentered2 = xCentered2, xCentered3 = xCentered3), mean)

Output:

xNotCentered    xCentered1    xCentered2    xCentered3 
 2.035540e+00  1.897798e-16 -5.603699e-18 -5.332377e-18
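
For completeness, here is the analogous sketch for scaling (my addition, in the same spirit as the centering demo above): the first scale() call already changes the spread of the data, and later calls then look like no-ops, so it is easy not to notice that two different functions each applied their own scaling.

set.seed(1)
x <- rnorm(200, 2, 3)
xScaled1 <- scale(x, center = TRUE, scale = TRUE)
xScaled2 <- scale(xScaled1, center = TRUE, scale = TRUE)
# the original sd (about 3) is gone after the first call; the second call changes nothing
sapply(list(xNotScaled = x, xScaled1 = xScaled1, xScaled2 = xScaled2), sd)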

Please drop a comment if I am missing something somewhere along the way. Thanks.

doctorate
  • I don't think caret yet supports custom methods for preProcess. However, you could build a custom model flow that incorporates the pre-processing: http://caret.r-forge.r-project.org/custom_models.html – Zach Jan 12 '14 at 03:47
  • thanks. I looked into the custom model; the complication is that I could not find where in the code to tell `train` that the scores of the training CV fold are needed. Typically, I would first try PLS-LDA; if it works, then do the same for other classifiers. It is like a prototype model. So can you provide the code for how to customize PLS-LDA first? – doctorate Jan 13 '14 at 10:32
  • I voted to move this to stackoverflow, where I think it is more appropriate. – cbeleites unhappy with SX Jan 13 '14 at 12:51
  • @doctorate You would define a custom model that would, given a dataset, fit a pls-lda model. Then you would write a predict function for your model that would, given a model fit and a test set, make predictions. Then you give these functions to caret as a custom method, and caret will handle passing the correct data to each one for each resample of your dataset. – Zach Jan 13 '14 at 14:39
  • @Zach: the package I mentioned in the old thread has the model fitting `plslda` as well as `predict.plslda` (and a few more functions like `coef` and for some post-processing). No support for `caret` so far, though. – cbeleites unhappy with SX Jan 13 '14 at 18:28
  • @Zach, can you please have a look at the PLS-LDA custom model below and where it fails? Maybe you have a better way to benchmark the results with another method of your liking (I suggested the iris data set, but I don't have independent code for a PLS-LDA model). That would be nice. – doctorate Jan 14 '14 at 11:28

4 Answers


If you want to fit these types of models with caret, you would need to use the latest version on CRAN. The last update was created so that people can use non-standard models as they see fit.

My approach below is to jointly fit the PLS and the other model (I used random forest in the example below) and tune them at the same time. So for each fold, a 2D grid of ncomp and mtry is used.

The "trick" is to attached the PLS loadings to the random forest object so that they can be used during prediction time. Here is the code that defines the model (classification only):

modelInfo <- list(label = "PLS-RF",
                  library = c("pls", "randomForest"),
                  type = "Classification",
                  parameters = data.frame(parameter = c("ncomp", "mtry"),
                                          class = c("numeric", "numeric"),
                                          label = c("#Components",
                                                    "#Randomly Selected Predictors")),
                  grid = function(x, y, len = NULL) {
                    grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1),
                                        mtry = 1:len)
                    grid <- subset(grid, mtry <= ncomp)
                  },
                  loop = NULL,
                  fit = function(x, y, wts, param, lev, last, classProbs, ...) {
                    ## First fit the pls model, generate the training set scores,
                    ## then attach what is needed to the random forest object to
                    ## be used later
                    pre <- plsda(x, y, ncomp = param$ncomp)
                    scores <- pls:::predict.mvr(pre, x, type = "scores")
                    mod <- randomForest(scores, y, mtry = param$mtry, ...)
                    mod$projection <- pre$projection
                    mod
                  },
                  predict = function(modelFit, newdata, submodels = NULL) {
                    scores <- as.matrix(newdata) %*% modelFit$projection
                    predict(modelFit, scores)
                  },
                  prob = NULL,
                  varImp = NULL,
                  predictors = function(x, ...) rownames(x$projection),
                  levels = function(x) x$obsLevels,
                  sort = function(x) x[order(x[, 1]), ])

and here is the call to train:

 library(ChemometricsWithR)
 data(prostate)

 set.seed(1)
 inTrain <- createDataPartition(prostate.type, p = .90)
 trainX <-prostate[inTrain[[1]], ]
 trainY <- prostate.type[inTrain[[1]]]
 testX <-prostate[-inTrain[[1]], ]
 testY <- prostate.type[-inTrain[[1]]]

 ## These will take a while for these data
 set.seed(2)
 plsrf <- train(trainX, trainY, method = modelInfo,
                preProc = c("center", "scale"),
                tuneLength = 10,
                trControl = trainControl(method = "repeatedcv",
                                         repeats = 5))

 ## How does random forest do on its own?
 set.seed(2)
 rfOnly <- train(trainX, trainY, method = "rf",
                tuneLength = 10,
                trControl = trainControl(method = "repeatedcv",
                                         repeats = 5))

Just for kicks, I got:

 > getTrainPerf(plsrf)
   TrainAccuracy TrainKappa method
 1     0.7940423    0.65879 custom
 > getTrainPerf(rfOnly)
   TrainAccuracy TrainKappa method
 1     0.7794082  0.6205322     rf

and

 > postResample(predict(plsrf, testX), testY)
  Accuracy     Kappa 
 0.7741935 0.6226087 
 > postResample(predict(rfOnly, testX), testY)
  Accuracy     Kappa 
 0.9032258 0.8353982 

Max

topepo
  • thanks a lot, one thing: I compared the above with the code in the `caret` subdirectory `models` for `method="rf"` and found predict with type = "prob", so should the above code be `predict(modelFit, scores, type="prob")`? – doctorate Jan 14 '14 at 10:33
  • can you please check the code below for PLS-LDA? My guess is that it is not right. Why? Because I tried the Sonar data from the caret vignette, once with the built-in method="pls" and once with the custom PLS-LDA below, and the results were exactly the same down to the last digit, which cannot be; so the code is not having the desired effect for some reason. Please kindly revise. – doctorate Jan 14 '14 at 14:26
  • Your code looks fine. I would guess that the results are the same (in this case) because my `plsda` function can produce class probabilities using Bayes' Rule (using the `NaiveBayes` function in the klaR package). With two classes, it is possible LDA and naive Bayes will produce the same results (if you assume normality when computing the naive Bayes model). – topepo Jan 14 '14 at 16:38
  • most probably you were right about the two-class problem; I posted the iris data, and there was no such issue. Thanks for your support. – doctorate Jan 14 '14 at 17:25
  • @topepo: Can this be also used for model ensembling? e.g. random forest and SVR? Thanks! – jpcgandre Sep 05 '14 at 17:00

Based on Max's valuable comments, I felt the need to bring in iris as a referee: it is famous for classification and, more importantly, its Species outcome has more than two classes, which makes it a good data set for testing the PLS-LDA custom model in caret:

data(iris)
names(iris)
head(iris)
dim(iris) # 150x5
set.seed(1)
inTrain <- createDataPartition(y = iris$Species,
                               ## the outcome data are needed
                               p = .75,
                               ## The percentage of data in the
                               ## training set
                               list = FALSE)
## The format of the results
## The output is a set of integers for the rows of Iris
## that belong in the training set.
training <- iris[ inTrain,] # 114
testing <- iris[-inTrain,] # 36

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     classProbs = TRUE)
set.seed(2)
plsFitIris <- train(Species ~ .,
                   data = training,
                   method = "pls",
                   tuneLength = 4,
                   trControl = ctrl,
                   preProc = c("center", "scale"))
plsFitIris
plot(plsFitIris)


set.seed(2)
plsldaFitIris <- train(Species ~ .,
                      data = training,
                      method = modelInfo,
                      tuneLength = 4,
                      trControl = ctrl,
                      preProc = c("center", "scale"))

plsldaFitIris
plot(plsldaFitIris)  

Now comparing the two models:

getTrainPerf(plsFitIris)
  TrainAccuracy TrainKappa method
1     0.8574242  0.7852462    pls
getTrainPerf(plsldaFitIris)
  TrainAccuracy TrainKappa method
1      0.975303  0.9628179 custom
postResample(predict(plsFitIris, testing), testing$Species)
Accuracy    Kappa 
   0.750    0.625 
postResample(predict(plsldaFitIris, testing), testing$Species)
 Accuracy     Kappa 
0.9444444 0.9166667  

So, finally, there is the EXPECTED difference and an improvement in the metrics. This supports Max's point that in two-class problems, the Bayes' rule probabilities produced by the plsda function and LDA can lead to the same results.

doctorate
  • You need to wrap the CV around both PLS and LDA (see the sketch after this list).
  • Yes, both plsr and lda center the data their own way.
  • I had a closer look at caret::preProcess(): as it is defined now, you will not be able to use PLS as a preprocessing method, because it is supervised but caret::preProcess() uses unsupervised methods only (there is no way to hand over the dependent variable). This would probably make patching rather difficult.

  • So inside the caret framework, you'll need to go for a custom model.
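
To make the first point concrete, here is a minimal sketch of wrapping the CV around both PLS and LDA by hand (added here for illustration, not part of the original answer; it uses iris so it runs quickly). Every fold refits PLS on its training part, and the held-out rows are centered with that fold's means and projected with that fold's loadings only:

library(pls)
library(MASS)

data(iris)
X <- as.matrix(iris[, 1:4])
Y <- iris$Species
k <- 5
ncomp <- 2
set.seed(3)
folds <- sample(rep(1:k, length.out = nrow(X)))
acc <- numeric(k)
for (i in 1:k) {
  tr <- folds != i
  Xtr <- X[tr, ]
  Ytr <- Y[tr]
  Ydummy <- model.matrix(~ Ytr - 1)                 # class indicator matrix
  plsFit <- plsr(Ydummy ~ Xtr, ncomp = ncomp)       # PLS refit inside the fold
  ldaFit <- lda(scores(plsFit)[, 1:ncomp], Ytr)     # LDA on this fold's scores
  Xheld <- scale(X[!tr, ], center = plsFit$Xmeans, scale = FALSE)  # fold's centering
  heldScores <- Xheld %*% plsFit$projection         # project with the fold's loadings
  acc[i] <- mean(predict(ldaFit, heldScores)$class == Y[!tr])
}
mean(acc)   # held-out accuracy with PLS and LDA both re-estimated per fold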

cbeleites unhappy with SX
  • +1 for the note that a custom model is the way to do it. If `lda` and `pls` do their own centering, and caret does centering/scaling again, then we are in trouble (see the perils of multiple centering/scaling above). I wonder if the caret package is aware of this *subtle* issue? I would be very grateful if you could provide the code for the prototype PLS-LDA model. – doctorate Jan 13 '14 at 13:47
  • @doctorate: please send me an email to chemometrie@beleites.de indicating also whether you need a .zip for Win or .tar.gz (or both) as you probably do not want to check out the whole hyperSpec svn repo from r-forge (> 4GB) to get a few kB of code... – cbeleites unhappy with SX Jan 13 '14 at 14:05
  • @doctorate Caret does not automatically center and scale your data, unless you set preProcess=c('center', 'scale'). – Zach Jan 13 '14 at 14:34
  • @Zach, so what should one do normally to avoid multiple preProcess steps? Based on cbeleites, if `lda` does its own centering and I naively pass `preProc = c("center","scale")`, then I am in trouble, right? Otherwise, what would you normally do to avoid this issue of multiple centering and scaling? – doctorate Jan 13 '14 at 14:41
  • @cbeleites, thanks a lot for your generous offer, already sent. – doctorate Jan 13 '14 at 14:46
  • @doctorate: What I do in the package is that I make sure the center that is used before PLS corresponds to the center `MASS::lda` uses in the 2nd step. (Unfortunately, both `MASS::lda` and `pls::plsr` have slightly different centering strategies hardcoded, so I use a patched `plsr` function.) – cbeleites unhappy with SX Jan 13 '14 at 18:26
  • @cbeleites, can you please post a cross-validated iris data set with the PLS-LDA method using your code? Using the same set.seed(xxx) number and the same number of repetitions of the 10-fold CV, say 3 (just to lessen the computational burden); this way one can compare the performance of the custom code from `caret`, see the trial below. Thanks in advance. – doctorate Jan 14 '14 at 11:26

If the scenario is to build a custom model of the PLS-LDA type, according to the code kindly provided by Max (maintainer of caret), something is not correct in this code, but I could not figure out what. I used the same Sonar data set as in the caret vignette and tried to reproduce the result once with method="pls" and once with the custom PLS-LDA model below, and the results were exactly identical, down to the last digit, which is nonsensical. For benchmarking, one needs a known data set (I think a cross-validated PLS-LDA on the iris data set would fit here, as it is famous for this type of analysis and there should be a cross-validated treatment of it somewhere). Everything should be the same (the set.seed(xxx) and the number of K-fold CV repetitions) except the code in question, so as to compare fairly and judge the code below:

modelInfo <- list(label = "PLS-LDA",
                  library = c("pls", "MASS"),
                  type = "Classification",
                  parameters = data.frame(parameter = c("ncomp"),
                                          class = c("numeric"),
                                          label = c("#Components")),
                  grid = function(x, y, len = NULL) {
                    grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1))
                  },
                  loop = NULL,
                  fit = function(x, y, wts, param, lev, last, classProbs, ...) { 
                    ## First fit the pls model, generate the training set scores,
                    ## then attach what is needed to the lda object to 
                    ## be used later
                    pre <- plsda(x, y, ncomp = param$ncomp)
                    scores <- pls:::predict.mvr(pre, x, type = "scores")
                    mod <- lda(scores, y, ...)
                    mod$projection <- pre$projection
                    mod
                  },
                  predict = function(modelFit, newdata, submodels = NULL) {       
                    scores <- as.matrix(newdata)  %*% modelFit$projection
                    predict(modelFit, scores)$class
                  },
                  prob = function(modelFit, newdata, submodels = NULL) {       
                    scores <- as.matrix(newdata)  %*% modelFit$projection
                    predict(modelFit, scores)$posterior
                  },
                  varImp = NULL,
                  predictors = function(x, ...) rownames(x$projection),
                  levels = function(x) x$obsLevels,
                  sort = function(x) x[order(x[,1]),])  

Based on Zach's request, the code below is for method="pls" in caret, exactly the same concrete example as in the caret vignette on CRAN:

library(mlbench) # data set from here
data(Sonar)
dim(Sonar) # 208x60
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class,
                               ## the outcome data are needed
                               p = .75,
                               ## The percentage of data in the
                               ## training set
                               list = FALSE)
## The format of the results
## The output is a set of integers for the rows of Sonar
## that belong in the training set.
training <- Sonar[ inTrain,] #157
testing <- Sonar[-inTrain,] # 51

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(108)
plsFitSon <- train(Class ~ .,
                data = training,
                method = "pls",
                tuneLength = 15,
                trControl = ctrl,
                metric = "ROC",
                preProc = c("center", "scale"))
plsFitSon
plot(plsFitSon) # might be slightly different from the vignette due to randomness

Now, the code below is a pilot run classifying the Sonar data with the custom PLS-LDA model in question; it is expected to produce numbers that differ from the PLS-only results:

set.seed(108)
plsldaFitSon <- train(Class ~ .,
                   data = training,
                   method = modelInfo,
                   tuneLength = 15,
                   trControl = ctrl,
                   metric = "ROC",
                   preProc = c("center", "scale"))  

Now comparing the results between the two models:

getTrainPerf(plsFitSon)
   TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381    pls
getTrainPerf(plsldaFitSon)
   TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381 custom

postResample(predict(plsFitSon, testing), testing$Class)
Accuracy    Kappa 
0.745098 0.491954 
postResample(predict(plsldaFitSon, testing), testing$Class)
Accuracy    Kappa 
0.745098 0.491954 

So, the results are exactly the same, which cannot be right. It is as if the lda model were never added.

doctorate
  • If I copy and paste your `modelInfo` and Max's train/test code, I get different results for method=modelInfo and method='lda'. Please post your code (complete with random seeds) where you get different results. – Zach Jan 14 '14 at 15:50
  • @Zach, I added the Sonar example. But if you try `method="pls"` and then `modelInfo`, which is the custom PLS-LDA, you would get the same results, I guess. – doctorate Jan 14 '14 at 16:36
  • I was able to reproduce the same thing in the sonar data. See my comment above as to why (or at least my guess why). – topepo Jan 14 '14 at 16:39
  • Hi, How can I implement `MASS::polr` ordered logistic regression in caret method? Many thanks – Rafik Margaryan Jun 17 '14 at 20:33
  • For the training, it seems that you modified the train(X,Y) form but you supplied the train(Class ~ ., ...) form. Try extracting the X, Y from training and use the train(X, Y) format. –  Feb 24 '15 at 15:24
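
For reference, here is what the last comment suggests, spelled out as a sketch (my addition; it reuses the `modelInfo` and `ctrl` objects defined above, and I have not verified that it changes the Sonar result):

## train(x, y) form instead of the formula interface, so the custom fit
## function receives the predictors and the outcome directly
trainX <- as.matrix(training[, names(training) != "Class"])
trainY <- training$Class

set.seed(108)
plsldaFitSonXY <- train(x = trainX, y = trainY,
                        method = modelInfo,        # the PLS-LDA list above
                        tuneLength = 15,
                        trControl = ctrl,
                        metric = "ROC",
                        preProc = c("center", "scale"))
getTrainPerf(plsldaFitSonXY)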