
With regard to the linked question, Predicted probabilities in R ranger package, I have a question.

Imagine I have a mixed data frame, df (comprising factor and numeric variables), and I want to do classification using ranger. I split this data frame into train and test sets, Train_Set and Test_Set. BiClass is the factor variable I want to predict, and it comprises two levels, 0 and 1.

I want to calculate class probabilities and attach them to the data frame, using ranger with the following commands:

Biclass.ranger <- ranger(BiClass ~ ., data = Train_Set, num.trees = 500, importance = "impurity", save.memory = TRUE, probability = TRUE)

probabilities <- as.data.frame(predict(Biclass.ranger, data = Test_Set, num.trees = 200, type='response', verbose = TRUE)$predictions)

The data frame probabilities consists of 2 columns (named 0 and 1), with the number of rows equal to the number of rows in Test_Set.

Does it mean that if I append this data frame, probabilities, to Test_Set as the last two columns, it shows the probability of each row being either 0 or 1? Is my understanding correct?
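For example, something like this (just a sketch of what I mean by attaching; prob_0 and prob_1 are names I made up):

colnames(probabilities) <- c("prob_0", "prob_1")       # rename the two probability columns
Test_Set_with_probs <- cbind(Test_Set, probabilities)  # append as the last two columns
head(Test_Set_with_probs)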

My second question: when I attempt to calculate the confusion matrix through

pred = predict(Biclass.ranger, data=Test_Set, num.trees = 500, type='response', verbose = TRUE)
table(Test_Set$BiClass, pred$predictions)

I get the following error: Error in table(Test_Set$BiClass, pred$predictions) : all arguments must have the same length

What am I doing wrong?


1 Answer


For your first question: yes, it shows the probability of each row being 0 or 1. Using the example below:

library(ranger)
# set.seed(1)  # optional: fix the split for reproducibility
idx = sample(nrow(iris), 100)  # random 100-row training split
data = iris
data$Species = factor(ifelse(data$Species == "versicolor", 1, 0))  # binary outcome: versicolor (1) vs rest (0)
Train_Set = data[idx, ]
Test_Set = data[-idx, ]

mdl <- ranger(Species ~ ., data = Train_Set, importance = "impurity", save.memory = TRUE, probability = TRUE)
probabilities <- as.data.frame(predict(mdl, data = Test_Set, type = "response", verbose = TRUE)$predictions)
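As a quick sanity check (optional), the two columns are class probabilities, so each row of probabilities should sum to 1:

head(probabilities)            # one probability column per class, named "0" and "1"
range(rowSums(probabilities))  # rows sum to 1 (up to floating point)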

We can always check whether they agree:

par(mfrow = c(1, 2))  # two panels side by side
boxplot(probabilities[, "0"] ~ Test_Set$Species, ylab = "Prob 0", xlab = "Actual label")
boxplot(probabilities[, "1"] ~ Test_Set$Species, ylab = "Prob 1", xlab = "Actual label")

(Image: boxplots of the predicted probabilities for class 0 and class 1 against the actual label.)

Not the best plot, but if the labels are flipped you will sometimes see something odd here. Next we need to find the column with the maximum probability and assign that label; for this we do:

max.col(probabilities) - 1
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0
[39] 0 0 0 0 0 0 0 0 0 0 0 0

This goes through each row of probabilities and returns 1 or 2, depending on which column has the maximum probability; we simply subtract 1 from it to get 0/1 labels. For the confusion matrix:

caret::confusionMatrix(table(max.col(probabilities) - 1,Test_Set$Species))
Confusion Matrix and Statistics


     0  1
  0 31  2
  1  0 17

               Accuracy : 0.96            
                 95% CI : (0.8629, 0.9951)
    No Information Rate : 0.62            
    P-Value [Acc > NIR] : 2.048e-08 
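As a side note (a small sketch, not strictly needed for 0/1 labels): instead of max.col(probabilities) - 1 you can map back to the class labels via the column names of probabilities, which also works when the levels are not 0 and 1:

pred_labels <- colnames(probabilities)[max.col(probabilities)]  # name of the max-probability column
table(pred_labels, Test_Set$Species)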

In your case, you can just do:

caret::confusionMatrix(table(max.col(probabilities) - 1, Test_Set$BiClass))
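As for the error in your second question: with probability = TRUE, pred$predictions is a matrix with one probability column per class, not a vector of labels, so table() sees arguments of different lengths. A minimal sketch of the fix, derive hard labels first and then tabulate:

pred <- predict(Biclass.ranger, data = Test_Set, type = "response", verbose = TRUE)
dim(pred$predictions)  # nrow(Test_Set) x 2: a probability matrix, not labels

pred_labels <- colnames(pred$predictions)[max.col(pred$predictions)]
table(pred_labels, Test_Set$BiClass)  # predicted first, then reference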
  • Sorry, can you be a bit more specific... If you run ranger with probability=TRUE, you will not get a confusionMatrix directly. If you run it with probability=TRUE and then apply the predicted labels onto caret, you get the same – StupidWolf Apr 29 '20 at 10:36
  • It seems like it's a separate issue from the question you posted; I suggest that if the problem persists you post it as another question, regarding the difference between the caret and ranger confusion matrices, with a reproducible example – StupidWolf Apr 29 '20 at 10:37
  • Sorry, just found that confusionMatrix is from caret package. – Ray Apr 29 '20 at 11:58
  • Since the results of confusionMatrix(table(Test_Set$Species, max.col(probabilities)-1)) and caret::confusionMatrix(table(max.col(probabilities) - 1, Test_Set$Species)) are transposes of each other, which is the correct way to construct the confusion matrix? The roles of sensitivity, specificity, ppv and npv reverse between the two commands. – Ray Apr 29 '20 at 12:02
  • Ok, I see your point now. Sorry, a bit lengthy. It should be predicted first, then reference: https://www.rdocumentation.org/packages/caret/versions/6.0-86/topics/confusionMatrix. See the last examples using table() – StupidWolf Apr 29 '20 at 12:08
  • Sorry, I typed it wrong for you. For your data, do confusionMatrix(table(max.col(probabilities)-1, Test_Set$BiClass)). I corrected it now; sorry again for the confusion – StupidWolf Apr 29 '20 at 12:10
  • Many thanks for your answer. In the meantime, I posted it as a separate question, with one more issue if I change the class, to understand which entries will be tp, tn, fp and fn: https://stackoverflow.com/questions/61501935/construction-of-confusion-matrix – Ray Apr 29 '20 at 12:36
  • Do you know what is going wrong when probability= TRUE in the following link: https://stackoverflow.com/questions/68664858/error-in-calculating-confusion-matrix-or-contigency-table-for-multiclassificatio – Ray Aug 06 '21 at 08:25