I have created a logistic regression model in R to try to predict the outcome of cricket matches. However, my model produces probability values greater than 1. The output is 1.031704. Any tips on how I could improve my model to get an accurate estimation of probability?

set.seed(1)

#Use 70% of dataset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(ODIMT), replace=TRUE, prob=c(0.7,0.3))
train <- ODIMT[sample, ]
test <- ODIMT[!sample, ] 


model <- glm(Result~Target+Opposition+Country, family="binomial", data=ODIMT)

options(scipen=999)

summary(model)

pscl::pR2(model)["McFadden"]
caret::varImp(model)
car::vif(model)

new <- data.frame(Target = 226,Opposition = "v India", Country = "England")
predict(model, new, type="response")

The Result variable is 1 or 0, Target ranges from 0 to 400, and the other two predictors (Opposition and Country) are character variables.

data:

            Country    Target       Result Opposition    Ground
          England         NA          1     v India   Kolkata
          Australia      251          0  v Pakistan   Kolkata
          South Africa   168          0     v India     Delhi
          Bangladesh      NA          1  v Pakistan     Delhi
          England        306          0 v Australia Melbourne
          New Zealand     NA          1 v Sri Lanka Melbourne

Output of summary:

[image: output of summary(model)]

  • That shouldn't happen. Please provide a reproducible example (we are missing the data). (Btw. only three lines of your code seem to be relevant.) – Roland Sep 14 '22 at 10:31
  • This doesn't seem possible. What do you get if you use predict without type = "response"? This should give you the log odds, which you can convert to probability. There is no value that would lead to p > 1. Please include your data so we can try to replicate this. – Allan Cameron Sep 14 '22 at 10:33
  • I have now uploaded the head of the data – user18723720 Sep 14 '22 at 10:52
  • No change, if type = "response" is removed. – user18723720 Sep 14 '22 at 10:53
  • Any chance you could share the output of `summary(model)`. Were there any warnings or messages when running `glm`? – Benjamin Sep 14 '22 at 11:10
  • No warnings. Added output of summary now @Benjamin – user18723720 Sep 14 '22 at 11:18
  • This is curious. Based on your summary, the predicted `y` would be `y <- 3.215183 + -0.020266 * 226 + 0.097631 + 2.298948` which equals 1.031646 (which is relatively close to your result, and might be rounding error). What happens if you run `predict.glm(model, new, type="response")`? I'm wondering if there's a dispatch issue. (not a very good guess, but it's worth looking at) – Benjamin Sep 14 '22 at 11:46
  • predict.glm produces results that make much more sense! – user18723720 Sep 14 '22 at 11:50
  • In that case, I'd be curious to see what `class(model)` returns. If `predict` is returning the log-odds (~1.03) and `predict.glm` is returning the probability (~0.7), that would suggest that the `predict` generic is not seeing `model` as a `glm` object. – Benjamin Sep 14 '22 at 12:04
  • class(model) returns "glm" "lm" – user18723720 Sep 14 '22 at 15:12
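
The arithmetic in the comments can be checked by hand. The sketch below reuses the four coefficients quoted by Benjamin; which of the last two belongs to the "v India" dummy and which to the "England" dummy is an assumption based on the order of terms in the formula, and the variable names are made up for illustration. `plogis()` is the base R inverse logit.

```r
# Coefficients as quoted in the comment; the labels on the two dummies are assumed
intercept <- 3.215183
b_target  <- -0.020266
b_opp     <- 0.097631   # assumed: Opposition "v India"
b_country <- 2.298948   # assumed: Country "England"

# Linear predictor (log-odds) for Target = 226
log_odds <- intercept + b_target * 226 + b_opp + b_country
print(log_odds)   # about 1.0316, close to the 1.031704 in the question

# The inverse logit maps any log-odds value into (0, 1)
prob <- plogis(log_odds)   # same as 1 / (1 + exp(-log_odds))
print(prob)       # about 0.74
```

So a value of about 1.03 is plausible as a log-odds, and it corresponds to a probability of roughly 0.74, in line with the ~0.7 mentioned above.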

1 Answer

I think you are predicting the log-odds value. From the docs:

the type of prediction required. The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = "response" gives the predicted probabilities. The "terms" option returns a matrix giving the fitted values of each term in the model formula on the linear predictor scale.

As noted in the comments, if you use type="response" you get the predicted probabilities.
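
Here is a minimal sketch of the difference between the two scales. It uses simulated data rather than the ODIMT data from the question, so the data frame `sim`, the coefficients 4 and -0.02, and the fitted numbers are all illustrative assumptions.

```r
# Simulated data; NOT the ODIMT data from the question
set.seed(1)
n <- 200
target <- round(runif(n, 100, 400))
# Assumed true relationship, chosen only for illustration
result <- rbinom(n, 1, plogis(4 - 0.02 * target))
sim <- data.frame(Target = target, Result = result)

fit <- glm(Result ~ Target, family = binomial, data = sim)
new <- data.frame(Target = 226)

log_odds <- predict(fit, new)                     # default type = "link": log-odds
prob     <- predict(fit, new, type = "response")  # probability, always in (0, 1)

# Applying the inverse logit to the log-odds reproduces the probability
same <- isTRUE(all.equal(unname(plogis(log_odds)), unname(prob)))
print(same)
```

If a prediction falls outside the 0 to 1 range, it is most likely on the link scale, and `plogis()` converts it to a probability.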

Have a look at this question for more info.

s_pike