6

I've been struggling with fitting a distribution to sample data I have in R. I've looked at using the fitdist as well as fitdistr functions, but I seem to be running into problems with both.

A quick background; the output of my code should be the most fitting distribution (from a list of distributions) to the data provided, with parameters. This needs to happen without human interaction, so comparing graphs is not an option. I was thinking that I could fit each distribution to the data, draw the p-value from the chi-squared test and find the distribution with the highest p-value. I've gotten some success in a normal distribution to the sample data, but as soon as I try to fit something more complex (a gamma distribution, as seen in the code), I get all kinds of errors. What am I doing wrong?

library(fitdistrplus) 
require(MASS) 
set.seed(1) 
testData <- rnorm(1000) 
distlist <- c("norm","unif","exp")

(z <- fitdist(testData,"gamma",start=list(rate=0.1),fix.arg=list(shape=4)))

Examples of errors I get are:

[1] "Error in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data, : \n initial value in 'vmmin' is not finite\n" attr(,"class")

and

Error in fitdist(testData, "gamma", start = list(rate = 0.1), fix.arg = list(shape = 4)) : the function mle failed to estimate the parameters, with the error code 100

I know I'm probably implementing the fitdist function incorrectly, but I can't seem to find simple examples I can adapt to achieve my code objectives. Can anyone help?

Martin
  • 277
  • 2
  • 5
  • 17
  • 4
    the error message says it all: the loglikelihood is not finite at the initial value. The gamma distribution has positive support while the sample certainly has negative values, thus the loglikelihood is infinite. – rozsasarpi Apr 24 '15 at 19:40
  • Hm. Never even considered this; you are right. I'll try to put in some controls on the sample data to only include positive data. Thanks for the feedback, man. – Martin Apr 24 '15 at 19:43
  • 3
    closely related: http://stats.stackexchange.com/questions/30491/automatically-determine-probability-distribution-given-a-data-set , http://stackoverflow.com/questions/2661402/given-a-set-of-random-numbers-drawn-from-a-continuous-univariate-distribution-f – Ben Bolker Apr 24 '15 at 19:50
  • 2
    Additionally, I wouldn't recommend to use p-values for model selection, they do not express the probability that the observations are generated by a particular model. [Akaike information criterion](http://en.wikipedia.org/wiki/Akaike_information_criterion) would be a simple, easy to calculate alternative. – rozsasarpi Apr 24 '15 at 19:52
  • @Arpi, thank you very much for the suggestion. I'll read up on the technique and see if it works better. Any help or suggestions are highly valued, so I really appreciate this. – Martin Apr 27 '15 at 09:36
  • @Ben-Bolker thanks very much for the links; it was tricky finding relevant articles on what I wanted to do! – Martin Apr 27 '15 at 09:36

2 Answers2

8

You are looking for the Kolmogorov-Smirnov test. Null hypothesis is that the data sample is from the hypothesised distribution.

fitData <- function(data, fit="gamma", sample=0.5){
 distrib = list()
 numfit <- length(fit)
 results = matrix(0, ncol=5, nrow=numfit)

 for(i in 1:numfit){
if((fit[i] == "gamma") | 
     (fit[i] == "poisson") | 
     (fit[i] == "weibull") | 
     (fit[i] == "exponential") |
     (fit[i] == "logistic") |
     (fit[i] == "normal") | 
     (fit[i] == "geometric")
) 
  distrib[[i]] = fit[i]
else stop("Provide a valid distribution to fit data" )
 }

 # take a sample of dataset
 n = round(length(data)*sample)
 data = sample(data, size=n, replace=F)

 for(i in 1:numfit) {
  if(distrib[[i]] == "gamma") {
  gf_shape = "gamma"
  fd_g <- fitdistr(data, "gamma")
  est_shape = fd_g$estimate[[1]]
  est_rate = fd_g$estimate[[2]]

  ks = ks.test(data, "pgamma", shape=est_shape, rate=est_rate)

  # add to results
  results[i,] = c(gf_shape, est_shape, est_rate, ks$statistic, ks$p.value)
}

else if(distrib[[i]] == "poisson"){
  gf_shape = "poisson"
  fd_p <- fitdistr(data, "poisson")
  est_lambda = fd_p$estimate[[1]]

  ks = ks.test(data, "ppois", lambda=est_lambda)
  # add to results
  results[i,] = c(gf_shape, est_lambda, "NA", ks$statistic, ks$p.value)
}

else if(distrib[[i]] == "weibull"){
  gf_shape = "weibull"
  fd_w <- fitdistr(data,densfun=dweibull,start=list(scale=1,shape=2))
  est_shape = fd_w$estimate[[1]]
  est_scale = fd_w$estimate[[2]]

  ks = ks.test(data, "pweibull", shape=est_shape, scale=est_scale)
  # add to results
  results[i,] = c(gf_shape, est_shape, est_scale, ks$statistic, ks$p.value) 
}

else if(distrib[[i]] == "normal"){
  gf_shape = "normal"
  fd_n <- fitdistr(data, "normal")
  est_mean = fd_n$estimate[[1]]
  est_sd = fd_n$estimate[[2]]

  ks = ks.test(data, "pnorm", mean=est_mean, sd=est_sd)
  # add to results
  results[i,] = c(gf_shape, est_mean, est_sd, ks$statistic, ks$p.value)
}

else if(distrib[[i]] == "exponential"){
  gf_shape = "exponential"
  fd_e <- fitdistr(data, "exponential")
  est_rate = fd_e$estimate[[1]]
  ks = ks.test(data, "pexp", rate=est_rate)
  # add to results
  results[i,] = c(gf_shape, est_rate, "NA", ks$statistic, ks$p.value)
}

else if(distrib[[i]] == "logistic"){
  gf_shape = "logistic"
  fd_l <- fitdistr(data, "logistic")
  est_location = fd_l$estimate[[1]]
  est_scale = fd_l$estimate[[2]]
  ks = ks.test(data, "plogis", location=est_location, scale=est_scale)
  # add to results
  results[i,] = c(gf_shape, est_location, est_scale, ks$statistic,    ks$p.value) 
    }
  }
  results = rbind(c("distribution", "param1", "param2", "ks stat", "ks    pvalue"),   results)
  #print(results)
  return(results)
  }

Applied to your example:

library(MASS)
set.seed(1) 
testData <- rnorm(1000) 
res = fitData(testData, fit=c("logistic","normal","exponential","poisson"),
    sample=1)
res

You do not reject the null hypothesis for the Normal.

Reference: https://web.archive.org/web/20150407031710/http://worldofpiggy.com:80/2014/02/25/automatic-distribution-fitting-r/

pixel
  • 103
  • 3
TinaW
  • 989
  • 1
  • 18
  • 28
  • Oh wow. Thank you so much for this, it explains exactly what I need. I assume I can expand this function to include other distributions as well? Apologies if my question sounds stupid, my knowledge of R is still growing. – Martin Apr 27 '15 at 09:32
  • 1
    Yes, you can extend it to other distributions, but your knowledge about the nature of the data should limit your options to a few distributions. – TinaW Apr 27 '15 at 11:00
  • 1
    KS test is inappropriate for fitted distributions, i.e. where the parameters of the distribution are estimated from the data. In such cases the KS statistic is no lodger distribution-free and the critical values are inaccurate. There are some techniques to modify the statistic in such a way that it does become distribution-free even cases where parameters are estimated, but it is non-trivial (see, for example, [Khmaladze Transform](https://en.wikipedia.org/wiki/Khmaladze_transformation)). – Confounded May 13 '18 at 23:06
  • 1
    The p-values of a Kolmovorov-Smirnov-Test (KS-Test) with estimated parameters will be quite wrong. So unfortunately, you can't just fit a distribution and then use the estimated parameters in a Kolmogorov-Smirnov-Test to test your sample. Look at this post https://stats.stackexchange.com/questions/132652/how-to-determine-which-distribution-fits-my-data-best – Kevin Zhu Jan 21 '20 at 07:27
1

I consider the error is mainly because of your data. As seen in the error message, NaN is created so that the function seems to fail to obtain the score (by differentiating the density function). [range of density function is non-negative, isn't?]

Method of moments, which is simpler, is used instead of maximum likelihood estimation and it produces parameter estimates in spite of a warning.

library(fitdistrplus) 
require(MASS) 
set.seed(1) 
testData <- rnorm(1000) 
fitdist(testData, "gamma", method = "mme", start = list(shape = 0.1, rate = 0.1))

Fitting of the distribution ' gamma ' by matching moments 
Parameters:
           estimate
shape  0.0001268054
rate  -0.0108863200
Warning message:
In dgamma(c(-0.626453810742332, 0.183643324222082, -0.835628612410047,  :
  NaNs produced
Jaehyeon Kim
  • 1,328
  • 11
  • 16
  • Thank you so much for the detailed answer. I actually never considered using MME, which is the superior fit for this problem as you noted. I assume I can simply draw the chi-squared p-value from this using something like gofstat? – Martin Apr 27 '15 at 09:34
  • 1
    MME just uses moments to fit distribution while MLE uses more information by fitting likelihood function and, I guess, it is why the former at least returns an outcome. | The domain of the gamma distribution is [0, infinity) while it is (-infinity, infinity) for the normal distribution so that negative realizations of the random sample would cause a problem. Therefore, if values are strictly controlled, I guess `fitdist()` should return an outcome. | The second way may also be used as a way of non-parametric estimation. – Jaehyeon Kim Apr 27 '15 at 09:54