
I am trying to fit a lognormal distribution using Scipy. I have already done this in Matlab, but because I need to extend the application beyond statistical analysis, I am trying to reproduce the fitted values in Scipy.

Below is the Matlab code I used to fit my data:

% Read input data (one value per line)
x = [];
fid = fopen(file_path, 'r'); % 'r' (read) is the default mode for fopen
disp('Reading network degree data...');
if fid == -1
    disp('[ERROR] Unable to open data file.');
else
    while ~feof(fid)
        x = [x fscanf(fid, '%f', 1)];
    end
    c = fclose(fid);
    if c == 0
        disp('File closed successfully.');
    else
        disp('[ERROR] There was a problem with closing the file.');
    end
end

[f,xx] = ecdf(x);
y = 1-f;

parmhat  = lognfit(x); % MLE estimate
mu = parmhat(1);
sigma = parmhat(2);

And here's the fitted plot:

[figure: lognormal fit of the CCDF, produced by the Matlab code]

Now here's my Python code with the aim of achieving the same:

import math
from scipy import stats
from statsmodels.distributions.empirical_distribution import ECDF 

# The same input is read as a list in Python
ecdf_func = ECDF(degrees)
x = ecdf_func.x
ccdf = 1-ecdf_func.y

# Fit data (fix location at 0, matching Matlab's two-parameter lognormal)
shape, loc, scale = stats.lognorm.fit(degrees, floc=0)

# Parameters
sigma = shape          # sigma of the underlying normal distribution
mu = math.log(scale)   # mu, the mean of the log of the data

# Survival function (CCDF) of the fitted distribution;
# note that sf takes loc=, not floc= (floc is a fit-only keyword)
fit_ccdf = stats.lognorm.sf(x, sigma, loc=0, scale=scale)
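For reference, scipy's lognorm maps to the usual (mu, sigma) parameterization via shape = sigma and scale = exp(mu) when loc is 0. A minimal check of that mapping (the sample values below are illustrative, not the question's data):

```python
import numpy as np
from scipy import stats

# scipy's lognorm uses shape = sigma and scale = exp(mu) (with loc = 0);
# verify by comparing its pdf to the textbook lognormal density.
mu, sigma = 1.62, 1.29
x = np.array([0.5, 1.0, 2.0, 5.0, 10.0])

pdf_scipy = stats.lognorm.pdf(x, s=sigma, loc=0, scale=np.exp(mu))
pdf_manual = (np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2))
              / (x * sigma * np.sqrt(2 * np.pi)))

print(np.allclose(pdf_scipy, pdf_manual))
```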

Here's the fit using the Python code.

[figure: lognormal fit of the CCDF, produced by the Python code]

As you can see, both sets of code are capable of producing good fits, at least visually speaking.

Problem is that there is a huge difference in the estimated parameters mu and sigma.

From Matlab: mu = 1.62, sigma = 1.29. From Python: mu = 2.78, sigma = 1.74.

Why is there such a difference?

Note: I have double checked that both sets of data fitted are exactly the same. Same number of points, same distribution.

Your help is much appreciated! Thanks in advance.

Other info:

import scipy
import numpy
import statsmodels

scipy.__version__        # '0.9.0'
numpy.__version__        # '1.6.1'
statsmodels.__version__  # '0.5.0.dev-1bbd4ca'

Version of Matlab is R2011b.

Edit:

As demonstrated in the answer below, the fault lies with Scipy 0.9. I am able to reproduce the mu and sigma results from Matlab using Scipy 0.11.

An easy way to update your Scipy is:

pip install --upgrade scipy

If you don't have pip (you should!):

sudo apt-get install python-pip
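After installing or upgrading, a quick sanity check that the installed version is at least 0.11 (where the fit bug is fixed):

```python
import scipy

# parse the first two components of the version string, e.g. '0.11.0' -> (0, 11)
major, minor = (int(p) for p in scipy.__version__.split('.')[:2])
print(scipy.__version__, (major, minor) >= (0, 11))
```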
    Looking at the two sets of datapoints, they look rather different (compare, for example, the positions of the blue circles in the bottom-right corner). If the data isn't identical, there's no reason to think that the fits would be. – NPE Mar 26 '13 at 06:29
  • Both sets of data are *exactly* the same. I've checked it thoroughly to make sure that's not the case. The plots appear slightly different because the code I used to plot in Matlab is non-library code. Anyway, point is that the data fitted are exactly the same and so they should yield the same mean and standard deviation values. – Mike Mar 26 '13 at 06:40
  • I am sorry, but I don't buy this (unless the plots are way off). Just visually compare the abscissa of the rightmost points on both graph to see that they are *very* different. If you are positive that the data is the same, please include it it your question together with the code you use to read it into both Python and MATLAB. – NPE Mar 26 '13 at 06:44
  • I used Clauset's plplot library to plot the graph in Matlab. Code: http://tuvalu.santafe.edu/~aaronc/powerlaws/. The way the code plots is slightly different from the library function. I can guarantee both sets of data are *exactly* the same. – Mike Mar 26 '13 at 06:48
  • Ok, just found out I used the wrong plot from Matlab!! I've edited accordingly. Regardless, the data fitted are the same. I've double checked the input distribution are identical before fitting. – Mike Mar 26 '13 at 06:56
  • 1
    Hmm looking at the plots it seems that the matlab estimate is nearly always smaller than your data while the python one seems kinda balanced. Suppose matlab does something more complicated (also note the absence of sigma in the fit command). – bdecaf Mar 26 '13 at 08:02

1 Answer


There is a bug in the fit method in scipy 0.9.0 that has been fixed in later versions of scipy.

The output of the script below should be:

Explicit formula:   mu = 4.99203450, sig = 0.81691086
Fit log(x) to norm: mu = 4.99203450, sig = 0.81691086
Fit x to lognorm:   mu = 4.99203468, sig = 0.81691081

but with scipy 0.9.0, it is

Explicit formula:   mu = 4.99203450, sig = 0.81691086
Fit log(x) to norm: mu = 4.99203450, sig = 0.81691086
Fit x to lognorm:   mu = 4.23197270, sig = 1.11581240

The following test script shows three ways to get the same results:

import numpy as np
from scipy import stats


def lognfit(x, ddof=0):
    """Return (mu, sigma) of log(x); ddof=0 gives the maximum likelihood estimate."""
    x = np.asarray(x)
    logx = np.log(x)
    mu = logx.mean()
    sig = logx.std(ddof=ddof)
    return mu, sig


# A simple data set for easy reproducibility
x = np.array([50., 50, 100, 200, 200, 300, 500])

# Explicit formula
my_mu, my_sig = lognfit(x)

# Fit a normal distribution to log(x)
norm_mu, norm_sig = stats.norm.fit(np.log(x))

# Fit the lognormal distribution
lognorm_sig, _, lognorm_expmu = stats.lognorm.fit(x, floc=0)

print("Explicit formula:   mu = %10.8f, sig = %10.8f" % (my_mu, my_sig))
print("Fit log(x) to norm: mu = %10.8f, sig = %10.8f" % (norm_mu, norm_sig))
print("Fit x to lognorm:   mu = %10.8f, sig = %10.8f" % (np.log(lognorm_expmu), lognorm_sig))

Using ddof=1 in the standard deviation calculation gives the unbiased variance estimate:

In [104]: x
Out[104]: array([  50.,   50.,  100.,  200.,  200.,  300.,  500.])

In [105]: lognfit(x, ddof=1)
Out[105]: (4.9920345004312647, 0.88236457185021866)

There is a note in Matlab's lognfit documentation saying that when censoring is not used, lognfit computes sigma using the square root of the unbiased estimator of the variance. This corresponds to ddof=1 in the above code.
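That difference can be checked directly on the same seven-point data set: ddof=0 reproduces the sigma that lognorm.fit returns, while ddof=1 reproduces the 0.8824 that Matlab's lognfit reports.

```python
import numpy as np

x = np.array([50., 50, 100, 200, 200, 300, 500])
logx = np.log(x)

mu = logx.mean()                  # 4.9920..., same under either convention
sig_mle = logx.std(ddof=0)       # 0.8169..., the MLE (matches lognorm.fit)
sig_unbiased = logx.std(ddof=1)  # 0.8824..., matches Matlab's lognfit

print("mu = %.4f, sig (ddof=0) = %.4f, sig (ddof=1) = %.4f"
      % (mu, sig_mle, sig_unbiased))
```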

Warren Weckesser
  • My MATLAB gives `4.9920 0.8824` on this dataset. – NPE Mar 26 '13 at 09:24
  • @NPE: thanks; I added some additional comments that appear to explain the difference. – Warren Weckesser Mar 26 '13 at 09:35
  • Not 100% sure if this would explain the difference in the OP's case (given the number of observations), but nice investigation nonetheless. (+1) – NPE Mar 26 '13 at 09:36
  • Right: the update with ddof explains your `0.8824`, but I don't know yet if any of this helps the OP. :) – Warren Weckesser Mar 26 '13 at 09:54
  • Hi all, thank you so much for all your responses so far. This test is particularly intriguing and hopefully will bring out the underlying issue here. I've tried following the test code but for some reason, my Matlab (R2011b) is giving me an error on the ddof flag: "Error: The expression to the left of the equals sign is not a valid target for an assignment." – Mike Mar 26 '13 at 12:50
  • Ok, I just ran lognfit on Matlab without the ddof flag and it's given me [4.9920 0.8824]. I executed the same Python code (copied & pasted) and for "Fix x to lognorm", I got [4.2319720, 1.11581240]. Other results are same. – Mike Mar 26 '13 at 13:33
  • The `ddof` argument is only for the python function that I wrote, not for matlab. And it would only make a noticeable difference with a small number of data points. – Warren Weckesser Mar 26 '13 at 14:12
  • O ok. Apologies for the misunderstanding. In that case, it seems to me that Scipy is returning the wrong values? What version of Scipy are you running? – Mike Mar 26 '13 at 14:20
  • I just updated my answer with a note about a bug in scipy 0.9.0, which I see is the version you are using. If you can't upgrade scipy, the simple workaround is to not use lognorm.fit; either of the alternatives I show in my code should work fine. – Warren Weckesser Mar 26 '13 at 14:21
  • That's brilliant! We've actually identified a bug! – Mike Mar 26 '13 at 14:23
  • Ok I just updated my Scipy to version 11.0 and re-ran my experiment. I can now confirm that the fault is with Scipy 0.9!! Brilliant work Warren! – Mike Mar 26 '13 at 14:38