
I am getting very different results from Python's statsmodels.formula.api.ols() and R's lm() run on the same data. The R results are about what I expected; the Python results are not. I'm sure there's something really basic I've misunderstood... Any help much appreciated.

Python

import statsmodels.formula.api as smf
import pandas as pd

df = pd.DataFrame({'date': [1.5488064e+18, 1.5043968e+18],
                   'count': [15.0, 12.0]})

fit = smf.ols('count~date', data=df).fit()
new_data = pd.DataFrame({'date': [1.398816e+18, 1.337040e+18]})
new_data['count'] = (fit.predict(new_data))
print(new_data)

results in:

           date      count
0  1.398816e+18  12.387341
1  1.337040e+18  11.840278

R

df <- data.frame(date=c(1.5488064e+18, 1.5043968e+18),
                 count=c(15.0, 12.0))
fit <- lm(count~date, data=df)

new_data <- data.frame(date=c(1.398816e+18, 1.337040e+18))
new_data[['count']] <-  predict(fit, new_data)
print(new_data)

results in

          date     count
 1 1.398816e+18 4.8677043
 2 1.337040e+18 0.6945525

This seems similar to this and this, but nothing in those questions solves my situation.

  • Remove the constant with `fit <- lm(count~date - 1, data=df)` and R should give the same output as Python: `predict(fit, new_data)# 1 2 12.38734 11.84028` – akrun May 24 '21 at 23:17
  • You only have two data points with 2 parameters to estimate. The statsmodels API will have to drop the intercept by default, i.e. set it to zero, so as to estimate all the other necessary statistics. For example, compare Python's `fit.summary()` to R's `summary(fit)`; you will notice that the two are different. Either you need to add more data or you have to remove the intercept from R's model, as shown by akrun – Onyambu May 24 '21 at 23:26
  • Also, when you include a constant, the design matrix will be very badly scaled, `1` versus `1e+18`. statsmodels `summary` should warn about a huge condition number in that case. – Josef May 25 '21 at 00:24
  • Many thanks for these responses, @Onyambu (and akrun and Josef). There's something I don't understand here. My two data points define a line, and that line has an intercept. I expected the statsmodels API to find that line and intercept. I understand that some of the fit statistics will be underspecified because the sample is small; I expected it to issue a warning or error, or fill those statistics with NaNs or something, not return a line that doesn't go through my two points. Perhaps statsmodels is the wrong tool for this job. – Timothy W. Hilton May 25 '21 at 15:19
  • In that case you will have to use `sklearn.linear_model.LinearRegression`. That will fit the line without needing the other statistics – Onyambu May 25 '21 at 15:49
  • @Onyambu - again, many thanks; this was really helpful. – Timothy W. Hilton May 25 '21 at 15:53
  • @Josef, thanks also for your comment. This is a units problem, it turns out... Those huge date numbers are nanoseconds since the UNIX epoch (pandas.Timestamp values converted to ints). Converting them to days since 1 Jan 2010 and re-running the code returns a line through the two points, including the intercept. This fit matches the R values. – Timothy W. Hilton May 25 '21 at 15:56
  • statsmodels does what it is asked to do, which is to fit the model y = b * x + u. This is known as regression through the origin. If you want y = a + b * x + u then your formula should be `'count~1+date'`, in which case it completely agrees with what R does. The big difference is that R defaults to a constant, and so to get regression through the origin you need to use a `-1` in your formula. statsmodels tries to follow "Explicit is better than implicit" from the Zen of Python, which means we try to do what the user explicitly asks for. – Kevin S May 26 '21 at 08:13
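
For anyone landing here later, the points raised in the comments can be checked with a small NumPy-only sketch. It is not the statsmodels code path, just an illustration under a few assumptions: the dates are treated as nanoseconds since the UNIX epoch, the helper `to_days` and the 1 Jan 2010 origin are hypothetical choices taken from the comment above, and `np.linalg.lstsq` stands in for the fitting routine. It shows that (1) a no-intercept fit reproduces the Python numbers in the question, (2) the design matrix built from the raw dates has a huge condition number, and (3) rescaling the dates to days recovers the line through both points, matching the R predictions.

```python
import numpy as np

# Data from the question; dates assumed to be nanoseconds since the UNIX epoch.
date_ns = np.array([1.5488064e18, 1.5043968e18])
count = np.array([15.0, 12.0])
new_ns = np.array([1.398816e18, 1.337040e18])

# (1) Regression through the origin: count = b * date, no intercept.
b_origin = (date_ns @ count) / (date_ns @ date_ns)
pred_origin = b_origin * new_ns  # compare with the Python output in the question

# (2) Condition number of the design matrix [1, date] using the raw dates.
X_raw = np.column_stack([np.ones_like(date_ns), date_ns])
cond_raw = np.linalg.cond(X_raw)


def to_days(ns, origin="2010-01-01"):
    """Hypothetical helper: nanoseconds since the epoch -> days since `origin`."""
    origin_ns = np.datetime64(origin, "ns").astype("int64")
    return (np.asarray(ns, dtype=float) - float(origin_ns)) / 8.64e13


# (3) Same model with an intercept, but with dates rescaled to days.
X = np.column_stack([np.ones(2), to_days(date_ns)])
(intercept, slope), *_ = np.linalg.lstsq(X, count, rcond=None)
pred_scaled = intercept + slope * to_days(new_ns)  # compare with the R output
cond_scaled = np.linalg.cond(X)

print("through origin:", pred_origin)
print("rescaled, with intercept:", pred_scaled)
print("condition numbers (raw, rescaled):", cond_raw, cond_scaled)
```

Any rescaling that brings the date column to a magnitude comparable to the constant column should behave similarly; days since 2010 is simply the choice mentioned in the thread.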
