
I am using statsmodels to train a linear quantile regression. I have different combinations of features that I need to try and train, but it is as if statsmodels only allows a certain number of features to be included in the model. I have attached a dummy example below.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Construct dummy data: 10 observations, 24 candidate features
df_dum = pd.DataFrame(columns=['y'])
df_dum['y'] = np.random.normal(size=10)
cols = ['x1','x2','x3','x4','x5','x6','x7','x8','x9','x10','x11','x12','x13','x14','x15','x16','x17','x18','x19','x20','x21','x22','x23','x24']

for col in cols:
    df_dum[col] = np.random.normal(size=10)

# specify feature constellations
feat1 = ['x1','x2','x3','x4','x5','x6','x7','x8','x9','x10']
feat2 = ['x1','x2','x3','x4','x5','x6','x7','x8','x9','x10','x11','x12','x13','x14','x15']

mod1 = sm.QuantReg(df_dum['y'], df_dum[feat1]).fit(q=0.50)
mod2 = sm.QuantReg(df_dum['y'], df_dum[feat2]).fit(q=0.50)

Here mod1 runs without a problem, but mod2 gives me the error: `ValueError: operands could not be broadcast together with shapes (15,) (10,)`. So it is as if statsmodels remembers the number of features from the previous model?

andKaae
  • That means your design matrix `feat2` is singular. This usually means that you have some perfectly correlated variables and the rank is smaller than the number of columns. In this case you have more explanatory variables than observations, so the matrix cannot have full column rank. – Josef Jun 04 '21 at 12:16
  • @Josef you are very right. In this case, increasing the number of observations solves the problem. However, for my real-life problem, this is not the issue. I have both categorical and continuous features. When I train the model with only the categorical features I get no error; when I do it with only the continuous ones I get the error. It might be because of correlated features; is there no way to still model with these features? – andKaae Jun 05 '21 at 15:34
  • The ValueError only shows up with "perfect" correlation, as defined by the noise threshold in the numpy linalg functions. Quantile regression does not work with this, and you will need to drop one of the almost perfectly correlated variables. – Josef Jun 05 '21 at 19:22
  • also check the scale of your explanatory variables. correlation measures like condition number are scale sensitive and can be very large if the scale of variables differs by many orders of magnitude. example https://stackoverflow.com/questions/67679877/python-statsmodel-api-ols-vs-r-lm#comment119628199_67679877 – Josef Jun 05 '21 at 19:28
  • @Josef, after playing around with removing some of the features I can see that you are right. The problem is just that I cannot remove these features. The features I have problems with are spatial features, i.e. distances to points of interest (food, hospitals, schools). As my data is about charging stations, the distance to these does not change over time and will be the same for all events at a station, which causes the correlation. I do, however, need these features, as I want to be able to predict for a new station with different distances to the POIs. – andKaae Jun 06 '21 at 06:40
  • You need to have a larger dataset to include all of the features. You should have at least as many observations as you have features, and realistically many more. – Kevin S Jun 08 '21 at 13:33

0 Answers