I'm working on a problem to forecast future electronics store sales from historical data. One of the features I'm using is item price (float). I've found experimentally that adding it to my existing list of features degrades both the fitting and validation accuracy (increases prediction RMSE) of my XGBoost model. I suspect that the impact of price may be highly non-linear, with peaks at the typical prices of memory sticks, laptops, cell phones, etc.
Anyway, I got the following idea to cope with this: what if I convert the float item price to a categorical variable, with the ability to specify the mapping (e.g., ranges of values or deciles)? Then I could mean-encode that categorical item-price variable using the training target values.
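To make the idea concrete, here is a minimal sketch of what I have in mind, assuming pandas. The toy data, the column names (`item_price`, `sales`, `price_bin`, `price_bin_enc`) and the bin edges are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy training data; column names and values are assumptions.
train = pd.DataFrame({
    "item_price": [5.0, 12.0, 15.0, 300.0, 650.0, 700.0, 999.0, 1200.0],
    "sales":      [40,  35,   30,   8,     5,     6,     2,     1],
})

# Explicit price ranges (assumed edges): (0, 50], (50, 500], (500, 2000].
# labels=False returns the integer bin index instead of an Interval label.
edges = [0, 50, 500, 2000]
train["price_bin"] = pd.cut(train["item_price"], bins=edges, labels=False)

# Mean encoding: replace each bin by the mean target observed in that bin,
# computed on the training rows only.
bin_means = train.groupby("price_bin")["sales"].mean()
train["price_bin_enc"] = train["price_bin"].map(bin_means)

print(train)
```

Here the encoding is computed on the same rows it is applied to, which is the simplest version; I assume that in practice it would have to be computed out-of-fold to limit target leakage.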
Does this make sense? Could you give me a pointer to a Python "linear/decile histogrammer" that, for a list of float quantities, returns a parallel list saying which bin/decile each float belongs to?
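The behaviour I'm after is roughly the following, written as a rough NumPy sketch. The helper name `bin_indices` and its `n_bins` and `mode` parameters are hypothetical, invented just to pin down what I mean by "linear" (equal-width) versus "decile" (equal-frequency) binning.

```python
import numpy as np

def bin_indices(values, n_bins=10, mode="decile"):
    """Return, for each value, the index (0..n_bins-1) of the bin it falls in.

    Hypothetical helper: mode="linear" uses equal-width bins between the
    min and max, mode="decile" uses equal-frequency (quantile) bins.
    """
    x = np.asarray(values, dtype=float)
    if mode == "linear":
        edges = np.linspace(x.min(), x.max(), n_bins + 1)
    elif mode == "decile":
        edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Use only the interior edges, so the minimum lands in bin 0
    # and the maximum lands in bin n_bins - 1.
    return np.digitize(x, edges[1:-1], right=True).tolist()

prices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(bin_indices(prices, n_bins=5, mode="linear"))
```

As far as I can tell, `pandas.cut(values, bins, labels=False)` and `pandas.qcut(values, q, labels=False)` may already cover the linear and decile cases respectively, but I'd appreciate confirmation or a better pointer.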