Consistency of categorical encodings in h2o (and R) for training and new test sample

Question

I'm having trouble understanding whether I need to be consistent with the categorical / factor encodings of variables. By consistency I mean that the mapping between integer codes and levels must be the same in the training sample and the new test sample.

This answer seems to suggest that it is not necessary. On the contrary, this answer suggests that it is indeed necessary.

Suppose I have a training sample with an xcat that can take the values a, b, c. The expected result is that the y variable will tend to take values close to 1 when xcat is a, 2 when xcat is b, and 3 when xcat is c.

First I'll create the dataframe, pass it to h2o and then encode with the function as.factor:

library(h2o)
localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)

n <- 20
dep <- sample(1:3, size = n, replace = TRUE)    # underlying group: 1, 2 or 3
xcat <- letters[dep]                            # "a", "b" or "c"
xnum <- sample(1:10, size = n, replace = TRUE)
y <- dep + rnorm(n, mean = 0, sd = 0.3)         # y is close to 1, 2 or 3

df <- data.frame(xcat=xcat, xnum=xnum , y=y)
df.hex <- as.h2o(df, destination_frame="df.hex")

#Encode as factor. You will get: a=1, b=2, c=3
df.hex[ , "xcat"] = as.factor(df.hex[, "xcat"])

Now I'll fit a GLM and predict on the same sample:

x = c("xcat", "xnum")
glm <- h2o.glm( y = c("y"), x = x, training_frame=df.hex, 
               family="gaussian", seed=1234)

glm.fit <- h2o.predict(object=glm, newdata=df.hex)

glm.fit gives the expected results (no surprises here).

Now I'll create a new test dataset that only has a and c, no b value:

xcat2 = c("c", "c", "a")
xnum2 = c(2, 3, 1)
y = c(1, 2, 1) #not really needed
df.test = data.frame(xcat=xcat2, xnum=xnum2, y=y)
df.test.hex <- as.h2o(df.test, destination_frame="df.test.hex")
df.test.hex[ , "xcat"] = as.factor(df.test.hex[, "xcat"])

Running str(df.test.hex$xcat) shows that this time the factor encoding has assigned 2 to c and 1 to a. This looked like it could be trouble, but then the fitting works as expected:

test.fit = h2o.predict(object=glm, newdata=df.test.hex)
test.fit
#gives 2.8, 2.79, 1.21  as expected
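
The differing integer codes can be reproduced in base R without an H2O cluster. The sketch below is only an analogue of what `as.factor` appears to do on the H2O frame, on the assumption that codes are assigned from the sorted set of labels present in each sample. All variable names are illustrative.

```r
# Base R sketch: integer codes depend on which labels are present in the sample.
train_x <- factor(c("a", "b", "c"))
test_x  <- factor(c("c", "c", "a"))

train_codes <- as.integer(train_x)   # codes under levels a, b, c
test_codes  <- as.integer(test_x)    # codes under levels a, c only

# Rebuild the test factor on the training levels so the codes line up again.
test_aligned  <- factor(as.character(test_x), levels = levels(train_x))
aligned_codes <- as.integer(test_aligned)

print(test_codes)
print(aligned_codes)
```

The labels are what carry the meaning. The integer codes are just positions in each factor's own level list, so they are only comparable between two factors that share the same levels.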

What's going on here? Is it that the glm model carries around the information of levels of the x variables so it doesn't mind if the internal encoding is different in the training and the new test data? Is that the general case for all h2o models?
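
For comparison, base R's `lm` behaves in a similar way when the test levels are a subset of the training levels. The fitted model stores the training levels in `xlevels`, and `predict` re-maps the factor in `newdata` by label. The sketch below is a base R illustration only, with made-up data. It does not show how H2O implements this internally.

```r
# Base R sketch: predict() matches factor levels by label, not by integer code.
set.seed(1)
dep   <- rep(1:3, each = 20)                       # balanced groups 1, 2, 3
train <- data.frame(xcat = factor(letters[dep]),   # "a", "b", "c"
                    y    = dep + rnorm(length(dep), mean = 0, sd = 0.3))

fit <- lm(y ~ xcat, data = train)
print(fit$xlevels)                  # training levels stored with the model

# The test factor has only the levels "a" and "c", so "c" gets code 2 here.
test <- data.frame(xcat = factor(c("c", "c", "a")))
print(as.integer(test$xcat))

# Predictions still land near 3, 3 and 1 because matching is done on labels.
pred <- predict(fit, newdata = test)
print(round(unname(pred), 2))
```

This only covers test levels that were already seen in training. A label that is absent from `fit$xlevels` makes `predict` stop with an error.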

From looking at one of the answers I linked above, it seems that at least some R models do require consistency.

Thanks and best!

You may get more answers on https://stats.stackexchange.com/ on why the need is there. From what I remember, R's glm does need to see the same values. By default, H2O will throw an exception if it sees new values at test time. You can disable that by calling `setConvertUnknownCategoricalLevelsToNa(true)` on `EasyPredictModelWrapper`. - wishihadabettername, Nov 28 '17 at 17:41
Also, here's a workaround if you want to use a glm model for data with new factor values (I did it for working with R only, so I am not 100% sure it will work, but you can try it quickly nonetheless): `model <- glm (.....); model$xlevels[["MYFIELD"]] <- union(model$xlevels[["MYFIELD"]], levels(testdata$MYFIELD))`. This will augment the levels in the already trained model with the new ones present only in the test data. - wishihadabettername, Nov 28 '17 at 17:46
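
A more conservative option in base R is to turn unseen labels into NA before predicting, in the spirit of the convert-unknown-levels-to-NA setting mentioned above. This is a sketch of a different technique from the `xlevels` union trick, not a replacement endorsed by either library. The data and names are made up for illustration.

```r
# Base R sketch: mask labels the model has never seen as NA before predicting.
set.seed(2)
train <- data.frame(xcat = factor(rep(c("a", "b"), each = 10)),
                    y    = rep(c(1, 2), each = 10) + rnorm(20, mean = 0, sd = 0.1))
fit <- glm(y ~ xcat, data = train, family = gaussian())

# "c" was never seen in training, so predict() stops with an error.
test <- data.frame(xcat = factor(c("a", "c")))
res  <- tryCatch(predict(fit, newdata = test), error = function(e) e)
print(inherits(res, "error"))

# Rebuild the factor on the training levels; labels outside them become NA.
known     <- fit$xlevels[["xcat"]]
test$xcat <- factor(as.character(test$xcat), levels = known)

# Rows with an NA level get an NA prediction instead of failing the whole call.
pred <- predict(fit, newdata = test)
print(pred)
```

The rows with unseen labels are then easy to find with `is.na(pred)` and can be handled separately.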
