I'm having trouble understanding whether I need to keep the categorical / factor encodings of variables consistent. By consistent I mean that the mapping between integers and levels must be the same in the training sample and in the new test sample.
This answer seems to suggest that it is not necessary. On the contrary, this answer suggests that it is indeed necessary.
Suppose I have a training sample with a categorical variable xcat that can take the values a, b, c. The expected result is that the y variable will tend to take values close to 1 when xcat is a, 2 when xcat is b, and 3 when xcat is c.
First I'll create the dataframe, pass it to h2o, and then encode the variable with the function as.factor:
library(h2o)
localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)
n = 20
y <- sample(1:3, size = n, replace = T)
xcat <- letters[y]
xnum <- sample(1:10, size = n, replace = T)
y <- y + rnorm(n, mean = 0, sd = 0.3)
df <- data.frame(xcat=xcat, xnum=xnum , y=y)
df.hex <- as.h2o(df, destination_frame="df.hex")
#Encode as factor. You will get: a=1, b=2, c=3
df.hex[ , "xcat"] = as.factor(df.hex[, "xcat"])
Now I'll estimate it with a glm model and predict on the same sample:
x = c("xcat", "xnum")
glm <- h2o.glm( y = c("y"), x = x, training_frame=df.hex,
family="gaussian", seed=1234)
glm.fit <- h2o.predict(object=glm, newdata=df.hex)
glm.fit
gives the expected results (no surprises here).
Now I'll create a new test dataset that only has a and c, no b value:
xcat2 = c("c", "c", "a")
xnum2 = c(2, 3, 1)
y = c(1, 2, 1) #not really needed
df.test = data.frame(xcat=xcat2, xnum=xnum2, y=y)
df.test.hex <- as.h2o(df.test, destination_frame="df.test.hex")
df.test.hex[ , "xcat"] = as.factor(df.test.hex[, "xcat"])
Running str(df.test.hex$xcat) shows that this time the factor encoding has assigned 2 to c and 1 to a. This looked like it could be trouble, but then the prediction works as expected:
test.fit = h2o.predict(object=glm, newdata=df.test.hex)
test.fit
#gives 2.8, 2.79, 1.21 as expected
What's going on here? Does the glm model carry around the levels of the x variables, so that it doesn't matter if the internal integer encoding differs between the training data and the new test data? Is that the general case for all h2o models?
From looking at one of the answers I linked above, it seems that at least some R models do require consistency.
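To make the concern concrete, here is a minimal base-R sketch (no h2o involved) of the mismatch I am worried about. The names train_x, test_x and test_aligned are only illustrative and do not appear in my code above. It shows how the integer codes depend on which levels are present, and how the codes line up again if the test factor is re-declared with the training levels:

```r
# Base-R sketch (no h2o): integer codes depend on the levels present in the data.
train_x <- factor(c("a", "b", "c", "a"))   # levels: a, b, c
test_x  <- factor(c("c", "c", "a"))        # levels: a, c (no b)

train_codes <- as.integer(train_x)   # "c" is code 3 here
test_codes  <- as.integer(test_x)    # "c" is code 2 here

# Re-declaring the test factor with the training levels restores matching codes.
test_aligned  <- factor(test_x, levels = levels(train_x))
aligned_codes <- as.integer(test_aligned)

print(levels(test_x))
print(test_codes)
print(aligned_codes)
```

Whether h2o does something equivalent internally (matching on level names rather than on integer codes) is exactly what I am trying to find out.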
Thanks and best!