Can the small factors levels limits be increased from <128?
I was looking into lib/factor/factor_v7.cpp and see code like if (*nrOfLevels < 128). In the comment it says
// use 1 byte per int (Na encoding takes 1 bit)
which seems to be "wasting" the other 7 bits once that one bit is used, technically can support up to 256 distinct values (including NA and NaN).
Without much background, I assume it's to do with how R encodes the values, so it's always stored as int instead of unsigned int. I know if it's too expensive in terms of performance to relax this to 256 by converting to unsigned int. I know Julia supports unsigned with its UInt8 type.
Hi @xiaodaigh, thanks for your question. In an R factor, the value NA is encoded as a NA value in the value vector:
# some factor
x <- factor(sample(LETTERS, 10), levels = LETTERS)
# set factor value to NA
x[5] <- NA
# underlying value is set to NA
as.integer(x)
#> [1] 23 22 16 18 NA 5 8 4 2 26
This could have been done better in R I guess, for example by coding the value 0 as the NA, but with the current implementation that leads to an error:
# create factor manually
y <- c(1L, 2L, 3L, 0L, 4L, 5L)
attr(y, "levels") <- LETTERS
attr(y, "class") <- "factor"
print(y)
#> Error in as.character.factor(x): malformed factor
For performance reasons, fst takes bit 32 from the factor values and adds bit 0-7 to that to get a single byte. So these 7 bits can only be used for < 128 levels. I could also re-code value 0 as an NA, but that would require more processing and would reduce the speed of the filter...
Hope that answers your question!