Document image format expectations for Bumblebee.Vision.ImageClassification
I would like to contribute some documentation that clarifies the expected image format for `Bumblebee.Vision.image_classification`. The type `t:Bumblebee.Vision.image/0` says:

```elixir
@type image() :: Nx.Container.t()
```

> A term representing an image. Either an `Nx.Tensor` in HWC order or a struct implementing `Nx.Container` and resolving to such a tensor.
However it does not clarify:
- If the image should be resized first to the same size as that used to train the model (224 x 224 for the resnet models?)
- Whether the image data should be `{:u, 8}` or some other type (some models suggest data should be in the range `[0.0..1.0]`)
- Whether the image can have an alpha channel (reading the code suggests yes, but perhaps that is model dependent)
- Whether the image should be preprocessed (this Stack Overflow article suggests it should be)
If I can get some guidance I'll write a doc PR.
Hey Kip! The image doesn't need to be particularly normalized, because it first goes through a featurizer. In other words, it's not the direct model input, but a plain image as pixels. In fact, the type is `Nx.Container.t()`, because it may also be a struct that implements `Nx.Container`, which we already do for `StbImage` (ref).
A featurizer usually casts to float, resizes, and scales into `[0.0, 1.0]`. Whether an alpha channel is used is usually up to the model configuration. So I think the only generally applicable expectation is that the image values are in `0..255`.
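For illustration, the flow described above might look like this (a minimal sketch; the `microsoft/resnet-50` checkpoint and the image file name are chosen as assumptions for the example):

```elixir
# Load a model together with its matching featurizer
# (checkpoint picked purely for illustration)
{:ok, model_info} = Bumblebee.load_model({:hf, "microsoft/resnet-50"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "microsoft/resnet-50"})

serving = Bumblebee.Vision.image_classification(model_info, featurizer)

# The image is passed as plain pixels; the featurizer takes care of
# casting to float, resizing, and scaling into [0.0, 1.0] internally.
image = StbImage.read_file!("cat.jpg")
Nx.Serving.run(serving, image)
```

The point is that no manual resizing or normalization happens before `Nx.Serving.run/2`; the serving applies the featurizer itself.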
A PR improving the docs is welcome!
@jonatanklosko, TIL what a featurizer is! I suppose the assumption is also that the channel order is RGB (not BGR). I'll work on a doc PR this weekend. For validation, then, the input image has the following assumed characteristics:
- HWC order
- RGB color (channel order, not CMYK or some other color space)
- Alpha channel support is model specific
- `{:u, 8}`, `{:f, 32}`, or `{:f, 64}` data type
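Concretely, a tensor meeting these expectations might be built like this (a hypothetical 2x2 RGB image, just to make the shape and type explicit):

```elixir
# Shape {2, 2, 3} = {height, width, channels}, RGB channel order,
# values in 0..255 as unsigned 8-bit integers
image =
  Nx.tensor(
    [
      [[255, 0, 0], [0, 255, 0]],
      [[0, 0, 255], [255, 255, 255]]
    ],
    type: {:u, 8}
  )
```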
Thanks for the continuing education and the great library.
The type is not as strict; pretty much any `:u` or `:s` type would do. Other than that, sounds good!