About 12-bit image compression
I created a function to normalize a 12-bit image to the range [0, 1] and use it as input for the network. I wonder whether this is sufficient, or whether certain parameters in the entropy model (such as the scale_table, SCALES_MIN=0.11, SCALES_MAX=256, and SCALES_LEVELS=64) also need to be adjusted to accommodate the larger dynamic range.
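For concreteness, here is a minimal sketch of the kind of normalization described above (the function name and the use of PyTorch tensors are my own assumptions, not the asker's actual code):

```python
import torch

def normalize_12bit(x: torch.Tensor) -> torch.Tensor:
    """Map raw 12-bit integer samples (0..4095) to floats in [0, 1]."""
    return x.float() / (2**12 - 1)
```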
Theoretically, if the bottleneck has enough capacity (i.e., enough channels), it shouldn't be necessary to change the SCALES_* values. For instance, consider two channels c1 and c2 that are both 6-bit (i.e., SCALES_MAX = 64). Then, the first convolution layer in the g_s decoder may combine them into a 12-bit channel by doing:

```python
y_c1_c2_12bit_combination = 2**6 * y[:, c1] + y[:, c2]
```

This function can be learned by, e.g., a 1x1 convolution with the weight vector:

```python
w = y.new_zeros(y.shape[1])  # one weight per latent channel, all initialized to zero
w[c1] = 2**6
w[c2] = 1
y_c1_c2_12bit_combination = (w[:, None, None] * y).sum(dim=1)
```
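To make the link to an actual layer explicit, here is a small sketch (my own illustration, not code from the answer above) of how a 1x1 nn.Conv2d could be initialized to compute the same combination; the latent shape and channel indices are arbitrary examples:

```python
import torch
from torch import nn

N, C, H, W = 1, 192, 16, 16                      # example latent shape (assumed)
c1, c2 = 0, 1                                    # hypothetical channel indices
y = torch.randint(0, 64, (N, C, H, W)).float()   # two 6-bit channels among C

conv = nn.Conv2d(C, 1, kernel_size=1, bias=False)
with torch.no_grad():
    conv.weight.zero_()
    conv.weight[0, c1, 0, 0] = 2**6
    conv.weight[0, c2, 0, 0] = 1

combined = conv(y)[:, 0]                         # equals 2**6 * y[:, c1] + y[:, c2]
```

Nothing forces a trained decoder to learn exactly these weights; the sketch only shows that the mapping is representable by the first g_s layer.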
However, that's just a theoretical equivalence. In practice, you'll have to try it out to see which one leads to a better trained model.
I am having some difficulty understanding your explanation. Could you clarify it further? For instance, how would one generate two 6-bit channels from a 12-bit image: divide it, normalize each channel, feed them into g_a to obtain y, and then sum them up? This approach seems quite similar to directly normalizing the entire 12-bit image, so what distinguishes the two methods?
Additionally, I've noticed that parameters like the scale table only seem to appear when the model is updated. Could I interpret this as simply a predefined set of values applied to (y - mean).round() for encoding purposes?
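For context on what the scale table contains: in CompressAI the default table is built from the SCALES_* constants mentioned earlier, roughly like the sketch below (paraphrased from memory, so treat the exact signature as an assumption):

```python
import math
import torch

SCALES_MIN = 0.11
SCALES_MAX = 256
SCALES_LEVELS = 64

def get_scale_table(min=SCALES_MIN, max=SCALES_MAX, levels=SCALES_LEVELS):
    # Log-spaced grid of Gaussian scales used to discretize the predicted
    # sigmas when building the entropy coder's CDF tables at update() time.
    return torch.exp(torch.linspace(math.log(min), math.log(max), levels))
```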
Furthermore, when normalizing a 12-bit input to the range [0, 1] and feeding it into the network g_a to generate y, will the resulting distribution of y differ significantly from the distribution obtained from an 8-bit image? I am curious whether such differences would necessitate modifications to other parts of the model.
If the scale table serves solely as a constraint on (y - mean).round() during encoding, is there any inherent difference between compressing a normalized 12-bit image and a normalized 8-bit image?
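One way to answer the distribution question empirically is to pass a normalized 8-bit image and a normalized 12-bit image through the same encoder and compare per-channel statistics of y. A minimal sketch (the model and image variable names are hypothetical):

```python
import torch

def latent_stats(model, x):
    # x: normalized image tensor of shape (1, 3, H, W) with values in [0, 1]
    with torch.no_grad():
        y = model.g_a(x)
    return y.mean(dim=(0, 2, 3)), y.std(dim=(0, 2, 3))

# Hypothetical usage with a trained CompressAI-style model `net`:
# mean_8,  std_8  = latent_stats(net, x_8bit_normalized)
# mean_12, std_12 = latent_stats(net, x_12bit_normalized)
# Noticeably larger spreads for the 12-bit input would suggest checking whether
# the scales predicted by the hyperprior still fall within [SCALES_MIN, SCALES_MAX].
```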