About y_bpp and the Gaussian entropy model
Hello, I have a question about y_bpp and normalization.
In the implementation of the Gaussian entropy model in CompressAI, y_bpp is computed by estimating the likelihood after normalizing the input:
half = float(0.5)
if means is not None:
    values = inputs - means
else:
    values = inputs

scales = self.lower_bound_scale(scales)

values = torch.abs(values)
upper = self._standardized_cumulative((half - values) / scales)
lower = self._standardized_cumulative((-half - values) / scales)
likelihood = upper - lower
Does this mean that during actual training, it is not (y - means) / scales but rather torch.abs(y - means) / self.lower_bound_scale(scales) that is fitted to the standard normal distribution?
I need to normalize the latent variable y to obtain a standard spherical normal vector for calculating the spatial correlation.
Because of the symmetry of the normal distribution, values = torch.abs(values) has no effect on the result:
upper - lower
= _standardized_cumulative( z + 0.5) - _standardized_cumulative( z - 0.5)
= _standardized_cumulative(-z + 0.5) - _standardized_cumulative(-z - 0.5)
# Proof left as exercise. (Hint: the standard normal CDF satisfies Phi(x) = 1 - Phi(-x).)
The only reason abs is used here is to improve numerical stability for computing upper - lower. By using abs, the output of standardized_cumulative becomes closer to 0. So for example, 0.000000003 - 0.000000002 gives more precise results than 0.999999998 - 0.999999997.
I think the improvement in numerical stability is probably quite minimal for typical cases, so you can probably omit it if it makes it easier.
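Here is a quick numerical sanity check of that claim (a standalone sketch, not the CompressAI code; standardized_cumulative is re-implemented below using the usual erfc-based formulation of the standard normal CDF, and bin_likelihood is just an illustrative name):

import torch

def standardized_cumulative(x):
    # Standard normal CDF via the complementary error function,
    # which is more accurate in the tails than using the CDF directly.
    const = -(2 ** -0.5)
    return 0.5 * torch.erfc(const * x)

def bin_likelihood(values, scales, use_abs):
    # Probability mass of the unit-width bin centered on `values`.
    if use_abs:
        values = torch.abs(values)
    upper = standardized_cumulative((0.5 - values) / scales)
    lower = standardized_cumulative((-0.5 - values) / scales)
    return upper - lower

values = 3.0 * torch.randn(1000)   # toy residuals y - means
scales = 0.11 + torch.rand(1000)   # toy predicted scales
print(torch.allclose(bin_likelihood(values, scales, True),
                     bin_likelihood(values, scales, False)))  # -> True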
Thank you for your response. So torch.abs does not impose any additional constraint on the distribution of y.
Additionally, I am curious: if I aim to normalize y so that it follows a standard normal distribution, should I take the lower bound applied to the scales into account?
I'm not sure what you mean about normalizing y. Could you clarify?
Note that y (minus mean) gets quantized to integer values like [..., -2, -1, 0, 1, 2, ...]. If it's normalized, this makes it hard to represent channels that have low or high entropy. In fact, if y perfectly matches the standard normal distribution, all channels will be expected to have the exact same entropy of roughly 2.1 bits per element. The hyperprior itself won't be effective either, since it would always predict scales=1.
>>> import numpy as np
>>> from scipy.stats import norm
>>> num_bins = 17
>>> bin_centers = np.arange(num_bins) - num_bins // 2
>>> def H(p): return -(p * np.log2(p)).sum()
>>> H(norm.cdf(bin_centers + 0.5) - norm.cdf(bin_centers - 0.5))
2.1048326541776676
The lossless entropy coder works on quantized values symbols = (y - mean).round(). So, symbols is expected to take values like [..., -2, -1, 0, 1, 2, ...]. A few parts of the code make use of this assumption, including:
- Simulating quantization: additive noise adds +/- 0.5; and STE uses .round().
- The scale_table is initialized logarithmically in the range [0.11, 64]. That means the lossless entropy coder supports distributions whose typical number of symbols is somewhere in the range [1, ~1000]. For scale=0.11, it is expected that all symbols will fit in the mean (0) bin with 9-sigma probability; i.e., non-zero values are very unlikely to occur (a quick check of this follows the scale_table listing below). (When very-low-probability symbols occur outside the supported range, they are instead encoded using the bypass mode of the lossless entropy coder, which encodes such out-of-bounds symbols using exp-Golomb coding, IIRC.)
>>> scale_table = np.logspace(np.log10(0.11), np.log10(64))  # 50 levels by default
>>> np.set_printoptions(precision=3)
>>> scale_table
array([ 0.11 , 0.125, 0.143, 0.162, 0.185, 0.211, 0.24 , 0.273,
0.311, 0.354, 0.403, 0.459, 0.523, 0.596, 0.678, 0.772,
0.879, 1.001, 1.14 , 1.299, 1.479, 1.684, 1.917, 2.183,
2.486, 2.831, 3.224, 3.672, 4.181, 4.761, 5.422, 6.174,
7.03 , 8.006, 9.116, 10.381, 11.821, 13.461, 15.329, 17.456,
19.878, 22.635, 25.776, 29.352, 33.424, 38.061, 43.342, 49.355,
56.203, 64. ])
>>> 1 / scale_table[0]
9.090909090909092
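As a rough check of the 9-sigma remark above (a quick scipy calculation added here for illustration; p_nonzero is an ad-hoc name):

from scipy.stats import norm

# Probability that a zero-mean Gaussian symbol with scale 0.11 rounds to
# something other than 0, i.e. falls outside the ~9-sigma-wide central bin.
p_nonzero = 2 * norm.sf(0.5 / 0.11)
print(p_nonzero)  # roughly 5e-06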
The lower_bound_scale = LowerBound(scale_table[0]) acts as a safeguard. It does two things: (i) during training, it terminates any gradients that would nudge the model towards producing scales that are below the lower bound; and (ii) during evaluation, it clips scales that are below the lower bound.
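Conceptually, that behaviour can be sketched with a small custom autograd function like the one below (a simplified stand-in, not CompressAI's actual LowerBound code):

import torch

class _LowerBoundSketch(torch.autograd.Function):
    # Clamp values to a lower bound, but let gradients through whenever the
    # input is already above the bound or the gradient would push it upward.

    @staticmethod
    def forward(ctx, x, bound):
        ctx.save_for_backward(x, bound)
        return torch.max(x, bound)          # (ii) clip below-bound values

    @staticmethod
    def backward(ctx, grad_output):
        x, bound = ctx.saved_tensors
        # (i) keep the gradient only if x >= bound, or if gradient descent
        # (x <- x - lr * grad) would move x back up towards the bound.
        pass_through = (x >= bound) | (grad_output < 0)
        return pass_through.type_as(grad_output) * grad_output, None

def lower_bound(x, bound=0.11):
    return _LowerBoundSketch.apply(x, torch.tensor(bound, dtype=x.dtype))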
Interestingly, there's no UpperBound enforcing the same things. However, it's presumably much rarer for distributions derived from 8-bit image data to need encoding via such wide high-entropy distributions, so the effect of it would probably be minimal during training. And during evaluation, the table indexes are computed so that the scale=64 distribution gets used for scales above scale=64. (Actually, I'm surprised it doesn't bin scales into indexes using the geometric mean between relevant scales or some other information-theoretically better boundary. Theoretically, if the predicted distributions accurately reflect reality, wouldn't it result in rate savings to encode using distributions that more accurately match the predicted scales/distributions?)
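For reference, the index assignment can be sketched roughly as follows (a simplified stand-in for GaussianConditional.build_indexes; scale_table is assumed to be a sorted list of floats like the one printed above):

import torch

def build_indexes_sketch(scales, scale_table):
    # Map each predicted scale to the smallest table entry that is >= it;
    # anything above the largest entry falls into the last (scale=64) bin.
    scales = torch.clamp(scales, min=scale_table[0])
    indexes = torch.full_like(scales, len(scale_table) - 1, dtype=torch.long)
    for s in scale_table[:-1]:
        indexes -= (scales <= s).long()
    return indexes

In other words, each predicted scale is rounded up to the next table entry rather than to a boundary such as the geometric mean between neighbouring entries.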
Thank you for your response.
In Transformer-based Transform Coding (ICLR 2022), it is noted that the effectiveness of the analysis transform g_a can be evaluated by measuring how much correlation remains among different elements of y. Thus, the standardized representation of y, i.e., (y - mean) / scale, is used to compute the spatial correlation of y across different positions.
Theoretically, (y - mean) / scale should follow a standard normal distribution. In practice, however, constraints are applied during the bpp calculation, and optimization only pushes it towards that distribution. From your explanation, it appears that torch.abs imposes no direct constraint on the distribution, while the LowerBound function blocks gradients for scales below the specified bound during training.
Therefore, the difference between (y - mean) / scale and (y - mean) / lower_bound(scale) may be very small? And correspondingly, if some predicted scales do fall below the lower bound, it seems that I should use (y - mean) / lower_bound(scale) instead?
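For reference, here is a sketch of what I mean by normalizing y (the names are illustrative, means/scales are whatever the entropy-parameter network predicts, and the 0.11 bound mirrors scale_table[0]):

import torch

def standardized_latent(y, means, scales, scale_bound=0.11):
    # Standardize y with the same quantities the entropy model uses:
    # subtract the predicted mean and divide by the lower-bounded scale.
    scales = torch.clamp(scales, min=scale_bound)
    return (y - means) / scales

def spatial_correlation(z, offset=1):
    # Correlation between each element of z (B, C, H, W) and its horizontal
    # neighbour `offset` positions away, pooled over batch/channels/space.
    a = z[..., :, :-offset].reshape(-1)
    b = z[..., :, offset:].reshape(-1)
    return torch.corrcoef(torch.stack([a, b]))[0, 1]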