Replace mt_merge blending formula
As mentioned here, mt_merge uses a slightly incorrect formula. Test script:
src = mt_lutspa(expr="x 255 *")
mask = mt_lut(y=-255)
mt_merge(mt_lut(y=-0), src, mask)
mt_lutxy(last, src, "x y - abs 75 *").grayscale()
The output clip should be all zero, but it is not.
VapourSynth's formula
dstp[x] = srcp1[x] + (((srcp2[x] - srcp1[x]) * (maskp[x] > 2 ? maskp[x] + 1 : maskp[x]) + 128) >> 8);
seems quite hard, if possible at all, to get right in SIMD using 2 bytes per pixel. At this point I'm tempted to say that masktools' approximation is reasonable.
It's the same old formula, but with the mask taking the values 0, 1, 2, 4, 5, 6, ..., 256 instead of 0 to 255.
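As a rough illustration of that remapping, here is a scalar sketch for 8-bit samples. The names `remap_mask`, `merge_remapped` and `endpoints_are_exact` are made up for this example and do not come from VapourSynth or masktools. The blend is written as a weighted sum so that no negative value is ever shifted; that is arithmetically the same as the quoted formula when the shift floors.

```cpp
#include <cassert>
#include <cstdint>

// Mask remap described above: 0, 1, 2 stay as they are, 3..255 become 4..256.
inline int remap_mask(int mask) {
    assert(mask >= 0 && mask <= 255);
    return mask > 2 ? mask + 1 : mask;
}

// Scalar blend of two 8-bit samples using the remapped mask.
// Equivalent to src1 + (((src2 - src1) * m + 128) >> 8) with a flooring
// shift, but every intermediate value here is non-negative.
inline std::uint8_t merge_remapped(std::uint8_t src1, std::uint8_t src2,
                                   std::uint8_t mask) {
    const int m = remap_mask(mask);
    return static_cast<std::uint8_t>((src1 * (256 - m) + src2 * m + 128) >> 8);
}

// Hypothetical self-check: mask 0 must return src1 and mask 255 must
// return src2 for every pair of 8-bit inputs.
inline bool endpoints_are_exact() {
    for (int a = 0; a < 256; ++a) {
        for (int b = 0; b < 256; ++b) {
            const auto s1 = static_cast<std::uint8_t>(a);
            const auto s2 = static_cast<std::uint8_t>(b);
            if (merge_remapped(s1, s2, 0) != s1) return false;
            if (merge_remapped(s1, s2, 255) != s2) return false;
        }
    }
    return true;
}
```

With mask 255 the remapped weight is 256, so the second clip comes through unchanged, which is exactly what the test script at the top checks.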
__forceinline static __m128i overlay_blend_sse2_core(const __m128i& p1, const __m128i& p2, const __m128i& mask, const __m128i& v128, const __m128i& v257) {
  // (p2 - p1) * mask; mullo keeps only the low 16 bits of each product
  __m128i tmp1 = _mm_mullo_epi16(_mm_sub_epi16(p2, p1), mask);
  // add v128 for rounding, then take the high 16 bits of the unsigned product with v257
  __m128i tmp2 = _mm_mulhi_epu16(_mm_add_epi16(tmp1, v128), v257);
  return _mm_add_epi16(p1, tmp2);
}
Just a note on a reasonably correct implementation. It passes the test above, but I did not test it any further.
The idea is that dividing by 255 can be done by multiplying by 2^16/255 (about 257) and shifting right by 16, hence mulhi_epu16(x, 257).
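The divide-by-255 trick can be checked in scalar code. The sketch below models `_mm_mulhi_epu16` for a single 16-bit lane and compares the biased multiply against plain rounded division. All helper names are invented for this example. It only covers non-negative inputs from 0 to 255 * 255; what happens when the difference of the two pixels is negative is not examined here.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one lane of _mm_mulhi_epu16: the high 16 bits of the
// 32-bit unsigned product.
inline std::uint16_t mulhi_u16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>((static_cast<std::uint32_t>(a) * b) >> 16);
}

// Rounded x / 255 using the +128 bias and the 257 multiplier.
// Valid for x in [0, 255 * 255], so that x + 128 still fits in 16 bits.
inline std::uint16_t div255_round(std::uint16_t x) {
    assert(x <= 255 * 255);
    return mulhi_u16(static_cast<std::uint16_t>(x + 128), 257);
}

// Reference: round-to-nearest x / 255 with ordinary integer division.
inline std::uint16_t div255_reference(std::uint16_t x) {
    return static_cast<std::uint16_t>((2u * x + 255u) / 510u);
}

// Counts the inputs in [0, 255 * 255] where the two versions disagree.
inline int count_div255_mismatches() {
    int mismatches = 0;
    for (std::uint32_t x = 0; x <= 255u * 255u; ++x) {
        const auto v = static_cast<std::uint16_t>(x);
        if (div255_round(v) != div255_reference(v)) ++mismatches;
    }
    return mismatches;
}
```

Over that input range the mismatch count is zero, so for non-negative products the biased multiply by 257 behaves as an exact rounded division by 255.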
The problem is actually quite bad.
Resolving the merge formula for the mask=255 case gives result = ((ovr << 8) + main - ovr + 128) >> 8, so the result becomes ovr+1 or ovr-1 when the main-ovr difference is larger than 127 or less than -128. In other words, about half of the possible differences are affected.
It gets progressively worse: when mask=254, result = ((ovr << 8) + 2*(main - ovr) + 128) >> 8, which puts the thresholds at approximately 64 and -64, so about 75% of the possible differences move the result away from ovr.
This culminates at mask=127 or 129: result = ((ovr << 8) + 129*(main - ovr) + 128) >> 8 (with 127 in place of 129 for mask=129), where any relative luma difference of 2 or more borks the result by at least 1 (99% of outcomes).
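These thresholds can be counted by brute force. The sketch below assumes the current blend is the weighted form implied by the expansions above, (main * (256 - mask) + ovr * mask + 128) >> 8. That form is reconstructed from the quoted formulas, not copied from the masktools source, and the function names are invented for this example. It counts how often the result differs from ovr, once per distinct main-ovr difference and once per (main, ovr) pair.

```cpp
#include <cassert>

// Assumed form of the existing blend, reconstructed from the expansions
// quoted above (not taken from the masktools source).
inline int old_merge(int main_px, int ovr_px, int mask) {
    assert(mask >= 0 && mask <= 255);
    return (main_px * (256 - mask) + ovr_px * mask + 128) >> 8;
}

// Number of distinct differences d = main - ovr in [-255, 255] for which
// the result is not ovr. The offset from ovr depends only on d, so one
// representative pair per difference is enough.
inline int count_differences_off_ovr(int mask) {
    int count = 0;
    for (int d = -255; d <= 255; ++d) {
        const int ovr_px = d < 0 ? -d : 0;
        const int main_px = ovr_px + d;
        if (old_merge(main_px, ovr_px, mask) != ovr_px) ++count;
    }
    return count;
}

// Number of (main, ovr) pairs, out of 256 * 256, whose result is not ovr.
inline int count_pairs_off_ovr(int mask) {
    int count = 0;
    for (int main_px = 0; main_px < 256; ++main_px)
        for (int ovr_px = 0; ovr_px < 256; ++ovr_px)
            if (old_merge(main_px, ovr_px, mask) != ovr_px) ++count;
    return count;
}
```

Under that assumption, mask=255 moves the result for 255 of the 511 possible differences, and mask=254 for 383 of 511. Counted over all pixel pairs instead, mask=255 affects 16384 of 65536, because large differences are rarer than small ones when both inputs are spread uniformly.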
For example, when a static image of any color is overlaid on a dynamic video, the overlaid image will change its colors by 1 whenever the underlying video differs from the overlaid picture by more than the threshold values mentioned above. Depending on the video, such a ±1 change can make the overlay image or clip flicker or otherwise look noticeably ugly.
A reliable test for the new merging formula would be overlaying a full-range horizontal gradient on a full-range vertical gradient:
blankclip(256, 1024, 1024, "yv12")
horiz_gradient = mt_lutspa(expr="x 255 *",u=-128,v=-128)
vert_gradient = mt_lutspa(expr="y 255 *",u=-128,v=-128)
mt_merge(vert_gradient, horiz_gradient, mt_lut(y=-255), true)
To make the artifact more obvious and annoying under realistic conditions, we can make the main video change its luminance randomly and play it back:
blankclip(256, 1024, 1024, "yv12")
horiz_gradient = mt_lutspa(expr="x 255 *",u=-128,v=-128)
vert_gradient = mt_lutspa(expr="y 255 *",u=-128,v=-128).scriptclip("""tweak(bright=rand(255))""")
mt_merge(vert_gradient, horiz_gradient, mt_lut(y=-255), true)