
Make fix_outline always produce correct visual output without aliasing or gaps

Open astiob opened this issue 4 years ago • 27 comments

I’ve ported my ancient alpha branch onto current master and fixed karaoke, so this should now really solve the problem for good.

This would supersede and close #480. It would fix #76 and fix #473.

My only question is whether this PR’s current approaches to caching and producing ASS_Bitmaps are right.

An alternative approach I considered was to include effect_timing in the cache key and have fix_outline take effect_timing and both alphas and internally process the bitmap in two parts, the way karaoke does. This would have required no changes to render_text and avoided phantom ASS_Images. The big drawback would be that \kf syllables would effectively never be cached (except for very slowly-moving \kf).

Ideally, combined glyph/fill bitmaps would be cached separately from border and shadow bitmaps:

  • After this PR, borders and shadows of identical text that differs only in \1a and/or \2a would share (and avoid recomputing) their glyph bitmaps.
  • After this PR, border and shadow bitmaps could be optimally shared/cached regardless of karaoke states (by simply making a single alpha value part of the key).
  • Regardless of this PR, glyph bitmaps could be shared for identical text that has some border and differs only in \be or \blur or fill-in flags.

To get rid of the phantom transparent ASS_Images that render_text now outputs for borders with \kf, render_glyph could skip images whose alpha is 255 (fully transparent). This could slightly help other cases, too, such as when the script has an explicit 255 alpha. On the other hand, if there is any client out there that wants to do something nontrivial with the ASS_Images it receives, such as drawing bounding boxes around them, it would then miss these images entirely, which seems undesirable. Alternatively, render_glyph could get an extra argument pointing to a second source bitmap, which would work well but require extra (easy) code changes in render_text.
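
The transparency test itself would be trivial. A hypothetical sketch (the real render_glyph plumbing differs; this only assumes the libass convention that ASS_Image colours are RGBA with alpha in the low byte):

#include <stdbool.h>
#include <stdint.h>

/* An ASS alpha byte of 255 means fully transparent, so an image with this
   alpha contributes nothing when blended and could be skipped outright. */
static bool is_fully_transparent(uint32_t rgba)
{
    return (rgba & 0xFF) == 0xFF; /* low byte of RGBA = alpha */
}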

astiob avatar Mar 04 '21 02:03 astiob

Also, after this PR, this code from https://github.com/libass/libass/pull/371#issuecomment-683486886:

if ((flags & FILTER_NONZERO_BORDER &&
     info->a_pre_fade[0] == 0 &&
     info->a_pre_fade[1] == 0 &&
     info->fade == 0) ||
    ...)
    flags |= FILTER_FILL_IN_BORDER;

technically starts to harm quality if \bord is small (smaller than a pixel), although I haven’t done any tests and wouldn’t be surprised if the difference was unnoticeable.

astiob avatar Mar 04 '21 03:03 astiob

And yes, as mentioned on IRC, this PR does have the issue that it requires divisions. Many divisions. Slow.™ It’s not obvious how to alleviate this, although to me it’s also not necessarily obvious that this is actually a problem.

astiob avatar Mar 04 '21 03:03 astiob

After a long discussion on IRC and some performance tests by @MrSmile, I’ve replaced the integer division code with float division, because it’s actually faster, with an extra slight twist to try and assist autovectorization (which is entirely impossible for integer division on x86). I’ve updated the first commit message accordingly. I’ve also used this opportunity to reduce noisy changes between the two commits.

I’ve also added a code comment where I previously posted a GitHub comment.

astiob avatar Mar 05 '21 18:03 astiob

Fixed the memory bug. (Good catch!)

By the way, 216-vertical-2252.png in ART will need to be regenerated: it exhibits pretty bad aliasing that is fixed by this PR. (But it’s close enough that it earns it a “BAD” success rather than an outright failure: that’s kinda nice.)

Haven’t redone render_bitmap yet, though. I’m still at a loss as to the best way to do it.

astiob avatar Apr 29 '21 23:04 astiob

My biggest question here is what this does to performance, since it potentially runs on quite a lot of data. In practice, are compilers smart enough to extract the blurred check? Does the use of floating-point division have any negative effects? Does this end up taking a significant amount of time on complex files now?

If it's fine on those fronts, I'm 👍 on this.

rcombs avatar Aug 12 '21 05:08 rcombs

what this does to performance […] are compilers smart enough to extract the blurred check?

Back when this was discussed on IRC, MrSmile ran tests and saw autovectorization kick in on amd64, which I think implies hoisting of the blurred check. (But I’ll note that autovectorization requires -O3, while we currently default to -O2. Maybe it’s time to revisit that default.) And even if it isn’t hoisted, (as MrSmile also noted,) the branch should be predicted correctly almost all the time. The only potential danger is if a compiler is smart enough to autovectorize in general but fails to hoist this check and therefore fails to vectorize.

Quoting the IRC log with the exact numbers below.

@MrSmile Do you remember what the 10s number quoted at the beginning was? Looking at the log now, I’m not sure what the baseline number is: 10s or 16s? … Oh, was it perhaps 10s vectorized and 16s non-vectorized? Does the old/current code vectorize at all?

18:39:39 < MrSmile> Okay, I managed to create sample for fix_outline()
18:40:26 < MrSmile> after PR #483: 10s -> 26s and fix_outline() 71% in perf
18:40:43 < MrSmile> mostly on idiv instruction
18:50:11 < MrSmile> table-based approach isn't much better: 21s and 63%
18:54:51 < rcombs> how's it do if you switch to float
19:03:41 < MrSmile> much better: 12s and 25%
19:06:37 < MrSmile> but I should probably disable -Ofast
19:09:28 < MrSmile> without -ffast-math there is no sane float -> int conversion
19:23:39 < MrSmile> for -O3 it's the same and looks like for -O2 vectorization is disabled altogether
19:26:44 < MrSmile> the fastest method (at least for x86 with AVX): o[x] = num / (65025.0001f - alpha_g * g[x]) + 0.5f

20:05:08 < Oneric> yes, how much does #483 with floats  impact performance for -fno-fast-math builds?
20:33:16 < MrSmile> 10s -> 12s, 8% -> 23% fix_outline()

00:23:32 < Oneric> I think float-div should be faster than int-div on all CPU-Archs with a hardware FPU
00:25:24 < Chortos-2> MrSmile: so how does it actually fare without the vectorization?
00:25:46 < MrSmile> with -O2?
00:26:26 < Oneric> non -ffast-math, -O2
00:30:48 < MrSmile> FP: 20s 60%, int: 26s 72%
00:32:30 < MrSmile> table: 18s 56%
00:33:58 < Oneric> The 256 or the full 256^2 table?
00:34:09 < MrSmile> 256
00:34:24 < MrSmile> with x2 error from ideal aproach
00:35:14 < MrSmile> o[x] = num * div_table[(65025 + 128 - alpha_g * g[x]) >> 8] >> 10;
00:36:22 < Oneric> For reference: what's the time before the patch with same compiler option? (-O2 (no fast-math) …)
00:43:42 < MrSmile> o[x] = (o[x] > g[x]) ? o[x] - (g[x] / 2) : 0  ===> 16s 46%
00:45:10 < MrSmile> o[x] = (o[x] > g[x]) ? o[x] - g[x] : 0  ===> 14s 36%
00:56:04 < Oneric> thanks. so for default-options and fix_outline heavy samples either
00:56:04 < Oneric> 62% penalty with only int
00:56:04 < Oneric> 25% but pretty exact (float)
00:56:05 < Oneric> 13% with some approximation and table
00:57:42 < Chortos-2> to be clear, is this latest float result with 65025.004f and +0.5f, or with the FFMAX(den,1) & all-integers-and-only-division-in-float approach I wondered about?
00:58:03 < MrSmile> 65025.004f
01:21:38 < MrSmile> o[x] = num / (float) FFMAX(den, 1) + 0.5f;  ===> 20.3s 61%
01:22:42 < Chortos-2> there's no need for the +0.5f if you add den/2 to num first
01:23:46 < MrSmile> o[x] = num / (float) FFMAX(den, 1);  ===> 19.5s 58%
01:24:59 < Chortos-2> overall, it looks like all of these float variants are about the same speed?
01:25:19 < Chortos-2> just the autovectorizability differs
01:26:29 < MrSmile> o[x] = (num + den / 2) / (float) FFMAX(den, 1);  ===> 20.8s 61%
01:28:34 < MrSmile> o[x] = num / (65025.004f - alpha_g * g[x]) + 0.5f;  ===> 19.9s 60%
01:29:01 < MrSmile> slightly better and also can vectorize

astiob avatar Aug 12 '21 21:08 astiob

But I’ll note that autovectorization requires -O3, while we currently default to -O2. Maybe it’s time to revisit that default.

If I'm not mistaken, -ftree-vectorize should be sufficient to auto-vectorise, assuming the instruction baseline contains vector operations. (For gcc this is currently enabled by default at -O3 but not -O2; not sure about clang.)

TheOneric avatar Aug 12 '21 21:08 TheOneric

Tentative bottom line, assuming that 10s/16s was resp. vectorized/unvectorized:

  • The consensus is that hardware float is faster than any equivalent int code in any case.
  • This PR slows down fix_outline in any case.
  • In a constructed heavy sample, full rendering takes extra 20–25% time.
  • Measured alone, fix_outline takes extra 60% time unvectorized (vs unvectorized old code) or extra 250% vectorized (vs vectorized old code, which must’ve been pretty fast indeed!).
  • The new vectorized code is faster than the old unvectorized code.
  • Vectorization helps immensely, so we should probably enable it in our default CFLAGS.

astiob avatar Aug 12 '21 22:08 astiob

Oh, was it perhaps 10s vectorized and 16s non-vectorized?

Yes, 10s for -Ofast and 16s for -O2.

But I’ll note that autovectorization requires -O3, while we currently default to -O2. Maybe it’s time to revisit that default.

I'm planning to create an assembly version when this PR lands, so it's not a strict necessity. On the other hand, I usually build libass with -Ofast and haven't encountered problems so far; maybe it's worth increasing the default optimization level, especially for non-x86 architectures.

MrSmile avatar Aug 13 '21 00:08 MrSmile

Oh man, I’ve just rediscovered that this was in my very first batch of patches that was my first attempt to contribute to libass :joy: Even before other things that I thought came earlier. And even listed as the very first item. And it still isn’t finished.

astiob avatar Aug 23 '22 00:08 astiob

But I’ll note that autovectorization requires -O3, while we currently default to -O2. Maybe it’s time to revisit that default

Update on this: since GCC 12, autovectorization is part of -O2 (with -fvect-cost-model=very-cheap, which is switched to -fvect-cost-model=dynamic at -O3). clang appears to also enable some forms of autovectorisation at -O2 already.

TheOneric avatar Oct 24 '22 15:10 TheOneric

Rebased on master and addressed two nits. (More substantial changes still pending.)

Meanwhile, we’ve run performance tests on several other CPUs. Full times of the profile binary built from this source code, run on this test script:

AMD Ryzen 7 5800H (Zen 3; released in early 2021), GCC 12.2.0 (Rev10, Built by MSYS2 project), Windows 11 (x86_64-w64-mingw32, UCRT):

For this test, I recompiled only ass_bitmap.c; the rest of libass used the default -O2. For ass_bitmap.c, all tests included -march=native.

  • vectorized master: 12.4 s
  • vectorized (& loop-unswitched) float: 14.1 s (-ffast-math didn’t show a significant difference)
  • nonvectorized master: 17.4 s
  • nonvectorized float: 22.1 s
  • (nonvectorized) integer: 24.9 s; statistical tests claim this is significantly different from float, but there was more variance in the measurements than I’d have liked, so I’m not entirely convinced this difference truly isn’t just an unfortunate coincidence

Despite vectorization being on by default in GCC 12, it doesn’t actually show any benefit here at -O2 until two additional flags (which are part of -O3) are enabled: -funswitch-loops -fvect-cost-model=dynamic. No other flags I tried showed any significant difference.

Yeah, turns out -funswitch-loops is only on at -O3 in GCC. So maybe I really should unswitch that blurred check manually… or (gasp) accept mathematical imperfection and get rid of the !blurred path altogether.
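
Roughly, manual unswitching would look like the sketch below. The numerators here are assumptions pieced together from the formulas quoted earlier in this thread, not the PR’s exact code; the point is the structure: the loop-invariant blurred test moves outside, leaving two branch-free loop bodies that remain candidates for autovectorization. (65025.004f rounds to the first representable float above 65025, since one single-precision ULP there is 2⁻⁸ ≈ 0.0039; that keeps the denominator strictly positive.)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static void fix_outline_unswitched(uint8_t *o, const uint8_t *g, size_t n,
                                   unsigned alpha_g, bool blurred)
{
    if (blurred) {
        for (size_t x = 0; x < n; x++) {
            float num = 255.0f * o[x] * (255 - g[x]); /* blurred numerator */
            o[x] = (uint8_t) (num / (65025.004f - alpha_g * g[x]) + 0.5f);
        }
    } else {
        for (size_t x = 0; x < n; x++) {
            float num = 65025.0f * (o[x] > g[x] ? o[x] - g[x] : 0); /* sharp */
            o[x] = (uint8_t) (num / (65025.004f - alpha_g * g[x]) + 0.5f);
        }
    }
}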

On another note, neither GCC nor Clang (according to Godbolt) ever tries using 32-by-16-bit integer division instructions in this code on x86; instead, they use 64-by-32-bit division. There’s a chance the shorter form would be faster, although I’m guessing (without knowing anything) maybe not on recent CPUs.

M1 Max (AArch64/ARMv8.5-A), Apple clang version 14.0.0 (clang-1400.0.29.202), arm64-apple-darwin22.3.0 (tests by @rcombs):

branchless int, -O3:       15.33s user 3.75s system 19.109 total
branching  int, -O3:       16.84s user 3.61s system 20.471 total

branching  int, -O2:       20.27s user 6.26s system 26.648 total
branchless int, -O2:       27.52s user 4.82s system 32.397 total

float, -O3:                 9.23s user 3.01s system 12.307 total
float, -O2 (vectorized):   11.82s user 3.76s system 15.620 total

float, -O3 -fno-vectorize: 18.51s user 4.51s system 23.056 total
float, -Os:                23.89s user 3.20s system 27.107 total
float, -O2 -fno-vectorize: 24.11s user 4.08s system 28.295 total

Elsewhere on the Internet, microbenchmarks have shown ARM64 to have slower float div than int div.

It is possible that these benchmarks are set up incorrectly, or perhaps they’re just not directly applicable to our use case because we do additional arithmetic here besides the division itself (even if not a lot)—or perhaps even because of the need to work around division by zero in one way or another, which adds instructions in the integer case but comes for free with floats.

~~Overall, it seems that float continues to fare better, or at least not worse, than integer division in all cases.~~ Actually, non-vectorized float on the M1 does seem slower than int? Whew, that restores a bit of my sanity.

On the other hand, we should probably have hand-written assembly code for this to avoid depending on finicky compiler optimizations that have an effect this big. And at that point, maybe it would be wiser, actually, to keep the C code based on pure integers? Or not?

We haven’t tried LUTs here (as @MrSmile did last time); maybe we should.

astiob avatar Feb 16 '23 02:02 astiob

I really should unswitch that blurred check manually…

Actually it's not a boolean parameter: there are blurs small enough to fall between true and false. I'm thinking of replacing the boolean with an interpolation parameter (0–255):

unsigned num = 255 * FFMAX(o[x] - g[x], 0);              // sharp-path numerator
num = 255 * num + (o[x] * (255 - g[x]) - num) * blurred; // lerp towards the blurred path

If there's no significant performance impact, I would prefer the continuous version over two discrete ones (it's also one asm function instead of two).

neither GCC nor Clang (according to Godbolt) ever try using 32-by-16-bit integer division instructions in this code on x86

It's executed on the same hardware anyway, so mostly instruction size matters in this case.

maybe it would be wiser, actually, to keep the C code based on pure integers?

There's a tendency to vectorize only float divisions, so I think a generic float version would be better.

I'll leave my old table version here in case anybody wants to experiment with it.

MrSmile avatar Feb 16 '23 13:02 MrSmile

Actually it's not a boolean parameter: there are blurs small enough to fall between true and false. I'm thinking of replacing the boolean with an interpolation parameter (0–255)

What would be the meaning of this parameter, and how would you compute it? I don’t know how to assign meaning to anything other than a Boolean.

It's executed on the same hardware anyway, so mostly instruction size matters in this case.

Old CPUs have distinctly lower timings for shorter-operand variants in Agner Fog’s tables. Recent CPUs, though, have them mostly the same (for all lengths except 128-div-64) or even a little worse for the 32-div-16 variant than for 64-div-32; maybe that’s why GCC and Clang aren’t using it.

astiob avatar Feb 16 '23 14:02 astiob

It has just occurred to me that there’s another opportunity to explore here. Most of the pixels have either g[x] == 255, where the entire pixel is covered by the glyph body and the outline is invisible, or g[x] == 0, where the pixel is fully outside the glyph body and contains only the border. In either case, division can be avoided entirely.

Code here. This does prevent any kind of vectorization, but it is almost as fast with integers as the vectorized float version and faster than current unvectorized master on my Ryzen:

  • vectorized float: 14.1 s
  • nonvectorized int with short-circuit branches avoiding division: 15.1 s
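
For reference, a rough sketch of that short-circuit, under the same assumed formulas as in the earlier unswitching sketch (num = 65025 · max(o − g, 0), den = 65025 − alpha_g · g) rather than the PR’s exact code. A nice side effect of the branches: whenever the division is reached, g[x] is at most 254, so den is at least 255 and no nextafter-style guard against division by zero is needed.

#include <stddef.h>
#include <stdint.h>

static void fix_outline_branchy(uint8_t *o, const uint8_t *g, size_t n,
                                unsigned alpha_g /* 0..255 */)
{
    for (size_t x = 0; x < n; x++) {
        if (g[x] == 0)
            continue;        /* pure border: num/den == o[x], nothing to do */
        if (g[x] == 255) {
            o[x] = 0;        /* fully covered by the glyph body: num == 0 */
            continue;
        }
        unsigned num = 65025u * (o[x] > g[x] ? o[x] - g[x] : 0);
        unsigned den = 65025u - alpha_g * g[x];   /* >= 255 here */
        o[x] = (uint8_t) ((num + den / 2) / den); /* rounded division */
    }
}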

astiob avatar Feb 16 '23 15:02 astiob

What would be the meaning of this parameter, and how would you compute it?

It would be "effective blurriness". I'm going to do some math to find a good approximation for it depending on blur radius, but we can use only 0 and 255 initially. Simplified lerp formula:

unsigned num = FFMAX(65025 * (o[x] - g[x]) + blurred * (255 - o[x]) * g[x], 0);
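
(Endpoint check, assuming the intended scaling: at blurred = 0 this reduces to the sharp path $65025\max(o[x]-g[x],0)$; at blurred = 255 the $g$ terms cancel, $65025(o-g)+255(255-o)g=65025\,o-255\,og=255\,o(255-g)$, which is the blurred path. Intermediate values blend the two linearly.)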

It has just occurred to me that there’s another opportunity to explore here.

Not sure that the more complicated source is worth the gain. I think it's better to postpone all optimizations until the assembly implementation. At the current stage we should settle on a good reference algorithm to simplify future assembly testing. Do the float and int branches produce exactly the same results?

MrSmile avatar Feb 16 '23 17:02 MrSmile

I’ve rerun some tests in a slightly more stable environment. I ran each test 30 times on the Ryzen. Most of the results still annoyingly failed normality tests, so take my significance tests with a ~~grain~~ pile of salt, but I tried dropping some (at worst as many as half) of the worst times to get the rest to resemble normality and analysing just them. Statistical tests keep saying all sets were significantly different, but I don’t trust that the smaller differences were actually caused by the changes in the code:

  • 11.3 s: unconditional vectorized division as in this PR, think (float) num / nextafter_den + 0.5f, with -fno-fast-math (using vdivps)
  • 11.4 s: conditional division with (float) (num + den / 2) / den
  • 11.7 s: conditional division with (float) num / den + 0.5f
  • 11.7 s: unconditional vectorized division as in this PR with -ffast-math (using vrcpps)
  • 13.2 s: conditional division with (num + den / 2) / den in integers

Note how integer division yet again continues to appear slower than floating-point division even when no vectorization is in play.


It would be "effective blurriness". I'm going to do some math to find a good approximation for it depending on blur radius

This is the part I don’t understand: what is the physical meaning of this? what is the reference value you’re going to try approximating? how?

If it’s just “what subjectively looks good”, in all honesty I expect the current blurred branch already probably looks plenty good in all cases. But if it’s something more physically profound, it isn’t obvious to me how anything better can be achieved here with a single parameter per image.

Not sure that the more complicated source is worth the gain. I think it's better to postpone all optimizations until the assembly implementation. At the current stage we should settle on a good reference algorithm to simplify future assembly testing.

I disagree. Surely we want the fastest code, not the simplest? We’ve already added the float stuff only for this reason, which IMO makes it more complicated: just look at the size of that comment explaining over65025 alone. And if the assembler code is gonna be fast, does that mean we can make the C code slow? In fact, shouldn’t we be doing the opposite and optimizing the C code for maximum performance—but on platforms where we aren’t gonna have any hand-written assembler code? Branches may be “slow”, but they’re surely faster than division, plus some platforms don’t even have hardware division.

By the way, if it seems scarily complicated due to the switch, I could also rewrite it using if (I figured the switch was clearer) and/or even keep the variable initializers unconditional (and hope the optimizer reorders them) and use an inline conditional in the assignment.

And given just how fast it is, maybe we don’t actually need any hand-written assembler code at all?

Then again, I suppose you could hand-write vectorized assembly that also performs this check against the whole vector: if it’s all zeros or all 255s, no division is necessary. That might be faster still.

Do float and int branch produce exactly the same results?

No, not with the nextafter(65025) to avoid the division by zero, which causes some quotients to be rounded towards zero in the float code but away from zero in the integer code. We may or may not be able to replace it by an added conditional somewhere like @rcombs tried in the inline comment above.

All pure-integer variants do match each other, including the conditional-division code I proposed in my last comment, and so does the conditional-division code using float for the division itself.

Speaking of rounding, it probably isn’t even optimal to always round to the nearest integer to begin with: ideally we’d round to the nearby integer that gives the more accurate value for the final colour after both images are blended together. But I haven’t tried doing the maths for this.

On another note, @rcombs reported another speedup (77% total time, or 1.3× total speed) on the M1 Max by doing all computations in float16 in the range 0‥1. Of course, this implies some precision loss, and given that we’re working with just 8 bits of precision to begin with, I’m slightly reluctant to accept it, but it’s probably entirely fine.

astiob avatar Feb 16 '23 20:02 astiob

On the M1:

vectorized float:   9.27s user 3.06s system 99% cpu 12.365 total
conditional int:   11.35s user 2.59s system 99% cpu 13.957 total
conditional float: 11.51s user 2.67s system 99% cpu 14.233 total

astiob avatar Feb 16 '23 21:02 astiob

For this test, I recompiled only ass_bitmap.c; the rest of libass used the default -O2. For ass_bitmap.c, all tests included -march=native.

fyi, gcc has an optimize function attribute to apply a different optimisation level, and supposedly also many of the -f… options, to a specific function only. Perhaps it’s worth using this to avoid tainting the result by changes in other bitmap functions? For affecting all following functions there’s also #pragma GCC optimize("O0") for gcc and #pragma clang optimize off for clang (see fuzz/fuzz.c). I don't know whether clang also supports optimize as a function attribute.
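
A minimal sketch of how that could look (GCC syntax; treat the exact option strings as assumptions to verify, and note that GCC’s documentation recommends the optimize attribute for experiments rather than production code):

#include <stddef.h>
#include <stdint.h>

/* GCC-only experiment: pin higher optimisation flags to this one function
   while the rest of the translation unit keeps the build's default level.
   Strings without a leading dash get "f" prepended ("tree-vectorize"
   becomes -ftree-vectorize); "O3" sets the level. */
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("O3", "unswitch-loops", "tree-vectorize")))
#endif
static void fix_outline_experiment(uint8_t *o, const uint8_t *g, size_t n)
{
    for (size_t x = 0; x < n; x++)
        o[x] = o[x] > g[x] ? o[x] - g[x] : 0; /* old subtractive body */
}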

TheOneric avatar Feb 17 '23 00:02 TheOneric

I suppose you could hand-write vectorized assembly that also performs this check against the whole vector: if it’s all zeros or all 255s, no division is necessary. That might be faster still.

FWIW I’ve now tried and measured this. With a simple 8-pixel handling loop (using AVX2 intrinsics) and with the added conditional skip on “all eight bytes = 0 or all eight bytes = 255”, the time has gone down to 10.4 s (vs 11.3–11.7 s for the unconditional vectorized code). And I won’t be surprised if it can be written better and faster still by someone who is better at SIMD assembly.
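
The uniform-run test itself doesn’t even need intrinsics. A sketch of the idea, under the same assumed formula as in the earlier sketches (so a fully covered run collapses to zero): load eight g bytes as one 64-bit word and compare against the two uniform patterns before falling through to the 8-wide division path.

#include <stdint.h>
#include <string.h>

/* Returns 1 and handles o[0..7] itself when all eight g bytes are a
   uniform 0 or 255; returns 0 to fall through to the vectorized
   division path. */
static int skip_uniform_run(uint8_t *o, const uint8_t *g)
{
    uint64_t v;
    memcpy(&v, g, sizeof v);
    if (v == 0)
        return 1;            /* pure border: o stays as-is */
    if (v == ~UINT64_C(0)) {
        memset(o, 0, 8);     /* glyph fully covers: numerators are 0 */
        return 1;
    }
    return 0;
}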

astiob avatar Feb 17 '23 02:02 astiob

the time has gone down to 10.4 s

It's mostly due to the synthetic nature of the sample. I suspect that in real-world cases with letters it would actually be worse (for small letters due to never hitting the uniform 0/255 case and for large letters due to branch mispredictions).

Actually, even the scalar 0/255 variant can be slower in real-world cases; better to test it first.

MrSmile avatar Feb 17 '23 02:02 MrSmile

Summarising 32bit ARM tests, mostly already discussed on IRC: (Skip to the conclusion if it’s too much text)

On all 32bit ARM except armv7ve (extended version of armv7-a), there exists no native instruction for integer division; instead a library routine gets called. Any ARM CPU with a hardware FPU provides an instruction for floating point division.

Since we already concluded an FPU is required to use libass at reasonable performance anyway, we probably don't need to be concerned about performance on software-float CPUs. At this point it might seem like fp would be better for 32bit ARM, since surely a native fp div instruction should be faster than a library routine implementing integer division... However, it turns out there is actually at least one common-ish CPU (+OS) for which this does not hold:

For all tests:

  • Compiler options applied to the whole build
  • for libass tests: times are in milliseconds and averages of several runs

32bit tests

  • Alpine Linux 3.17.2 "armv7" userland (armv7-a with hard-float ABI)
  • gcc 12.2.1_git20220924-r4
  • clang 15.0.7
  • profile settings: start=0, end=3, fps=10
  • per config identical profile binaries were used across all machines

Config options:

armv7-a+fp+neon_O3_clang        ./configure --enable-profile CC="clang -O3 -march=armv7-a+fp -mfpu=neon" CFLAGS="-O3 -g" --host="armv7-unknown-linux-musl"
armv7-a+fp+neon_O3_gcc          ./configure --enable-profile CC="gcc -O3 -march=armv7-a+fp+simd -mfpu=neon" CFLAGS="-O3 -g" --host="armv7-unknown-linux-musl"
armv7ve+fp+neon_O3_clang        ./configure --enable-profile CC="clang -O3 -march=armv7ve+fp -mfpu=neon" CFLAGS="-O3 -g" --host="armv7-unknown-linux-musl"
armv7ve+fp+neon_O3_gcc          ./configure --enable-profile CC="gcc -O3 -march=armv7ve+fp+simd -mfpu=neon" CFLAGS="-O3 -g" --host="armv7-unknown-linux-musl"

clang doesn't accept +simd; gcc docs state this enables "Advanced SIMD (Neon)"; -mfpu=neon should still ensure it is part of the arch baseline.


Average timings on the Cortex-A73 have a run-to-run variance of <30ms. For the A53 it’s a bit more, but not enough to change the relative order between int and float.

Cortex-A53 running 32bit (Raspberry Pi 3 B+) @ 1.4GHz, libass:

armv7-a+fp+neon_O3_clang_float          9173
armv7-a+fp+neon_O3_clang_int-noBranch   6861
armv7-a+fp+neon_O3_gcc_float            9416
armv7-a+fp+neon_O3_gcc_int-noBranch     7148

armv7ve+fp+neon_O3_clang_float          9096
armv7ve+fp+neon_O3_clang_int-noBranch   4924
armv7ve+fp+neon_O3_gcc_float            9413
armv7ve+fp+neon_O3_gcc_int-noBranch     5394

The results have been omitted, but even int_withBranching managed to beat float here — even without a native int div instruction!

lemire.me:

armv7ve+fp
    int32_constant  0.007 clock units
    int64           0.295
    float64         0.048
    int32           0.017 (!!)
    int32_viaf64    0.055
    float32         0.045
armv7-a+fp
    int32_constant  0.009 clock units
    int64           0.301
    float64         0.047
    int32           0.096 (!!)
    int32_viaf64    0.055
    float32         0.045

As in previous benchmarks, the results from the simple testcase don't replicate the ordering of actual ass_fix_outline implementations. Just this time it goes in the other direction for armv7-a.

Cortex-A73 running 32bit @ 1.8GHz

armv7-a+fp+neon_O3_clang_float          2360
armv7-a+fp+neon_O3_clang_int-noBranch   3125
armv7-a+fp+neon_O3_gcc_float            2594
armv7-a+fp+neon_O3_gcc_int-noBranch     3421

armv7ve+fp+neon_O3_clang_float          2388
armv7ve+fp+neon_O3_clang_int-noBranch   2154
armv7ve+fp+neon_O3_gcc_float            2597
armv7ve+fp+neon_O3_gcc_int-noBranch     2512

Notes:

  • We afterwards noticed both compilers never autovectorised the code. Looking over the docs again I discovered:

    If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=neon), note that floating-point operations are not generated by GCC’s auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.

  • Tests on Godbolt still showed no usage of q* or d* registers for the float division even with an additional -funsafe-math-optimizations parameter.

  • Looking at the disassembled __udivsi3 routine used in the armv7-a int implementations, it doesn't seem like it detects availability of the native instruction and just uses it.

  • Yes, the binaries do use hardware floating point instructions.

64bit test

For comparison, the same Cortex-A73 running an aarch64 build:

  • -O3 -g
  • clang 15.0.7
  • bionic libc
  • profile settings: start=0, end=10, fps=20

Cortex-A73 aarch64 @ 1.8GHz libass:

float:          13906
int_noBranch:   14945
int_withBranch: 19791

I didn't record disassembly for those binaries, but tests on Godbolt suggest no autovectorisation of the float div occurred. Perhaps the better float performance is due to fundamental differences between 32bit and 64bit ARM.

lemire.me:

    int32_constant  0.001 clock units
    int64           0.004
    float64         0.010
    int32           0.004
    int32_viaf64    0.010
    float32         0.004

Here too, the simple div-only-function benchmark from lemire does not correlate with the performance of the actual int/float-based ass_fix_outline.

Conclusion?

Cortex-A53’s FPU performs surprisingly poorly, falling behind even software integer division. The more recent Cortex-A73 fares better in 32bit mode, beating non-native int and almost matching native int. In 64bit mode float again outperforms (native) int on the A73 (no 64bit tests are easily possible atm for the A53). Both tested CPUs support aarch64.

Compilers apparently won't use NEON for autovectorisation by default due to incompatibilities with IEEE fp.

Ordering in simple float/int div benchmarks doesn't map to the same ordering in actual float/int-based ass_fix_outline.

The Cortex-A53 constitutes the first instance we've seen where float performs worse than int on real-world hardware with hardware float support. Still, in the end this will probably depend on the particular model even for 32bit-only CPUs.

As it can go either way, this probably doesn't help in deciding whether to use integers or float for the C implementation.

TheOneric avatar Feb 18 '23 16:02 TheOneric

Now also results for a genuine 32-bit-only CPU (Cortex-A9, ARMv7-a+fp+neon+sec+mp, no fp16), though I doubt this could handle video playback and non-trivial subs at 720p. Average timings for libass tests vary by about 50ms between sets of runs. Within a run set, sometimes a few runs took 100-250ms longer, presumably due to some service process(es) competing for CPU.

perf settings: start=0, end=3, fps=3

armv7-a+fp+neon_O3_clang_int-noBranch  2506.00
armv7-a+fp+neon_O3_gcc_int-noBranch    2733.33
armv7-a+fp+neon_O3_clang_float         3221.67
armv7-a+fp+neon_O3_gcc_float           3243.33

    short add: 2.729927
    short sub: 2.702403
    short mul: 3.256053
    short div: 6.704432
     long add: 1.623491
     long sub: 1.617539
     long mul: 2.691853
     long div: 5.008054
long long add: 2.876212
long long sub: 3.234222
long long mul: 5.526823
long long div: 8.787050
    float add: 3.226642
    float sub: 3.230616
    float mul: 3.672505
    float div: 5.923851
   double add: 3.224513
   double sub: 3.228489
   double mul: 3.766996
   double div: 8.625440

TheOneric avatar Feb 19 '23 01:02 TheOneric

Not directly related to the patch itself, but mentioning it here since all discussion about future default compiler flags also happened in this PR.

With checkasm --bench I just encountered an instance in libass where -O3 significantly regresses performance over -O2 (+ auto-vec) with gcc, and slightly with clang (but clang didn't perform as well to begin with):

Rough timing averages ± about one second

gcc 10
======
-O2 -g (default)        75s (does *not* include any auto-vectorisation)
-O3 -g -ffast-math      47s (with and without LTO)
-O3 -g                  48s
-O2 -ftree-vectorize    26s (default cost-model was "cheap")

gcc 12
======
-O2 -g (default)        30s (includes auto-vectorisation with cost-model "very-cheap")
-O2 -g -ftree-vectorize -fvect-cost-model=cheap
                        27s
-O2 -g -ftree-vectorize -fvect-cost-model=dynamic
                        25s
-O2 -g -ftree-vectorize -fvect-cost-model=dynamic -march=native -mtune=native
                        23s
-O3 -g                  51s
-O3 -g -march=native -mtune=native
                        49s

clang-15
========
-O2 -g (default)        34s
-O2 -g + LTO            32s
-O3 -g                  36s
-O3 -g -march=native -mtune=native
                        37s

To be fair, I haven't seen -O3 regress performance in actual profile runs before — but those all used assembly, so the perf-critical functions tested in checkasm weren't influenced by the compiler to begin with.

Full logs with timings of individual functions for gcc-12 -O(2|3) -g -fvect-cost-model=dynamic and clang-15 -O(2|3) -g in case anyone’s interested: 483_checkasm-bench-logs.tar.gz

TheOneric avatar Mar 19 '23 00:03 TheOneric

Ultimately, this just means this function is an even stronger candidate for explicit vectorization in assembly.

rcombs avatar Mar 19 '23 02:03 rcombs

It would be "effective blurriness". I'm going to do some math to find a good approximation for it depending on blur radius, but we can use only 0 and 255 initially.

@MrSmile Is this still something you’re willing to look into—is it worth implementing this in the code (C and assembly) now? I’m guessing it’s an additional minor slowdown, so it would be a bit of a waste to add it and not use it; but on the other hand, modifying the several implementations across C & assembly later also wouldn’t be too fun.

astiob avatar Apr 07 '24 22:04 astiob

I think for now it would be better to use the blur radius (well, we have two, so some average value like (blur_r2x + blur_r2y) / 2) directly instead of the boolean blurred. At first we can trigger one of the two implementations based on some separation value (I'll calculate the exact value later). In the future we can switch to a more continuous approach.

For the 1D case the exact solution would be something like this. Helper function: $$F(x)=\frac x2 + \frac x2\mathrm{erf}\left(\frac x{\sqrt{2}}\right)+\frac1{\sqrt{2\pi}}e^{-\frac{x^2}2}$$ Solve for $x$: $$F\left(x+\frac1r\right)-F(x)=\frac or$$ Then calculate the output value: $$F\left(x+\frac{1-g}r\right)-F(x)=\frac {o_1}r$$ Here $o$ and $g$ are input values, $o_1$ is the output, and $r$ is the blur radius. I'll try to devise some crude approximation for this later, but I think the PR can be merged without that.
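
(Note: $F$ is an antiderivative of the standard normal CDF: differentiating term by term, the two $e^{-x^2/2}$ contributions cancel and $F'(x)=\frac12+\frac12\mathrm{erf}\left(\frac x{\sqrt2}\right)=\Phi(x)$. So each equation above equates an interval average of Gaussian coverage with the observed input/output pixel values.)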

UPD: switching between blurred = false and blurred = true should occur at $\sigma\approx0.34$ (the standard deviation of the gaussian blur), and the maximal approximation error (in case the model assumptions hold) would be ≤ 32 in pixel values.

MrSmile avatar Apr 08 '24 03:04 MrSmile