Decklink: Use v210 in high bit-depth color path
- Update the Decklink consumer to use v210 as the pixel format when using high bit-depth colors
- Refactor the Decklink consumer by introducing a format strategy pattern that separates the code paths for 8bpc and 16bpc color frames
- Simplify bringing the "HDR" code path to feature parity with the 8bpc path
- Bump the SIMD instruction set requirement to AVX2 (I don't know how you feel about this)
- Better performance than 10RGBXLE
Perhaps important to mention here: the motivation for making this change was that 10RGBXLE doesn't work with all formats; for 4Kp30 and above, Decklink requires you to use v210.
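For anyone unfamiliar with the layout, here is a minimal scalar sketch of v210 packing (purely illustrative; the function names are hypothetical, and the actual consumer code in this PR does the equivalent work with AVX2 intrinsics and also handles the colour conversion). Every group of 6 pixels of 10-bit 4:2:2 YCbCr is packed into four 32-bit words, three 10-bit components per word, and rows are padded to 128-byte (48-pixel) boundaries:

```cpp
#include <cstdint>
#include <vector>

// Pack three 10-bit components into one little-endian 32-bit v210 word.
inline uint32_t pack3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & 0x3ff) | ((b & 0x3ff) << 10) | ((c & 0x3ff) << 20);
}

// Pack one row of 10-bit 4:2:2 YCbCr into v210 (width assumed to be a
// multiple of 6, which holds for the usual broadcast rasters).
std::vector<uint32_t> pack_v210_row(const uint16_t* y, const uint16_t* cb, const uint16_t* cr, int width)
{
    const int row_bytes = ((width + 47) / 48) * 128; // rows are padded to 48-pixel groups
    std::vector<uint32_t> row(row_bytes / 4, 0);

    for (int x = 0, w = 0; x < width; x += 6) {
        const int c = x / 2; // chroma index for this 6-pixel group
        row[w++] = pack3(cb[c],     y[x],      cr[c]);
        row[w++] = pack3(y[x + 1],  cb[c + 1], y[x + 2]);
        row[w++] = pack3(cr[c + 1], y[x + 3],  cb[c + 2]);
        row[w++] = pack3(y[x + 4],  cr[c + 2], y[x + 5]);
    }
    return row;
}
```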
I have no objection to requiring AVX2, especially as it has been around since 2013, which is longer than the version of OpenGL we require.
I'll mention that Intel kept making CPUs without AVX2 for quite a while. For example, this 10th-gen Pentium chip launched in 2020 has OpenGL 4.5 but no AVX2: https://www.intel.com/content/www/us/en/products/sku/199285/intel-pentium-gold-g6600-processor-4m-cache-4-20-ghz/specifications.html
Is it possible to have HDR require AVX2 while purely SDR workflows don't require it? I think it is safe to assume that anyone doing HDR is likely to have more modern hardware, as that often goes hand in hand with 4K image formats. That approach also would not break functionality for older hardware doing non-HDR workflows.
IMO we shouldn't link CPU requirements to what kind of video you're processing; e.g. you could be using an older PC as a server that doesn't have AVX, since Intel decided their 2020 Pentium CPUs must be crippled for no good reason other than market segmentation.
Also, this PR just changes the compile options to generate AVX2 for the whole program, not for any specific video mode.
Additionally, what if you want to run CasparCG on Arm Linux? This PR needs a fallback written in plain C++ without any intrinsics; autovectorization could help there. You could also do runtime CPU feature detection and only call the AVX2 version when AVX2 is available, if you want to keep that code.
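For what it's worth, one way to do that feature detection with GCC/Clang is function multiversioning; a minimal sketch below (the kernel is hypothetical and just stands in for the real conversion code). The compiler emits both variants plus a resolver that checks the CPU at load time and picks the AVX2 one when available. It is x86/ELF-only, so it doesn't solve the Arm question on its own:

```cpp
#include <cstddef>
#include <cstdint>

// "default" version: built for the baseline ISA, used when AVX2 is absent.
__attribute__((target("default")))
void narrow_to_10bit(const uint16_t* src, uint16_t* dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] >> 6; // drop the low 6 bits of a 16-bit sample
}

// AVX2 version: same source, but the compiler is allowed to use AVX2 here
// (hand-written _mm256_* intrinsics could live in this body instead).
__attribute__((target("avx2")))
void narrow_to_10bit(const uint16_t* src, uint16_t* dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] >> 6;
}

int main()
{
    uint16_t in[16] = {0xffff}, out[16] = {};
    narrow_to_10bit(in, out, 16); // resolved once per process to the best variant
    return out[0] == 0x3ff ? 0 : 1;
}
```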
I mostly agree with the comments, especially the point that native intrinsics mean we'll never be able to build for non-x86 architectures. I opted to solve this by using the SIMDe library on Linux. That way, the intrinsics are automatically compiled to whatever works on the target architecture: it works on arm64, and it also works if you want to make an amd64 build without the AVX2 instruction set, in case you for some reason want to deploy CasparCG on a system with the Pentium CPU mentioned above.
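To illustrate what the SIMDe route looks like (hypothetical kernel, not the actual code in this branch): you spell the AVX2 intrinsics with the `simde_` prefix, and SIMDe maps them to native AVX2 on x86-64 with AVX2 enabled, to NEON on arm64, or to plain C everywhere else.

```cpp
#include <simde/x86/avx2.h>
#include <cstddef>
#include <cstdint>

// Narrow 16-bit samples to 10 bits, 16 samples per iteration. On an AVX2
// build this compiles to the native instructions; elsewhere SIMDe provides
// a portable implementation with the same semantics.
void narrow_to_10bit(const uint16_t* src, uint16_t* dst, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        simde__m256i v = simde_mm256_loadu_si256(reinterpret_cast<const simde__m256i*>(src + i));
        v              = simde_mm256_srli_epi16(v, 6);
        simde_mm256_storeu_si256(reinterpret_cast<simde__m256i*>(dst + i), v);
    }
    for (; i < n; ++i) // scalar tail for the leftover samples
        dst[i] = src[i] >> 6;
}
```

Defining `SIMDE_ENABLE_NATIVE_ALIASES` before the include even lets existing `_mm256_*` spellings compile unchanged.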
The additional commits also add (almost complete) support for sub-regions and ports. Setting dest-x != 0 in a sub-region is not supported yet. Manually overriding the width and height of the region is also not supported, but that should be super simple to add; I've simply not had a need for it yet.
I was just going to suggest SIMDe, as it allows for performance-boosting intrinsics while keeping future platform flexibility. Finally starting to get 10-bit enabled in CasparCG is exciting.
To throw a small spanner into the discussions here: I am hoping this can make it into 2.5 soon, but once 2.5 is released I would like to start merging (into what will be 2.6) the GL compute shader bits I have been working on. I expect that will replace this CPU conversion code (feel free to give an argument for why it shouldn't), so this code and any build quirks it requires could be relatively short-lived.
> Feel free to give an argument for why it shouldn't
Here's a try: it could be less efficient because, to send it to the Decklink card, the frame would likely have to be copied back to system RAM after the OpenGL processing so the Decklink card can output it. It may be more efficient to just leave it in system RAM, where the Decklink card can get to it without extra copying. The format conversion done on the CPU could be just as fast as a plain copy on the CPU, because the CPU would be memory-bandwidth limited, leaving plenty of spare compute capacity to do the relatively trivial 8/10-bit conversion. This is assuming the Decklink card doesn't support direct DMA from GPU memory.
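Some rough numbers behind the bandwidth argument, as a back-of-envelope sketch only (assuming UHD 2160p60 and a 16 bit-per-channel RGBA source frame):

```cpp
#include <cstdio>

int main()
{
    constexpr double width = 3840, height = 2160, fps = 60;
    constexpr double src_frame = width * height * 4 * 2;    // 16bpc RGBA read: ~66 MB/frame
    constexpr double dst_frame = (width / 6) * 16 * height; // v210 write, 16 B per 6 px: ~22 MB/frame
    constexpr double gib = 1024.0 * 1024.0 * 1024.0;
    std::printf("read %.1f GiB/s, write %.1f GiB/s\n",
                src_frame * fps / gib, dst_frame * fps / gib);
    // About 3.7 GiB/s of reads plus 1.2 GiB/s of writes, roughly the same
    // traffic a plain copy would generate, which is why the conversion
    // itself is unlikely to be the bottleneck.
    return 0;
}
```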
> once 2.5 is released I would like to start merging (into what will be 2.6) the GL compute shader bits I have been working on. I expect that will replace this CPU conversion code (feel free to give an argument for why it shouldn't), so this code and any build quirks it requires could be relatively short-lived.
That sounds great to me. I've kept working on this because I've got a client that needs the support for ports and sub-regions with 10-bit output, and the ETA of a production-ready GPU compute path is still not really known.
> it could be less efficient because, to send it to the Decklink card, the frame would likely have to be copied back to system RAM after the OpenGL processing so the Decklink card can output it. It may be more efficient to just leave it in system RAM, where the Decklink card can get to it without extra copying.
There is no non-GPU path even today. The GPU compute shader will, if anything, reduce the amount of data copied from the GPU to the host.
> this is assuming the Decklink card doesn't support direct DMA from GPU memory
They support NVIDIA GPUDirect for Video and something called AMD Pinned Memory... but we don't, obviously.