Turing support
Why is Ampere or Ada (RTX 3000 and RTX 4000 series) required to support this?
Turing (RTX 2000 series) has INT4 tensor cores.
Hi, Marlin does not use any INT4 tensor cores: 4-bit weights are decompressed on the fly and the actual computation is carried out in FP16. Turing is not supported because Marlin relies heavily on the cp.async instruction, which was introduced with compute capability 8.0 (Ampere). This instruction allows explicitly fetching global memory in the background while doing other work at the same time, which is crucial to reach peak performance in an FP16xINT4 setting. While you could probably reuse a fair amount of Marlin's work when writing a Turing kernel, some significant changes would likely be necessary.
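
To make the "decompress on the fly, then compute in floating point" part concrete, here is a rough host-side sketch in plain C++. It is not Marlin's code. The packing (eight 4-bit values per 32-bit word, lowest nibble first), the fixed offset of 8, the single per-row scale, and the names `dequant_int4` and `dot_int4` are all assumptions made for illustration. `float` stands in for FP16 so that it builds with an ordinary compiler. The cp.async overlap of memory fetches with compute is not shown, since that only exists on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: eight unsigned 4-bit weights per 32-bit word,
// lowest nibble first. Marlin's real layout is arranged differently.
inline float dequant_int4(std::uint32_t word, int idx, float scale) {
    // Extract nibble idx (0..7), then recentre from [0, 15] to [-8, 7].
    int q = static_cast<int>((word >> (4 * idx)) & 0xFu);
    return scale * static_cast<float>(q - 8);
}

// Dot product of one packed INT4 weight row with an activation vector.
// Weights are unpacked as they are consumed; no dequantized copy is stored.
// float is used here in place of FP16 (an assumption of this sketch).
inline float dot_int4(const std::vector<std::uint32_t>& packed,
                      const std::vector<float>& x, float scale) {
    assert(packed.size() * 8 == x.size());
    float acc = 0.0f;
    for (std::size_t w = 0; w < packed.size(); ++w) {
        for (int i = 0; i < 8; ++i) {
            acc += dequant_int4(packed[w], i, scale) * x[w * 8 + i];
        }
    }
    return acc;
}
```

The point of the sketch is only that the multiply-accumulate never sees an integer: each nibble becomes a floating-point value right before it is used, which is why INT4 tensor cores do not help here.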