Support glam SIMD types
Some glam types have a SIMD counterpart. For example, there's Vec3, which is scalar, and Vec3A, which uses SIMD.
Currently, glamour only implements converting to the "normal" (meaning no A suffix) glam types.
For types where only a SIMD implementation exists, such as Vec4, this doesn't matter, as they are the "normal" type.
However, I'd argue that a 3D vector is probably the most used of all the types, and therefore it should be possible to use its SIMD counterpart.
From my understanding of the underlying code, it seems like the associated 3D vector type for the f32 Scalar implementation is also used to perform the calculations. This means that any operations using Vector3 in this crate would not be SIMD accelerated.
Again, this is how the code looks to me, I might be wrong :)
If I am wrong, it would however still be nice to support directly converting to Vec3A instead of needing to convert to Vec3 and then use glam's API.
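For illustration, the gap looks roughly like this. These are minimal stand-in structs, not the real glam/glamour types, and the conversion calls are only meant to mirror the shape of the APIs:

```rust
// Stand-ins for the three types involved (not the real crates).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Vector3 { x: f32, y: f32, z: f32 } // glamour's typed vector

#[derive(Clone, Copy, Debug, PartialEq)]
struct Vec3 { x: f32, y: f32, z: f32 } // glam's scalar 3D vector

#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(align(16))]
struct Vec3A { x: f32, y: f32, z: f32 } // glam's SIMD-friendly 3D vector

impl From<Vec3> for Vec3A {
    fn from(v: Vec3) -> Self { Vec3A { x: v.x, y: v.y, z: v.z } }
}

fn main() {
    let v = Vector3 { x: 1.0, y: 2.0, z: 3.0 };

    // Today: first convert to the scalar glam type...
    let scalar = Vec3 { x: v.x, y: v.y, z: v.z };
    // ...then hop through glam's API to reach the SIMD type.
    let simd = Vec3A::from(scalar);

    // The request: a direct Vector3 -> Vec3A conversion in glamour.
    assert_eq!(simd, Vec3A { x: 1.0, y: 2.0, z: 3.0 });
}
```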
I think you are basically correct, but there are a few challenges. The core problem is that Vec3A is both overaligned and oversized, which has a couple of implications, most of all that it cannot be compatible with the Vector3 API in glamour (no vec.x etc. fields can be exposed).
It would be possible to expose a Vector3A<T: Unit<Scalar = f32>> type that matches the requirements, though. The reason I haven't prioritized this is that the use case for overaligned Vec3 is actually not super convincing, at least in my use cases.
Basically:
- Autovectorization usually handles Vec3 operations just fine (without explicit SIMD intrinsics). If you have a use case where it really doesn't, I'm interested to hear about it.
- It only really matters when you're doing lots of Vec3 math, but that usually looks like performing some operation on a long list of Vec3s. If that list takes up a third more memory (16 bytes per element instead of 12, due to padding), cache effects start playing a role, so there's a balance between the speedup from SIMD acceleration and the slowdown from more frequent cache misses.
- All of this is extremely sensitive to CPU model and feature set. For example, enabling AVX stops the autovectorizer from penalizing unaligned loads, so this has a huge impact.
A lot of this is intuition, so I'm extremely open to input. 😄
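To make the "overaligned and oversized" point concrete, here's a sketch using stand-in structs that mirror the layouts glam uses on SIMD targets (not glam itself):

```rust
use std::mem::{align_of, size_of};

// Stand-ins mirroring glam's layouts on SIMD targets (not glam itself).
#[repr(C)]
struct Vec3 { x: f32, y: f32, z: f32 }

#[repr(C, align(16))]
struct Vec3A { x: f32, y: f32, z: f32 } // padded out to one SIMD register

fn main() {
    assert_eq!((size_of::<Vec3>(), align_of::<Vec3>()), (12, 4));
    assert_eq!((size_of::<Vec3A>(), align_of::<Vec3A>()), (16, 16));

    // Streaming through a long list costs a third more bytes per element:
    assert_eq!(size_of::<[Vec3; 1000]>(), 12_000);
    assert_eq!(size_of::<[Vec3A; 1000]>(), 16_000);
}
```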
> [...] most of all that it cannot be compatible with the Vector3 API in glamour (no vec.x etc. fields can be exposed).
glam implements Deref for SIMD types, so it should be possible.
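A rough sketch of that Deref trick, again with stand-in types (glam's actual deref target and layout guarantees differ; this only mirrors the idea of exposing `.x`/`.y`/`.z` on a SIMD-backed newtype):

```rust
use std::ops::Deref;

// Plain field struct that Deref exposes.
#[repr(C)]
struct Xyz { x: f32, y: f32, z: f32 }

// Stand-in for a SIMD register wrapper: 16 bytes, 16-aligned.
#[repr(C, align(16))]
struct Vec3A { data: [f32; 4] }

impl Deref for Vec3A {
    type Target = Xyz;
    fn deref(&self) -> &Xyz {
        // Sound here because both types are #[repr(C)], the first 12 bytes
        // of Vec3A are the x/y/z lanes, and Xyz's alignment (4) is weaker.
        unsafe { &*(self as *const Vec3A as *const Xyz) }
    }
}

fn main() {
    let v = Vec3A { data: [1.0, 2.0, 3.0, 0.0] };
    // Field access works through the deref, despite the SIMD-style layout.
    assert_eq!(v.x, 1.0);
    assert_eq!(v.z, 3.0);
}
```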
> [...] the use case for overaligned Vec3 is actually not super convincing, at least in my use cases.
According to the glam docs, Vec3A is mostly faster: "Despite this wasted space the SIMD implementations tend to outperform f32 implementations in mathbench benchmarks."
> Autovectorization usually handles Vec3 operations just fine (without explicit SIMD intrinsics). If you have a use case where it really doesn't, I'm interested to hear about it.
Currently, I'm still working on figuring out why glam itself is slower than my math implementation, so I don't have any concrete examples at the moment. This issue was meant more as a general thought, since I recently discovered that not all glam types use SIMD. Unfortunately, I don't know much about SIMD and therefore can't really answer your questions about it -.-