LV: Proper alignment of buffers for use with SIMD routines
VisVideo, VisDFT and VisAudio buffers need to be properly aligned in order for SIMD routines to work at maximum efficiency. MMX requires at least 8-byte alignment; SSE/SSE2 and NEON need 16; and finally AVX needs 32.
Neither malloc() nor C++'s new operator will automatically provide such a large alignment. We will have to create a visual_new_aligned() based on _aligned_malloc() on Windows and memalign() on POSIX. According to some sources, OS X does not have memalign(), but we can count on its malloc() always producing addresses at 16-byte boundaries.
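To make the idea concrete, here is a minimal sketch of such a wrapper. The names example_alloc_aligned() and example_free_aligned() are made up for illustration and are not libvisual API. It also uses posix_memalign() rather than memalign() on the POSIX side, on the assumption that it is the more widely available of the two; treat that as a substitution, not as what the plan above prescribes.

```c
#if !defined(_WIN32)
#define _POSIX_C_SOURCE 200112L   /* expose posix_memalign() */
#endif

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(_WIN32)
#include <malloc.h>               /* _aligned_malloc(), _aligned_free() */
#endif

/* Hypothetical helper: allocate `size` bytes at an address that is a
 * multiple of `alignment`. `alignment` must be a power of two and, for
 * posix_memalign(), a multiple of sizeof(void *). Returns NULL on failure. */
static void *example_alloc_aligned(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
#if defined(_WIN32)
    return _aligned_malloc(size, alignment);
#else
    void *ptr = NULL;
    if (posix_memalign(&ptr, alignment, size) != 0)
        return NULL;
    return ptr;
#endif
}

/* Hypothetical counterpart: memory from _aligned_malloc() must be released
 * with _aligned_free(), while posix_memalign() memory goes to free(). */
static void example_free_aligned(void *ptr)
{
#if defined(_WIN32)
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}
```

A caller would request, say, example_alloc_aligned(size, 32) for an AVX-friendly buffer and must pair it with example_free_aligned(), which is why a separate free function is needed at all.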
Objects allocated in C++ using new will need a custom allocator and the use of placement new.
Additionally, for buffers holding 2-dimensional data such as VisVideo, and where block operations such as blitting are performed row by row, every data row requires a similar alignment. Padding will need to be added at the end of each row to achieve this (while keeping the buffer contiguous).
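The row padding boils down to rounding each row's byte length up to the next multiple of the alignment. A rough sketch follows; example_align_up() and example_row_pitch() are hypothetical names, and "pitch" is used here in the generic sense of bytes from one row start to the next.

```c
#include <assert.h>
#include <stddef.h>

/* Round `value` up to the next multiple of `alignment`.
 * `alignment` must be a power of two. */
static size_t example_align_up(size_t value, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

/* Bytes from the start of one row to the start of the next, including
 * the padding needed so that every row starts on an aligned address
 * (provided the buffer itself starts on one). */
static size_t example_row_pitch(size_t width, size_t bytes_per_pixel,
                                size_t alignment)
{
    return example_align_up(width * bytes_per_pixel, alignment);
}
```

For instance, a 101-pixel-wide image at 3 bytes per pixel has 303-byte rows, which would be padded to 304 bytes for 16-byte alignment. Row y then starts at base + y * pitch, and the whole image still lives in one contiguous allocation of height * pitch bytes.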
Or, wrap the SIMD code in byte-by-byte head and tail loops: the head loop runs until the data is aligned (or exhausted), and the tail loop handles whatever is left over after the last full SIMD block. We need this anyway for small data sets.
For example, if a data set only contains 5 bytes, aligned or not, we can't feed it to SSE or AVX. (I wouldn't write new code in MMX nowadays; the routines we have remain valid, of course.)
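Roughly, the head/tail approach looks like the sketch below. It inverts every byte of a buffer in place, which is just a stand-in operation; example_invert_bytes() is a made-up name. The SSE2 path is only compiled when the compiler defines __SSE2__, otherwise a scalar loop takes its place so the structure can still be tried out on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define EXAMPLE_BLOCK 16  /* width of an SSE2 register in bytes */

/* Hypothetical routine: invert every byte of `buf` in place. */
static void example_invert_bytes(uint8_t *buf, size_t n)
{
    size_t i = 0;

    assert(buf != NULL || n == 0);

    /* Head: single bytes until buf + i is 16-byte aligned,
     * or until the data runs out (the small data set case). */
    while (i < n && ((uintptr_t)(buf + i) % EXAMPLE_BLOCK) != 0) {
        buf[i] = (uint8_t)~buf[i];
        i++;
    }

    /* Body: whole 16-byte blocks, now known to be aligned. */
    for (; i + EXAMPLE_BLOCK <= n; i += EXAMPLE_BLOCK) {
#if defined(__SSE2__)
        __m128i v = _mm_load_si128((const __m128i *)(buf + i));
        v = _mm_xor_si128(v, _mm_set1_epi8(-1));
        _mm_store_si128((__m128i *)(buf + i), v);
#else
        /* Scalar stand-in where SSE2 is not available. */
        for (size_t k = 0; k < EXAMPLE_BLOCK; k++)
            buf[i + k] = (uint8_t)~buf[i + k];
#endif
    }

    /* Tail: leftover bytes that do not fill a whole block. */
    for (; i < n; i++)
        buf[i] = (uint8_t)~buf[i];
}
```

With a 5-byte buffer the body loop never runs; the head and tail loops handle everything between them, so no separate small-data code path is needed.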
Also: LIBOIL / ORC.
We probably don't need SIMD for small data sets. The overhead associated with SIMD state saving/switching, and the fact that there's little data to compute, make the effort not really worth the trouble, I would think. Doing proper alignment is also quite a lot simpler than making routines account for any kind of starting address.
Did a few more checks on the Intarwebz; it's basically guaranteed by glibc and VC++ that malloc() returns 8-byte and 16-byte aligned pointers on 32-bit and 64-bit systems respectively.
http://msdn.microsoft.com/en-us/library/ycsb6wwf.aspx http://www.gnu.org/software/libc/manual/html_node/Aligned-Memory-Blocks.html
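For what it's worth, a pointer's alignment is easy to spot-check at run time, e.g. in a debug assertion at the top of a SIMD routine. The helper below is a sketch with a made-up name, example_is_aligned(). Note that checking one pointer returned by malloc() only shows what a particular call happened to return; it does not prove the guarantee described in the documentation above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if `ptr` is a multiple of `alignment`
 * bytes, 0 otherwise. `alignment` must be non-zero. */
static int example_is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr % alignment) == 0;
}
```

A routine that relies on aligned loads could then start with something like assert(example_is_aligned(buf, 16)) in debug builds.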
In regards to small data sets: Indeed no SIMD is needed here, but I do think that having separate _small and _large API calls would be a bit awkward ;-).
Added visual_mem_alloc_aligned() and visual_mem_free_aligned() in bcf319301dc173f1487a740f1b50f3bb86ec23ee