Windows implementation is racy
The Windows implementation currently works like this:
- Reserve a chunk of address space large enough to contain an aligned address where our allocation can fit
- Calculate an aligned address in the returned space
- Free the reserved address space
- Allocate memory at the calculated address
Between the last 2 steps, a different thread could reserve the freed address space and make the allocation fail. To avoid wasting memory, the reserved space somehow has to be freed. Maybe there's a way to free a part of the chunk of address space allocated with VirtualAlloc? That way, we could keep the space for the allocation and free the rest.
You can look at msvcrt source code for the correct Windows implementation.
Specifically, _aligned_malloc that calls _aligned_offset_malloc_base, both are in the source file align.c.
Here’s a comment from there that pretty much explains how it works:
/***
*
* |1|___6___|2|3|4|_________5__________|_6_|
*
* 1 -> Pointer to start of the block allocated by malloc.
* 2 -> Value of 1.
* 3 -> Gap used to get 1 aligned on sizeof(void *).
* 4 -> Pointer to the start of data block.
* 4+5 -> Data block.
* 6 -> Wasted memory at rear of data block.
* 6 -> Wasted memory.
*
*******************************************************************************/
This way _aligned_malloc() only calls malloc() once, therefore it’s thread safe.
Besides the thread safety, your current Windows implementation is very inefficient. It wastes lots of RAM. VirtualAlloc can only allocate large chunks of memory, the granularity varies between systems but it’s typically 64kb. So you’re spending at least 64kb per allocation.
Doesn't VirtualAlloc have page granularity (4 KB on most systems)?
Anyway, this crate was originally intended only for large allocations with large alignments, both in the order of hundreds of KB to a few MBs. I didn't want to have an overhead of up to 100% for the alignment padding (when size==alignment), so I opted for a solution that can potentially free up the memory only used for alignment (since this is always some whole number of pages).
Seeing how literally nobody uses this crate like that, I'd be fine with changing the implementation so that it handles smaller alignment values better, which the msvcrt implementation seems to do. Alternatively, we could copy Rust's own implementation as was suggested in https://github.com/jonas-schievink/aligned_alloc.rs/issues/1.
Doesn't VirtualAlloc have page granularity (4 KB on most systems)?
The size granularity is page so you aren’t wasting that much RAM, but the allocation granularity is more, 64kb on my system, so you’re wasting quite a lot of address space. Probably OK on a 64-bit OS but there’re still 32 bit systems out there, especially on mobile and embedded platforms (well, Windows no longer has a mobile one, but it still has embedded, e.g. Win10 IoT).
I didn't want to have an overhead of up to 100% for the alignment padding
That’s a valid concern. But I think that it’s not a good practice anyway allocate that many small objects. And for larger buffers, even for just 10-20 aligned elements, that overhead becomes very small relatively speaking.
we could copy Rust's own implementation
I’ve looked at the code you’ve linked. At the first sight, it looks very similar to what MS is doing in their CRT.