MathThread & Co over-/mis-using volatile variables instead of suitable atomics
There are a number of places where the code uses volatile variables where it should be using atomics (the comments even indicate that they should be atomic). The code will work almost all of the time, but when it breaks it will break in very strange ways: concurrent unsynchronised access to a volatile variable is still a data race, and therefore undefined behaviour. See J F Bastien - Deprecating volatile for the deep dive on this.
One attempt at fixing this can be found at https://github.com/Quansight/pnumpy/commit/d13448f5203a7105e510d0fdfbba300013a8adb3 . It still has the fundamental problem of assuming that volatile variables (even when manipulated "atomically") participate in any kind of instruction ordering; they do not constrain how the surrounding non-volatile reads and writes are ordered as seen from other threads. If we can work out roughly which usage pattern gives good performance (work stealing?), then I'd suggest we measure the effect of replacing the queue infrastructure with something like one of the folly queues. Implementing a complete home-grown solution is probably not worth the pain it will take to get completely right.
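To make the ordering point concrete, here is a minimal sketch of the kind of flag hand-off I mean, written with `std::atomic` and explicit release/acquire ordering. The names (`WorkSlot`, `produce`, `consume`, `run_handoff`) are hypothetical and not taken from the pnumpy sources; this only illustrates the pattern and is not a proposed patch.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names for illustration; not the actual pnumpy types.
struct WorkSlot {
    int payload = 0;                 // plain data, published via `ready`
    std::atomic<bool> ready{false};  // would have been: volatile bool ready;
};

// Producer side: write the data first, then publish it.
void produce(WorkSlot& slot, int value) {
    slot.payload = value;
    // release: the payload write above cannot be reordered after this store
    slot.ready.store(true, std::memory_order_release);
}

// Consumer side: wait for the flag, then read the data.
int consume(WorkSlot& slot) {
    // acquire: pairs with the release store, so once `ready` reads true
    // the payload write is guaranteed to be visible here
    while (!slot.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return slot.payload;
}

// Runs one hand-off between two threads and returns what the consumer saw.
int run_handoff(int value) {
    WorkSlot slot;
    std::thread producer([&slot, value] { produce(slot, value); });
    int seen = consume(slot);
    producer.join();
    return seen;
}
```

With `volatile bool ready` instead, the same code would contain a data race on both `ready` and `payload`, and nothing would stop the compiler or the CPU from making the payload write visible after the flag write. That is the failure mode that shows up rarely and is very hard to debug.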