Preserve ordering of ava_async APIs between multiple threads
In the current prototype, `ava_async` means the API call returns as soon as it has been sent to the API server, even though the API server may not have received or executed it yet.
In a multi-threaded program, inter-thread synchronization can cause `ava_async` APIs to execute on the API server in a different (and incorrect) order from the one the guest issued them in.
To guarantee execution correctness, `ava_async` should preserve the ordering of these APIs between the guest library and the API server.
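For context, `ava_async` is attached per-API in the spec. Below is a rough sketch, assuming the prototype's Lapis-style syntax; the body is abbreviated and illustrative, with argument annotations omitted:

```c
/* Abbreviated, illustrative Lapis-style spec entry. ava_async tells the
 * generated guest stub to return as soon as the call has been handed to
 * the transport, without waiting for the API server to execute it. */
cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim, dim3 blockDim,
                             void **args, size_t sharedMem,
                             cudaStream_t stream) {
    ava_async;  /* fire-and-forget: no round trip to the API server */
    /* argument annotations omitted */
}
```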
Related issue: [wait for merge from ava-serverless].
I don't understand how this is possible. The communication channel is ordered and the dispatch is ordered, so reordering can only happen between operations on different threads, and those reorderings are allowed for async calls, since async calls are by definition allowed to return before synchronizing with the underlying library.
What am I missing in my understanding?
The mistake happens when the multi-threaded program makes additional assumptions on top of the API semantics.
For example, it may assume a CUDA kernel is already "enqueued" once the `cudaLaunchKernel` call returns. Relying on that, another thread can call `cudaEventRecord` and `cudaStreamWaitEvent` to wait for the kernel to finish; but since the kernel has not actually been enqueued on the stream yet, the second thread does not wait, and a subsequent `cudaMemcpy` can then corrupt the kernel's input CUDA memory (if the kernel has not finished).
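A concrete sketch of that race follows; the kernel, buffer, and variable names are illustrative, not from the original workload. On native CUDA this code is correct; it only breaks when `cudaLaunchKernel` is forwarded with `ava_async`:

```cpp
#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical kernel that reads d_in and writes d_out.
__global__ void consume(const float *d_in, float *d_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = d_in[i] * 2.0f;
}

void race(cudaStream_t stream, cudaStream_t copy_stream,
          float *d_in, float *d_out, const float *h_next,
          size_t bytes, int n) {
    std::mutex m;
    std::condition_variable cv;
    bool launched = false;

    std::thread a([&] {
        // Thread A: once cudaLaunchKernel returns, this thread assumes the
        // kernel is enqueued on `stream`. Under ava_async forwarding, the
        // API server may not have enqueued it yet.
        void *args[] = { &d_in, &d_out, &n };
        cudaLaunchKernel((void *)consume, dim3((n + 255) / 256), dim3(256),
                         args, 0, stream);
        { std::lock_guard<std::mutex> g(m); launched = true; }
        cv.notify_one();
    });

    std::thread b([&] {
        std::unique_lock<std::mutex> g(m);
        cv.wait(g, [&] { return launched; });
        // Thread B: records an event intended to come *after* the kernel.
        cudaEvent_t ev;
        cudaEventCreate(&ev);
        cudaEventRecord(ev, stream);
        cudaStreamWaitEvent(copy_stream, ev, 0);
        // If the server has not enqueued the kernel yet, the event lands
        // *before* it, the wait is a no-op with respect to the kernel, and
        // this copy can overwrite d_in while the kernel is reading it.
        cudaMemcpyAsync(d_in, h_next, bytes,
                        cudaMemcpyHostToDevice, copy_stream);
        cudaEventDestroy(ev);  // safe: resources released on completion
    });

    a.join();
    b.join();
}
```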
This error occurred while I was improving support for more complicated TensorFlow workloads. The work is in another private repo; I'll CC you on a related commit email.
Ok, I see. From a simplistic Lapis semantic perspective, this means that `cudaLaunchKernel` should NOT be `ava_async`, since it performs some action before returning.
It sounds like you are saying that we can get the ordering we need by enforcing some additional ordering on the execution of functions in the server. What ordering are you planning to enforce? I worry that enforcing an ordering will have a surprisingly high performance cost. I suspect that we should find a way to encode the need for this ordering in the spec and only enforce the ordering in cases where it is needed.
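For instance, one could imagine an annotation like the sketch below; `ava_ordered_by` is invented here purely for illustration and is not existing Lapis syntax:

```c
cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim, dim3 blockDim,
                             void **args, size_t sharedMem,
                             cudaStream_t stream) {
    ava_async;              /* may return before the server executes it */
    ava_ordered_by(stream); /* hypothetical: dispatch calls on the same
                               stream in the order the guest issued them */
}
```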
I agree that tracking and enforcing a complete ordering is a major performance challenge.
However, the basic idea I suggested to Hangchen is to keep a counter per stream. Every guest API call involving that stream increments the counter, and every API call we forward includes a snapshot of the counter value. The API server then enforces the same order the guest observed based on those counter values, delaying dispatch of any API call whose value is not yet contiguous.
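A minimal sketch of that scheme; `StreamSequencer` and its interface are my invention for illustration, not the actual AvA implementation:

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical server-side sequencer, one instance per stream. The guest
// stamps each forwarded call on a stream with the next value of a
// per-stream counter; the server dispatches a call only when its sequence
// number is the next contiguous one, parking any call that arrives early.
class StreamSequencer {
  public:
    // Block the worker until `seq` is the next expected number.
    void await_turn(uint64_t seq) {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [&] { return seq == next_; });
    }

    // Mark `seq` dispatched and wake any worker holding seq + 1.
    void done(uint64_t seq) {
        { std::lock_guard<std::mutex> lk(mu_); next_ = seq + 1; }
        cv_.notify_all();
    }

  private:
    std::mutex mu_;
    std::condition_variable cv_;
    uint64_t next_ = 0;  // next sequence number eligible for dispatch
};
```

On the guest side, a per-stream `std::atomic<uint64_t>` would supply the stamps (`seq = counter.fetch_add(1)`), so the steady-state cost is one atomic increment per call; the server only blocks a worker when calls genuinely arrive out of order.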