Lianke Qin
I have the same question here.
> Ah, it is a little bit hard to rewrite this function in a parallel way (somehow linear).
>
> One potential temporary solution is to avoid adding many FpVars...
> AFAIU, the correctness guarantees of that function as written require it to be executed sequentially.
>
> There's a couple of things I'd like to understand:
>
> 1. How...
Yeah. And the performance suffers a lot from such a huge number of symbolic LCs. I thought constant * FpVar was almost free, but it turns out to...
Some microbenchmarking results: constant vector * witness vector, length 10000: CRS generation time ~60 seconds, proving time ~60 seconds, which is far from "almost free". witness vector *...
This is the microbenchmark I wrote: https://github.com/brucechin/vector-dot-product-microbench/tree/master
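To illustrate where the cost comes from, here is a minimal toy model of symbolic linear combinations (this is not the arkworks API; the `Lc` type and `const_dot_witness` function are made up for illustration). A constant * witness multiplication adds no constraint, but a constant-vector dot witness-vector still produces one LC whose term count equals the vector length, and every downstream use of that sum has to walk all of those terms:

```rust
// Toy model: a variable is an index, and an LC is a list of
// (coefficient, variable) terms. (Hypothetical types, not arkworks.)
type Var = usize;
type Lc = Vec<(u64, Var)>;

// Dot product of a constant vector with a witness vector: purely
// symbolic, yet the accumulated LC grows linearly with the input length.
fn const_dot_witness(consts: &[u64], vars: &[Var]) -> Lc {
    consts.iter().zip(vars).map(|(&c, &v)| (c, v)).collect()
}

fn main() {
    let n = 10_000;
    let consts: Vec<u64> = (0..n as u64).collect();
    let vars: Vec<Var> = (0..n).collect();
    let lc = const_dot_witness(&consts, &vars);
    // The "free" constant multiplications still leave a 10000-term LC
    // for constraint generation to process.
    println!("{}", lc.len()); // 10000
}
```

This is consistent with the benchmark numbers above: the per-term work is cheap, but with length-10000 vectors the symbolic bookkeeping dominates CRS generation and proving time.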
yeah, we're planning to implement it in the backend using shared memory, which I think is slower than NCCL
In mx.symbol.batchnorm_v1, the operator is a class, so I can add the NCCL communicator/cudaStream as class private variables; they only need to be initialized once. In mx.symbol.batchnorm, NNVM is introduced and...
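The "initialized once, reused on every call" pattern described above can be sketched as follows (a hedged toy in Rust, not MXNet code; `Comm` is a hypothetical stand-in for the NCCL communicator/stream handles):

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for a per-operator communicator handle
// (in real code this would wrap ncclComm_t / cudaStream_t).
struct Comm {
    id: u32,
}

// One-time initialization: the expensive setup runs on first use only,
// mirroring a stateful operator keeping the communicator in a member
// variable instead of re-creating it on every forward pass.
static COMM: OnceLock<Comm> = OnceLock::new();

fn get_comm() -> &'static Comm {
    COMM.get_or_init(|| {
        // Expensive setup (e.g. communicator init) would happen here, once.
        Comm { id: 42 }
    })
}

fn main() {
    let first = get_comm().id;
    let second = get_comm().id; // reuses the already-initialized handle
    assert_eq!(first, second);
}
```

With a stateless NNVM-style compute function there is no such member variable to hang the handle on, which is the difficulty the comment is pointing at; a stateful compute interface restores a place to keep it.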
btw, a deep neural net consists of more than one BN layer most of the time. I'm wondering how to ensure that the same layer in the DNN across different GPUs can...
I'll try the FStatefulCompute later, thanks for your suggestion.