Bert Maher
The base BERT model in https://arxiv.org/pdf/1810.04805.pdf uses 768 hidden features, 12 layers, 12 heads (which are also the defaults in `bert.py`), while the default configuration in the argparser of `__main__.py`...
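For concreteness, here's a minimal sketch of the BERT-base configuration described in the paper; the field names are illustrative, and `bert.py` / `__main__.py` may spell their flags differently.

```python
# Illustrative only: BERT-base hyperparameters from the paper.
# bert.py / __main__.py may use different flag or attribute names.
BERT_BASE_CONFIG = dict(
    hidden_size=768,         # hidden features
    num_hidden_layers=12,    # transformer encoder layers
    num_attention_heads=12,  # attention heads
)
```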
This model (NVIDIA's SSD) is available in pytorch/hub: https://pytorch.org/hub/nvidia_deeplearningexamples_ssd/. There's some evidence that it can be made much faster (up to 3x) with some graph optimizations: https://paulbridger.com/posts/video-analytics-deepstream-pipeline/
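For reference, loading the model follows the hub page above; treat this as a rough sketch of the unoptimized baseline rather than a benchmark harness (the dummy 300x300 input is my assumption of SSD300's expected input size).

```python
import torch

# Entry point as documented on the hub page linked above.
ssd_model = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_ssd")
ssd_model = ssd_model.cuda().eval()

with torch.no_grad():
    # Dummy batch of 300x300 images.
    detections = ssd_model(torch.randn(1, 3, 300, 300, device="cuda"))
```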
Inside Meta we're pushing to support llvm-15, so we need some minor API changes. The biggest difference since LLVM 11 is the move to opaque, untyped pointers; I've applied `LLVMContext::setOpaquePointers(false)` to work...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* __->__ #97028
* #97019

They're sort of independent of the rest of inductor, and this makes them a bit easier to find and...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #97028
* __->__ #97019

Lack of kwargs handling strikes again.

Differential Revision: [D44166740](https://our.internmc.facebook.com/intern/diff/D44166740/)

cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe...
While analyzing performance of tf32 gemm on A100, I found a surprising number of stalls on ldmatrix. Looking at the ttgir:

```
local_load
tt.dot
tt.dot
async_copy, etc...
local_load
```

suggested...
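For orientation, a tf32 gemm in Triton looks roughly like the tutorial-style kernel sketched below (a simplified sketch, not the exact kernel analyzed here): `tl.dot` on fp32 inputs uses tf32 tensor cores on A100, and the ttgir shown above is what the compiler produces from this kind of loop.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def tf32_matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                       stride_am, stride_ak, stride_bk, stride_bn,
                       stride_cm, stride_cn,
                       BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        # Each iteration loads A/B tiles (lowered to async_copy + local_load)
        # and feeds them to tl.dot (lowered to ldmatrix + mma with tf32 inputs).
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)

# Assumes M, N, K are multiples of the block sizes (no masking, for brevity).
M = N = K = 4096
a = torch.randn(M, K, device="cuda")
b = torch.randn(K, N, device="cuda")
c = torch.empty(M, N, device="cuda")
grid = (M // 64, N // 64)
tf32_matmul_kernel[grid](a, b, c, M, N, K,
                         a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                         c.stride(0), c.stride(1),
                         BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
```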
@Chillee noticed that using `atomic_add` in the backward of attention notably slows down the kernel, and in fact it's slower than "manually" doing `atomic_add` using inline assembly. The root cause...
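For context, the pattern at issue looks roughly like the sketch below; it's deliberately simplified rather than the real attention backward, just to show where `tl.atomic_add` stands in for a plain store when multiple programs accumulate into the same output tile.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scatter_accumulate(src_ptr, dst_ptr, n, DST_SIZE: tl.constexpr, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(src_ptr + offs, mask=mask, other=0.0)
    # Many programs add into the same destination slots (as dq/dk tiles do in
    # the attention backward), so a plain tl.store would race; tl.atomic_add
    # is used instead, and this atomic path is what turned out to be slow.
    tl.atomic_add(dst_ptr + (offs % DST_SIZE), x, mask=mask)

src = torch.randn(1 << 20, device="cuda")
dst = torch.zeros(1024, device="cuda")
grid = (triton.cdiv(src.numel(), 1024),)
scatter_accumulate[grid](src, dst, src.numel(), DST_SIZE=1024, BLOCK=1024)
```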
Fixes #ISSUE_NUMBER cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov