Bert Maher
The base BERT model in https://arxiv.org/pdf/1810.04805.pdf uses 768 hidden features, 12 layers, 12 heads (which are also the defaults in `bert.py`), while the default configuration in the argparser of `__main__.py`...
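For concreteness, here's a minimal sketch of the BERT-base configuration described in the paper; the field names are illustrative, and `bert.py` / `__main__.py` may spell their flags differently.

```python
# Illustrative only: BERT-base hyperparameters from the paper.
# bert.py / __main__.py may use different flag or attribute names.
BERT_BASE_CONFIG = dict(
    hidden_size=768,         # hidden features
    num_hidden_layers=12,    # transformer encoder layers
    num_attention_heads=12,  # attention heads
)
```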
This model (NVIDIA's SSD) is available in pytorch/hub: https://pytorch.org/hub/nvidia_deeplearningexamples_ssd/. There's some evidence that it can be made much faster (up to 3x) with some graph optimizations: https://paulbridger.com/posts/video-analytics-deepstream-pipeline/
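For reference, loading the model follows the hub page above; treat this as a rough sketch of the unoptimized baseline rather than a benchmark harness (the dummy 300x300 input is my assumption of SSD300's expected input size).

```python
import torch

# Entry point as documented on the hub page linked above.
ssd_model = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_ssd")
ssd_model = ssd_model.cuda().eval()

with torch.no_grad():
    # Dummy batch of 300x300 images.
    detections = ssd_model(torch.randn(1, 3, 300, 300, device="cuda"))
```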
Inside Meta we're pushing to support llvm-15, so we need some minor API changes. The biggest difference since LLVM 11 is the move to opaque, untyped pointers; I've applied `LLVMContext::setOpaquePointers(false)` to work...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* __->__ #97028
* #97019

They're sort of independent of the rest of inductor, and this makes them a bit easier to find and...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #97028
* __->__ #97019

Lack of kwargs handling strikes again.

Differential Revision: [D44166740](https://our.internmc.facebook.com/intern/diff/D44166740/)

cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe...
While analyzing performance of tf32 gemm on A100, I found a surprising number of stalls on ldmatrix. Looking at the ttgir:

```
local_load
tt.dot
tt.dot
async_copy, etc...
local_load
```

suggested...
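For orientation, a tf32 gemm in Triton looks roughly like the tutorial-style kernel sketched below (a simplified sketch, not the exact kernel analyzed here): `tl.dot` on fp32 inputs uses tf32 tensor cores on A100, and the ttgir shown above is what the compiler produces from this kind of loop.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def tf32_matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                       stride_am, stride_ak, stride_bk, stride_bn,
                       stride_cm, stride_cn,
                       BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        # Each iteration loads A/B tiles (lowered to async_copy + local_load)
        # and feeds them to tl.dot (lowered to ldmatrix + mma with tf32 inputs).
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)

# Assumes M, N, K are multiples of the block sizes (no masking, for brevity).
M = N = K = 4096
a = torch.randn(M, K, device="cuda")
b = torch.randn(K, N, device="cuda")
c = torch.empty(M, N, device="cuda")
grid = (M // 64, N // 64)
tf32_matmul_kernel[grid](a, b, c, M, N, K,
                         a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                         c.stride(0), c.stride(1),
                         BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
```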
@Chillee noticed that using `atomic_add` in the backward of attention notably slows down the kernel, and in fact it's slower than "manually" doing `atomic_add` using inline assembly. The root cause...
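For context, the pattern at issue looks roughly like the sketch below; it's deliberately simplified rather than the real attention backward, just to show where `tl.atomic_add` stands in for a plain store when multiple programs accumulate into the same output tile.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scatter_accumulate(src_ptr, dst_ptr, n, DST_SIZE: tl.constexpr, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(src_ptr + offs, mask=mask, other=0.0)
    # Many programs add into the same destination slots (as dq/dk tiles do in
    # the attention backward), so a plain tl.store would race; tl.atomic_add
    # is used instead, and this atomic path is what turned out to be slow.
    tl.atomic_add(dst_ptr + (offs % DST_SIZE), x, mask=mask)

src = torch.randn(1 << 20, device="cuda")
dst = torch.zeros(1024, device="cuda")
grid = (triton.cdiv(src.numel(), 1024),)
scatter_accumulate[grid](src, dst, src.numel(), DST_SIZE=1024, BLOCK=1024)
```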
Fixes #ISSUE_NUMBER cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov