Irhum Shafkat
In the Flux [docs](https://fluxml.ai/Flux.jl/stable/models/basics/), one of the ways in which a model can be constructed is shown as:

```julia
function linear(in, out)
  W = randn(out, in)
  b = randn(out)
  x -> W*x .+ b
end
```
The Stacks structure introduced in this package (https://chengchingwen.github.io/Transformers.jl/dev/stacks/) is versatile enough that any multi-input, multi-output model in the Julia ecosystem could potentially benefit from it. Opening this issue to suggest that it...
The paper mentions that for VGG-like training, a pretrained model was used. Could a link be provided for the checkpoint file of the pretrained model so the `vgglike-sbp.py` experiment can...
When using `torch.compile`, we observe the following graph breaks at all TransformerEngine components. This appears to lead to a large number of lookups by TorchDynamo for each subgraph, resulting in...
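For reference, a minimal sketch of how the breaks can be counted and inspected, assuming PyTorch >= 2.1's `torch._dynamo.explain(fn)(*args)` convention; the plain `nn.Linear` here is only a stand-in for the TransformerEngine layer that actually triggers the breaks:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in layer; replace with the TransformerEngine component
        # (e.g. transformer_engine.pytorch.Linear) to reproduce the report.
        self.proj = nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.proj(x))

model = Block()
x = torch.randn(8, 64)

# explain() runs Dynamo tracing and reports how many subgraphs were produced
# and the reason for each graph break.
explanation = torch._dynamo.explain(model)(x)
print(explanation)  # summary with graph count, break count, and break reasons
```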
The original splits over UniRef50 can be found in the [repo](https://github.com/facebookresearch/esm). Using a subset of them, we need to compute either:
* the randomized masking perplexity
* the pseudo-perplexity with...
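For reference, a rough sketch of the pseudo-perplexity computation (mask one position at a time, score the true residue, exponentiate the mean negative log-likelihood), assuming the fair-esm API from the repo above; the `esm2_t6_8M_UR50D` checkpoint and the example sequence are only placeholders:

```python
import math
import torch
import esm  # fair-esm package from the linked repo

# Small checkpoint used purely for illustration; swap in the model under evaluation.
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

def pseudo_perplexity(seq: str) -> float:
    """Mask one residue at a time, score the true token, exponentiate the mean NLL."""
    _, _, tokens = batch_converter([("seq", seq)])  # shape (1, L+2), with BOS/EOS
    nll, n = 0.0, 0
    with torch.no_grad():
        for i in range(1, tokens.shape[1] - 1):  # skip BOS/EOS positions
            masked = tokens.clone()
            true_idx = masked[0, i].item()
            masked[0, i] = alphabet.mask_idx
            logits = model(masked)["logits"]  # (1, L+2, vocab)
            log_probs = torch.log_softmax(logits[0, i], dim=-1)
            nll -= log_probs[true_idx].item()
            n += 1
    return math.exp(nll / n)

print(pseudo_perplexity("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```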