Oleksiy Ostapenko
Hello, I am interested in the downstream performance on the super-NI test tasks (0-shot). For the model downloaded from HF (https://huggingface.co/tloen/alpaca-lora-7b) I got 38 ROUGE-L points on the super-NI test tasks....
Possible bug: in ewc_in_rl.py, even though I set max_steps=100 (line 303), it still runs for many more steps
Hello, in the following code the results returned by `triton.ops.blocksparse.matmul` and `torch.einsum` do not match (please note that `layout` consists of all ones). My understanding is that both outputs should be the...
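A minimal numpy sketch of the equivalence being claimed: when the block-sparsity `layout` is all ones, every output block is active, so a block-sparse matmul should reduce to the dense product that `torch.einsum('ik,kj->ij', a, b)` computes. The shapes and the layout semantics here (layout indexing output blocks, as in triton's "sdd" mode) are illustrative assumptions, not the original code from this issue.

```python
import numpy as np

block = 4       # block size (assumed)
n_blocks = 2    # blocks per dimension (assumed)
rng = np.random.default_rng(0)
a = rng.standard_normal((n_blocks * block, n_blocks * block))
b = rng.standard_normal((n_blocks * block, n_blocks * block))
layout = np.ones((n_blocks, n_blocks), dtype=bool)  # all output blocks active

# Block-sparse matmul: compute only the output blocks marked in `layout`.
out = np.zeros_like(a)
for i in range(n_blocks):
    for j in range(n_blocks):
        if layout[i, j]:
            for k in range(n_blocks):
                out[i*block:(i+1)*block, j*block:(j+1)*block] += (
                    a[i*block:(i+1)*block, k*block:(k+1)*block]
                    @ b[k*block:(k+1)*block, j*block:(j+1)*block]
                )

dense = a @ b  # dense reference, same as einsum('ik,kj->ij', a, b)
print(np.allclose(out, dense))  # True
```

If the all-ones case disagrees in the actual triton kernel, the mismatch likely comes from something other than the sparsity pattern (dtype, transposition mode, or block size).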
# ✨ Description This PR makes several minor improvements to the SSM/Hybrid classes, adds functionality for loading and exporting Apriel SSM and hybrid SSM models (with the corresponding modeling.py classes), and adds `embeddings_lr_scale`...
# ✨ Description This draft PR addresses #242 by introducing a flexible, modular configuration system for hybrid model architectures. TODOs: - [ ] add more testing to make sure legacy...
# ✨ Description To better detect potential routing collapse and gain a better understanding of the routing distribution, we can track the average entropy and mutual information of the routing probabilities....
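A sketch of the two statistics this PR proposes to track (not the PR's actual implementation): for per-token routing probabilities over experts, the average per-token entropy measures how confident individual routing decisions are, and the mutual information I(expert; token) = H(mean routing) - mean per-token entropy measures how much routing varies across tokens. Low values of both at once are a signature of routing collapse.

```python
import numpy as np

def routing_stats(probs, eps=1e-9):
    """probs: [n_tokens, n_experts], each row a routing distribution."""
    # Average entropy of the per-token routing distributions.
    per_token_entropy = -(probs * np.log(probs + eps)).sum(axis=-1)
    avg_entropy = per_token_entropy.mean()
    # Entropy of the marginal (batch-averaged) routing distribution.
    marginal = probs.mean(axis=0)
    marginal_entropy = -(marginal * np.log(marginal + eps)).sum()
    # Mutual information between expert choice and token identity.
    mutual_info = marginal_entropy - avg_entropy
    return avg_entropy, mutual_info

# Collapsed routing: every token is sent to expert 0 with certainty.
collapsed = np.tile(np.array([[1.0, 0.0, 0.0, 0.0]]), (16, 1))
avg_h, mi = routing_stats(collapsed)  # both ~0: confident but collapsed

# Diverse routing: uniform marginal, confident per-token choices.
diverse = np.eye(4)[np.arange(16) % 4]
avg_h2, mi2 = routing_stats(diverse)  # avg entropy ~0, MI ~log(4)
```

In the collapsed case both statistics are near zero; healthy routing keeps the mutual information well above zero even when individual decisions are confident.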
# 🎯 **Goal (What & Why)** When training experts in the so-called "embarrassingly parallel fashion", I observed that per-example oracle routing performs better (each sample is sent to its dedicated...
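A small numpy sketch contrasting oracle routing with a learned soft router, as described above: when each expert is trained independently on its own task, oracle routing dispatches each example by its known task id instead of by router probabilities. All names here (linear experts, `task_ids`, the Dirichlet router) are illustrative assumptions, not the training setup from this issue.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, n = 8, 3, 6
experts = rng.standard_normal((n_experts, d, d))  # one linear expert each
x = rng.standard_normal((n, d))
task_ids = rng.integers(0, n_experts, size=n)     # known per-example expert

# Oracle routing: hard dispatch of each sample to its dedicated expert.
oracle_out = np.stack([x[i] @ experts[task_ids[i]] for i in range(n)])

# Learned routing: soft mixture of experts under router probabilities.
router_probs = rng.dirichlet(np.ones(n_experts), size=n)  # [n, n_experts]
mix_out = np.einsum('ne,ni,eij->nj', router_probs, x, experts)

print(oracle_out.shape, mix_out.shape)  # (6, 8) (6, 8)
```

The gap between these two dispatch rules is the quantity the observation above is about: oracle dispatch removes interference from experts the sample was never trained against.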
# ✨ Description Should be merged after GDN #392. Adds the KDA mixer from Kimi Linear. Note: for now this requires nightly triton and pytorch, see: https://github.com/fla-org/flash-linear-attention/blob/main/FAQs.md. TODOs: - [...
# ✨ Description Added Kimi Linear and gated DeltaNet layers in order to create checkpoints for vllm throughput benchmarking. This required updating transformers to a newer version (4.57.1). Inference with...
Mamba2 implementation as in Nemotron-H. This also uses the correct Mamba2 kernels. Motivation: - we want to use the same Mamba2 implementation as in vllm TODOs: - [x] added per...