Oleksiy Ostapenko
Hello, I am interested in the downstream performance on the super-NI test tasks (0-shot). For the model downloaded from HF (https://huggingface.co/tloen/alpaca-lora-7b) I got 38 ROUGE-L points on the super-NI test tasks....
Possible bug: in ewc_in_rl.py, even though I set max_steps=100 (line 303), it still runs for many more steps
Hello, in the following code the results returned by `triton.ops.blocksparse.matmul` and `torch.einsum` do not match (please note that `layout` consists of all ones). My understanding is that both outputs should be the...
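A minimal numpy sketch of the equivalence being claimed: when the block-sparsity `layout` is all ones, every output block is active, so a block-sparse matmul should reduce to the dense product that `torch.einsum('ik,kj->ij', a, b)` computes. The shapes and the layout semantics here (layout indexing output blocks, as in triton's "sdd" mode) are illustrative assumptions, not the original code from this issue.

```python
import numpy as np

block = 4       # block size (assumed)
n_blocks = 2    # blocks per dimension (assumed)
rng = np.random.default_rng(0)
a = rng.standard_normal((n_blocks * block, n_blocks * block))
b = rng.standard_normal((n_blocks * block, n_blocks * block))
layout = np.ones((n_blocks, n_blocks), dtype=bool)  # all output blocks active

# Block-sparse matmul: compute only the output blocks marked in `layout`.
out = np.zeros_like(a)
for i in range(n_blocks):
    for j in range(n_blocks):
        if layout[i, j]:
            for k in range(n_blocks):
                out[i*block:(i+1)*block, j*block:(j+1)*block] += (
                    a[i*block:(i+1)*block, k*block:(k+1)*block]
                    @ b[k*block:(k+1)*block, j*block:(j+1)*block]
                )

dense = a @ b  # dense reference, same as einsum('ik,kj->ij', a, b)
print(np.allclose(out, dense))  # True
```

If the all-ones case disagrees in the actual triton kernel, the mismatch likely comes from something other than the sparsity pattern (dtype, transposition mode, or block size).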
# ✨ Description This PR makes several minor improvements to the SSM/Hybrid classes, adds functionality for loading and exporting Apriel SSM and hybrid SSM models (with the corresponding modeling.py classes), and adds `embeddings_lr_scale`...
# ✨ Description This draft PR addresses #242 by introducing a flexible, modular configuration system for hybrid model architectures. TODOs: - [ ] add more testing to make sure legacy...
# ✨ Description To better detect potential routing collapse and gain a better understanding of the routing distribution, we can track the average entropy and mutual information of the routing probabilities....
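A sketch of the two statistics this PR proposes to track (not the PR's actual implementation): for per-token routing probabilities over experts, the average per-token entropy measures how confident individual routing decisions are, and the mutual information I(expert; token) = H(mean routing) - mean per-token entropy measures how much routing varies across tokens. Low values of both at once are a signature of routing collapse.

```python
import numpy as np

def routing_stats(probs, eps=1e-9):
    """probs: [n_tokens, n_experts], each row a routing distribution."""
    # Average entropy of the per-token routing distributions.
    per_token_entropy = -(probs * np.log(probs + eps)).sum(axis=-1)
    avg_entropy = per_token_entropy.mean()
    # Entropy of the marginal (batch-averaged) routing distribution.
    marginal = probs.mean(axis=0)
    marginal_entropy = -(marginal * np.log(marginal + eps)).sum()
    # Mutual information between expert choice and token identity.
    mutual_info = marginal_entropy - avg_entropy
    return avg_entropy, mutual_info

# Collapsed routing: every token is sent to expert 0 with certainty.
collapsed = np.tile(np.array([[1.0, 0.0, 0.0, 0.0]]), (16, 1))
avg_h, mi = routing_stats(collapsed)  # both ~0: confident but collapsed

# Diverse routing: uniform marginal, confident per-token choices.
diverse = np.eye(4)[np.arange(16) % 4]
avg_h2, mi2 = routing_stats(diverse)  # avg entropy ~0, MI ~log(4)
```

In the collapsed case both statistics are near zero; healthy routing keeps the mutual information well above zero even when individual decisions are confident.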
# 🎯 **Goal (What & Why)** When training experts in the so-called "embarrassingly parallel fashion", I observed that per-example oracle routing performs better (each sample is sent to its dedicated...
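A small numpy sketch contrasting oracle routing with a learned soft router, as described above: when each expert is trained independently on its own task, oracle routing dispatches each example by its known task id instead of by router probabilities. All names here (linear experts, `task_ids`, the Dirichlet router) are illustrative assumptions, not the training setup from this issue.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, n = 8, 3, 6
experts = rng.standard_normal((n_experts, d, d))  # one linear expert each
x = rng.standard_normal((n, d))
task_ids = rng.integers(0, n_experts, size=n)     # known per-example expert

# Oracle routing: hard dispatch of each sample to its dedicated expert.
oracle_out = np.stack([x[i] @ experts[task_ids[i]] for i in range(n)])

# Learned routing: soft mixture of experts under router probabilities.
router_probs = rng.dirichlet(np.ones(n_experts), size=n)  # [n, n_experts]
mix_out = np.einsum('ne,ni,eij->nj', router_probs, x, experts)

print(oracle_out.shape, mix_out.shape)  # (6, 8) (6, 8)
```

The gap between these two dispatch rules is the quantity the observation above is about: oracle dispatch removes interference from experts the sample was never trained against.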
# ✨ Description Should be merged after GDN #392. Adds the KDA mixer from Kimi Linear. Note: for now this requires nightly triton and pytorch, see: https://github.com/fla-org/flash-linear-attention/blob/main/FAQs.md. TODOs: - [...
# ✨ Description Added Kimi Linear and gated DeltaNet layers in order to create checkpoints for vllm throughput benchmarking. This required updating transformers to a newer version (4.57.1). Inference with...
Mamba2 implementation as in Nemotron-H. This also uses the correct Mamba2 kernels. Motivation: - we want to use the same Mamba2 implementation as in vllm TODOs: - [x] added per...