
Per-example & Phatgoose routing

Open oleksost opened this issue 11 months ago • 0 comments

🎯 Goal (What & Why)

When training experts in the so-called "embarrassingly parallel" fashion, I observed that per-example oracle routing (each sample is sent to its dedicated expert, which is assumed to be known) performs better than both full MoE fine-tuning (as in BTX) and training only the per-token router.

This motivates the need to experiment with:

  • per-example routing, where the router routes based on an aggregated token view per sample, so each token in a sample receives the same routing weights
  • a Phatgoose-style router (per-token routing, but with expert embeddings learned independently)

🚀 Execution Plan

Step 1: What is the smallest working version?

Implement a new _per_example_topk_routing function in the MixtureOfExpertMLP class for per-example routing.
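As a starting point, the per-example variant could look something like the sketch below: pool the token representations per sample, route once per sample, and broadcast the decision to every token. This is a minimal illustration, not the actual Fast-LLM `MixtureOfExpertMLP` API; the function name, tensor shapes, and mean-pooling choice are all assumptions.

```python
import torch

def per_example_topk_routing(hidden: torch.Tensor,
                             router_weight: torch.Tensor,
                             top_k: int = 2):
    """Route each *sample* (not each token) to its top-k experts.

    hidden: (batch, seq_len, hidden_size)
    router_weight: (num_experts, hidden_size)
    Returns routing weights and expert indices, both of shape
    (batch, seq_len, top_k): every token in a sample shares the
    same routing decision.
    """
    # Aggregate the token view per sample (mean pooling is one option).
    pooled = hidden.mean(dim=1)                        # (batch, hidden)
    logits = pooled @ router_weight.t()                # (batch, num_experts)
    scores, indices = torch.topk(logits, top_k, dim=-1)
    weights = torch.softmax(scores, dim=-1)            # (batch, top_k)
    # Broadcast the per-sample decision to every token position.
    seq_len = hidden.size(1)
    weights = weights.unsqueeze(1).expand(-1, seq_len, -1)
    indices = indices.unsqueeze(1).expand(-1, seq_len, -1)
    return weights, indices
```

Other per-sample aggregations (e.g. last-token or attention-weighted pooling) would slot into the same interface.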

For Phatgoose, we also need to implement a sigmoid gate for each expert; the routing embeddings in this case are learned independently for each expert on its own dedicated dataset. At inference time, we use the learned expert embeddings with a standard per-token router.
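The Phatgoose training/inference split could be sketched as follows: during expert training, each expert learns its own gating embedding through a per-token sigmoid gate; at inference, the independently learned embeddings are stacked into a standard per-token top-k router. Class and function names here are illustrative assumptions, not Fast-LLM internals.

```python
import torch

class SigmoidExpertGate(torch.nn.Module):
    """Trained jointly with a single expert on its dedicated dataset."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.embedding = torch.nn.Parameter(torch.zeros(hidden_size))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (..., hidden_size) -> per-token gate value in (0, 1)
        return torch.sigmoid(tokens @ self.embedding)

def per_token_router(tokens: torch.Tensor, gates, top_k: int = 2):
    """At inference, reuse the independently learned embeddings
    as rows of a standard per-token router matrix."""
    router = torch.stack([g.embedding for g in gates])  # (num_experts, hidden)
    logits = tokens @ router.t()                        # (..., num_experts)
    scores, indices = torch.topk(logits, top_k, dim=-1)
    return torch.softmax(scores, dim=-1), indices
```

Since each embedding is trained against only its own expert's data, some normalization of the stacked router rows may be needed at inference to keep the logits comparable across experts.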

📌 Acceptance Criteria (Must-Haves for Completion)

  • The feature must be functional and tested.
  • The implementation must be documented in practical terms.
  • No refactors unless directly necessary for feature completion.

🛠️ Project Management

  • [x] Assign the project to the Fast-LLM project.
  • [x] Set the Estimate field (in days) in the GitHub project.
  • [x] Use the Size field to categorize the PR size (Small/Medium/Large).
  • [ ] Assign an owner when opening the issue.

oleksost · Mar 10 '25 16:03