Fast-LLM
Changes for basic LLaDA-style diffusion masking support
✨ Description
Cleaned up the code a bit:
- Added a `Diffusion` config object, as we discussed
- Removed noise schedules for v1
- Moved the loss calculation to `head.py`, since the language-modeling loss is computed there
- Moved bidirectional attention to `preprocessing.py`, since that is where the attention mask is computed
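For reviewers unfamiliar with LLaDA, here is a minimal sketch of the masking forward process and the masked-token loss this PR is about. All names here (`llada_mask`, `masked_ce_loss`) are hypothetical and do not match Fast-LLM's actual config or `head.py` code; the `1/t` reweighting is the objective from the LLaDA paper, assumed to be what the loss in `head.py` implements:

```python
import torch
import torch.nn.functional as F


def llada_mask(tokens: torch.Tensor, mask_token_id: int, generator=None):
    """LLaDA-style forward process: sample a masking ratio t ~ U(0, 1)
    per sequence, then mask each token independently with probability t.
    Returns (corrupted tokens, boolean mask, t)."""
    batch, seq_len = tokens.shape
    t = torch.rand(batch, 1, generator=generator)               # per-sequence ratio
    mask = torch.rand(batch, seq_len, generator=generator) < t  # True = masked
    corrupted = torch.where(mask, torch.full_like(tokens, mask_token_id), tokens)
    return corrupted, mask, t


def masked_ce_loss(logits, targets, mask, t):
    """Cross-entropy on masked positions only, reweighted by 1/t as in the
    LLaDA objective (an assumption about this PR's loss computation)."""
    # logits: (batch, seq, vocab) -> cross_entropy expects (batch, vocab, seq)
    per_token = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    per_token = (per_token * mask) / t.clamp_min(1e-6)
    return per_token.sum() / mask.sum().clamp_min(1)
```

Unmasked positions contribute nothing to the loss, which is why the model only needs to predict the masked tokens at each diffusion step.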
This is of course still a WIP, but feel free to leave comments and suggestions.
These changes address this issue: https://github.com/ServiceNow/Fast-LLM/issues/208#issue-2950083282
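On the bidirectional-attention change: diffusion LMs like LLaDA attend over the whole sequence rather than causally, so the mask built in `preprocessing.py` drops the causal triangle. A hedged sketch (the helper name is hypothetical and not Fast-LLM's actual API):

```python
import torch


def attention_mask(seq_len: int, bidirectional: bool) -> torch.Tensor:
    """Boolean attention mask where True means 'may attend'.
    Causal: lower-triangular. Bidirectional (diffusion): all-True,
    since masked-token prediction conditions on both directions."""
    if bidirectional:
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
```

For `seq_len = 4`, the causal mask has 10 attendable pairs while the bidirectional one has all 16, which is the only behavioral difference the preprocessing step needs to switch on.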