Aaron Lee comments

Results: 8 comments of Aaron Lee

server: implement GLM-style MTP

Thanks all for the suggestions. Will definitely look to refactor into something nicer once correctness can be established. Right now, still trying to get the graph to compute. Turns out...

server: implement GLM-style MTP

I've gotten to the point where I can get the MTP head to output stuff, but managing the KV cache with an external call to a separate MTP graph adds an...

server: implement GLM-style MTP

On second thought, building a single augmented graph also doesn't work, because we need the main model's sampled token in the MTP subgraph. We could make some shortcut assumptions, like...

server: implement GLM-style MTP

This commit sort of works, in the sense that it outputs tokens, but:
- I can't guarantee that I didn't break things in the multi-slot case,
- the model seems...

server: implement GLM-style MTP

Okay, I believe this commit "works" in that the main model and MTP outputs both seem correct under my informal test conditions. The model is now about as coherent as...

server: implement GLM-style MTP

> Tried to run it in RP scenario (using Q4 quant), got from 0.07 to 0.11 acceptance rate on swipes (one time unexpectedly got 0.18) (t=0.8, min p 0.05, top...

server: implement GLM-style MTP

Upon a bit of testing on my end in RP/creative writing scenarios, I can't find any obvious issues in terms of correctness with the cache management of this prototype; I...

server: implement GLM-style MTP

> Is work on this still progressing in the background? If not, then what kind of work still remains to be done? Is it mainly cleanup and refactoring? If so,...