candre23
> Model directly works 👍

Only partially. MS is using some new rope technique they're calling "longrope". As-is, LCPP will work ok for the first few gens but will then...
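For anyone curious what "longrope" actually changes: as I understand it, instead of one uniform RoPE stretch it applies a separate rescale factor to each rotary frequency. A minimal numpy sketch of that per-dimension idea (illustrative only, not llama.cpp's or Microsoft's code; the factor values are made up):

```python
import numpy as np

def rope_angles(pos, head_dim, base=10000.0, rescale=None):
    # Standard RoPE inverse frequencies, one per dimension pair.
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    # Longrope-style scaling (as I understand it): a *separate* factor per
    # dimension rather than one global stretch. Factors below are hypothetical.
    if rescale is not None:
        inv_freq = inv_freq / np.asarray(rescale)
    return pos * inv_freq

print(rope_angles(4096, head_dim=8))                                # vanilla RoPE
print(rope_angles(4096, head_dim=8, rescale=[1.0, 1.3, 2.7, 4.0]))  # per-dim scaled
```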
Another sparse MoE implementation: https://github.com/predibase/lorax They make a lot of claims that are big if true. But who doesn't these days? https://predibase.com/blog/lora-land-fine-tuned-open-source-llms-that-outperform-gpt-4
I tried a few more times, and it seems to be hard-crashing the machine and causing a reboot every time it fails now. Here's the full -v terminal output from shortly...
Based on the other issue explaining how lazy-unpickle works, I'm wondering if it's not recognizing the format of the 103b stacked models, and that's why it's not using that method...
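To spell out the distinction I mean (this is not mergekit's actual lazy-unpickle code, just the general idea): a memory-mapped load only pages tensors in as they're touched, while a plain torch.load deserializes the whole state dict into RAM at once, which is the behavior you'd expect if the checkpoint format isn't recognized.

```python
import torch

def load_tensor(ckpt_path, name):
    """Sketch of lazy vs. eager checkpoint access (not mergekit's code).

    mmap=True (torch >= 2.1) maps the file and only pages in the tensors that
    are actually accessed; the fallback path loads the entire state dict into
    RAM in one go, which would explain very high memory use on a big stacked
    merge. ckpt_path/name are placeholders.
    """
    try:
        state = torch.load(ckpt_path, map_location="cpu", mmap=True, weights_only=True)
    except (TypeError, RuntimeError):
        # Older torch or an unsupported file layout: fall back to eager loading.
        state = torch.load(ckpt_path, map_location="cpu", weights_only=True)
    return state[name]
```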
Having nothing to lose, I tried running this from within WSL (on the same machine) and the merge completed. Memory usage was still quite high - over 40GB and still...
Not sure if this is related to this issue specifically, but iQ3 quants of L3 are definitely broken right now. Strangely, iQ4 quants seem OK. Here are some PPL calcs I...
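For anyone sanity-checking the numbers: perplexity is just exp of the negative mean log-probability the model assigns to the ground-truth tokens. Quick sketch of the calculation itself; the logprob values below are made up for illustration, not my actual results.

```python
import math

def perplexity(token_logprobs):
    # PPL = exp(-mean(log p)) over the evaluated tokens. token_logprobs are the
    # natural-log probabilities the model gave each ground-truth token.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical: a healthy quant vs. one that badly misranks the right tokens.
print(perplexity([-1.8, -2.1, -1.5, -2.0]))   # lower PPL
print(perplexity([-4.9, -5.3, -4.7, -5.1]))   # much higher PPL
```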
> file: manual_update.bat placed in extensions folder

I ran this in the extensions directory and it successfully updated all the extensions that were out of date. However, after hitting the...
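For anyone wary of running a random .bat: what it presumably boils down to is looping over the extension checkouts and pulling each one. A rough Python equivalent (the paths are assumptions, not the script's actual contents):

```python
import pathlib
import subprocess

# Assumed layout: each extension lives in its own git checkout under
# text-generation-webui/extensions. Adjust the path to your install.
extensions_dir = pathlib.Path("text-generation-webui/extensions")

for ext in sorted(p for p in extensions_dir.iterdir() if (p / ".git").exists()):
    print(f"Updating {ext.name}...")
    # check=False so one broken extension doesn't abort the rest of the loop.
    subprocess.run(["git", "-C", str(ext), "pull", "--ff-only"], check=False)
```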
P40 weirdness seems to be even stranger than just "it's slow". I wanted to chart VRAM usage for different models at different prompt context sizes, and the results were... impossible?...
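In case the measurement itself is suspect: a simple way to log the numbers for a chart like this is to poll nvidia-smi after each model load / prompt. Sketch below; the surrounding model-and-context loop isn't shown and is up to you.

```python
import subprocess

def vram_used_mib(gpu_index=0):
    # Read current VRAM usage (MiB) for one GPU via nvidia-smi's query mode.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits", "-i", str(gpu_index)],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

# e.g. record usage after loading a model or processing a prompt of a given size
print(vram_used_mib(0))
```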
Ah, my apologies. I had no idea it was allocating memory for max context, regardless of how much context was actually being fed in. In retrospect, that perfectly explains what...
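For anyone else tripped up by the same thing: the KV cache is pre-allocated for the configured max context, so VRAM scales with the context setting, not with the prompt you actually send. A rough back-of-envelope (the model dimensions below are hypothetical, and exact numbers depend on the backend and cache dtype):

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V tensors, one pair per layer, allocated for the full max context.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 70B-class model (80 layers, 8 KV heads, head_dim 128), fp16 cache:
print(kv_cache_bytes(80, 8192, 8, 128) / 2**30, "GiB")  # 2.5 GiB at 8k max context
```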
I'm actually doing this in oobabooga, not exllama proper. My ooba install is up to date, but I have no clue if their implementation is up to date with your...