20 comments by Benjamin Marie

Yes, it works. The requirements must be updated; I expect new installations of Axolotl to fail until this is fixed.

Here is my model (Llama 3.1 8B):
```
PeftModelForCausalLM(
  (base_model): XLoraModel(
    (lora_model): LoraModel(
      (model): LlamaForCausalLM(
        (model): LlamaModel(
          (embed_tokens): Embedding(128256, 4096)
          (layers): ModuleList(
            (0-31): 32 x LlamaDecoderLayer(
              (self_attn): LlamaFlashAttention2(
                (q_proj): lora.Linear(...
```
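For context, this is roughly how a model with that structure can be assembled, assuming PEFT's X-LoRA support (XLoraConfig + get_peft_model). It is only a sketch: the exact base checkpoint, adapter paths, and xlora_depth are placeholders, and parameter names may differ slightly across PEFT versions.
```
import torch
from transformers import AutoConfig, AutoModelForCausalLM
from peft import XLoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.1-8B"  # assumption: the exact base checkpoint may differ
base_config = AutoConfig.from_pretrained(model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # matches the LlamaFlashAttention2 modules above
)

xlora_config = XLoraConfig(
    task_type="CAUSAL_LM",
    hidden_size=base_config.hidden_size,  # 4096 for Llama 3.1 8B
    xlora_depth=8,  # placeholder
    adapters={
        "adapter_1": "./path/to/adapter_1",  # placeholder adapter checkpoints
        "adapter_2": "./path/to/adapter_2",
    },
)

xlora_model = get_peft_model(base_model, xlora_config)
print(xlora_model)  # prints a module tree like the one quoted above
```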

I added this code:
```
print(xlora_model.print_trainable_parameters())
print("--- Require grad? ----")
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)
print("----------------------")
```
It prints:
```
trainable params: 118,372,800 || all params: 8,148,634,048...
```
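The same numbers can be cross-checked without the PEFT helper, with a plain PyTorch count (a sketch; `model` here is the same X-LoRA-wrapped model as above):
```
# Recompute the trainable/total parameter counts directly from the module's parameters.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} || all params: {total:,}")
```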

Same issue with the GPTQ versions of Qwen3-30B-A3B.

The "import from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM" also fails with the same error message. But my installation command is much simpler: pip install intel-extension-for-transformers

That's very interesting and very good news! Thank you for digging into this. Is this with the HF backend? I usually run vLLM since it is much faster. Maybe it...

I wonder whether the problem is with vLLM rather than with lm_eval. I'll do some more tests.
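A sketch of the kind of test I have in mind, assuming lm-evaluation-harness's Python API (lm_eval.simple_evaluate); the model id and task below are placeholders, not the exact setup discussed:
```
import lm_eval

# Run the same task once with the HF backend and once with vLLM
# to see whether the scores diverge. Model id and task are placeholders.
for backend in ("hf", "vllm"):
    results = lm_eval.simple_evaluate(
        model=backend,
        model_args="pretrained=Qwen/Qwen3-30B-A3B",  # placeholder model id
        tasks=["mmlu"],                              # placeholder task
    )
    print(backend, results["results"])
```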

Interesting, I didn't know that. But I don't think it matters; I would be surprised if TRL used FSDP's reduce-scatter for single-GPU training.
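One quick way to sanity-check this: FSDP's reduce-scatter is a collective op, so it can only come into play if a torch.distributed process group has been initialized. A minimal sketch, to be run from inside the training script (not verified on TRL here):
```
import torch.distributed as dist

# On a plain single-GPU run with no process group, this should print False,
# which would rule out any reduce-scatter collective (assumption, not verified).
print(dist.is_available() and dist.is_initialized())
```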

Sure, it's all in the notebook I linked to in my first post. I ran this notebook on Colab with the A100.

Yes: ![image](https://github.com/user-attachments/assets/d25e7b20-7551-47fe-b758-01f750636738) This configuration uses fp32 and adamw_torch.
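For reference, a minimal sketch of a TrainingArguments setup matching what the screenshot describes (full-precision fp32 training with the adamw_torch optimizer); everything else here is a placeholder, not taken from the screenshot:
```
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./output",           # placeholder
    per_device_train_batch_size=8,   # placeholder
    optim="adamw_torch",             # the optimizer mentioned above
    fp16=False,                      # both disabled -> training runs in full fp32
    bf16=False,
)
```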