Stella Biderman comments

Results: 284 comments of Stella Biderman

Is this compatible with DeepSpeed / ZeRO?

@zhuzilin Yeah, DeepSpeed’s garbage collection is, well, garbage. We actually have an auxiliary tool `tools/kill_all.sh` that kills all DeepSpeed processes across all connected machines because of how common it is...

Is this compatible with DeepSpeed / ZeRO?

@zhuzilin I've set aside time to add support for muP to GPT-NeoX this week and would love to check out your code. Where can I find it? Perhaps you can...

is fused layernorm really better?

Reposting from chat for documentation: EleutherAI found the same and removed fused layernorm from GPT-NeoX: https://github.com/EleutherAI/gpt-neox/pull/428

Add checks to confirm that the checkpoint conversion script works perfectly correct

`git clone https://huggingface.co/bigscience/gpt2-350m-en/tree/megatron-deepspeed` is failing. It says repo not found. I can download the HF version without issue.

Add checks to confirm that the checkpoint conversion script works perfectly correct

> @StellaAthena Have you made progress with this issue? If not, perhaps I'll take a jab at it!
>
> @stas00 Will the unit test run with the CI? I'm...

New prompts for Shades FR/EN

@jzf2101 Are there additional changes that need to be made to this PR or can it be merged?

Training speed in bf16 mode is slow.

Did your comparison fp16 model use `zero`? I notice that you're not using it here.

Latest DeepSpeed Support

@Quentin-Anthony Can you list which DeeperSpeed features would be lost with this move?

Latest DeepSpeed Support

> > @Quentin-Anthony Can you list which DeeperSpeed features would be lost with this move?
>
> Small stuff like logging format, some more detailed timers, and the forward hooks...

How to Specify Negative Binomial Distribution as Duration?

It seems to me that the code is structured in a way that fundamentally assumes that durations are discrete. To change this, one would have to change the way durations...