Results: 11 comments of youngrok cha

@ArthurZucker Oh, you're right. Thanks.

I hope this feature gets added soon!

Maybe WHISPER__DEVICE_INDEX or WHISPER__NUM_WORKERS could work?
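(For anyone wondering how double-underscore variables like these usually work: a minimal sketch below, assuming the server reads its config with pydantic-settings and a `__` nested delimiter. The `WhisperConfig`/`Settings` names are illustrative, not the project's actual classes.)

```python
# Hypothetical sketch: how WHISPER__DEVICE_INDEX / WHISPER__NUM_WORKERS
# could map onto nested settings IF the server uses pydantic-settings.
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

class WhisperConfig(BaseModel):   # illustrative name
    device_index: int = 0         # would be set via WHISPER__DEVICE_INDEX
    num_workers: int = 1          # would be set via WHISPER__NUM_WORKERS

class Settings(BaseSettings):     # illustrative name
    model_config = SettingsConfigDict(env_nested_delimiter="__")
    whisper: WhisperConfig = WhisperConfig()

# e.g. WHISPER__DEVICE_INDEX=1 WHISPER__NUM_WORKERS=2 python server.py
settings = Settings()
print(settings.whisper.device_index, settings.whisper.num_workers)
```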

It worked, but it looks like the model isn't tensor-parallelized; instead, a full copy of the model is loaded on each GPU. Am I right?
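(A quick way to sanity-check this, not from the original thread: with plain replication, every GPU holds the full weights, while tensor parallelism shards them across devices, so per-GPU allocated memory should be roughly total size divided by the number of GPUs. A rough probe:)

```python
# Rough heuristic: compare per-GPU allocated memory right after loading.
# If each GPU shows roughly the full model size, the model was replicated
# rather than tensor-parallelized (sharded).
import torch

for i in range(torch.cuda.device_count()):
    allocated_gib = torch.cuda.memory_allocated(i) / 1024**3
    print(f"GPU {i}: {allocated_gib:.2f} GiB allocated")
```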

Since this commit (https://github.com/deepspeedai/DeepSpeed/pull/4906), partitioned parameters are updated only when ds_secondary_partition_tensor is None. And ds_secondary_partition_tensor only becomes None after the optimizer.step function is called (that function contains the logic that invalidates the secondary...
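To illustrate the interaction I mean, here is a tiny self-contained sketch. This is my reading of the behavior, not the actual DeepSpeed source; `FakeZeroParam` and the helpers are made up for the example:

```python
# Simplified sketch (NOT actual DeepSpeed source) of the guard described
# above: after PR #4906, the partitioned parameter is only refreshed while
# ds_secondary_partition_tensor is None, and only optimizer.step() resets
# it to None -- so updates issued before step() are silently skipped.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeZeroParam:
    ds_tensor: float                                    # stand-in for the primary partition
    ds_secondary_partition_tensor: Optional[float] = None

def update_partitioned_param(p: FakeZeroParam, new_value: float) -> None:
    if p.ds_secondary_partition_tensor is None:         # the PR #4906 guard
        p.ds_tensor = new_value
    # else: the update is skipped and the partition stays stale

def optimizer_step(p: FakeZeroParam) -> None:
    p.ds_secondary_partition_tensor = None              # invalidation happens here

p = FakeZeroParam(ds_tensor=0.0, ds_secondary_partition_tensor=1.0)
update_partitioned_param(p, 42.0)   # skipped: secondary tensor still set
print(p.ds_tensor)                  # 0.0 -> stale value
optimizer_step(p)
update_partitioned_param(p, 42.0)   # now applied
print(p.ds_tensor)                  # 42.0
```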

I'm not sure because this logic is a bit complicated, but IMO, while HfArgumentParser.parse_args_into_dataclasses is executed, DeepSpeed ZeRO-3 is enabled by this (https://github.com/huggingface/transformers/blob/v4.51.2/src/transformers/training_args.py#L2046). And while loading the model with the from_pretrained method,...
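Roughly, the ordering I suspect is the one below (standard transformers APIs; the model name and run flags like `--deepspeed ds_zero3.json` are just examples):

```python
# Sketch of the suspected ordering: parsing the args is what flips on
# ZeRO-3 globally, so any from_pretrained call AFTER it runs under
# deepspeed.zero.Init and gets its weights partitioned at load time.
# run: python repro.py --output_dir out --deepspeed ds_zero3.json
from transformers import AutoModelForCausalLM, HfArgumentParser, TrainingArguments

parser = HfArgumentParser(TrainingArguments)
# parse_args_into_dataclasses() triggers TrainingArguments.__post_init__,
# which (via the training_args.py logic linked above) enables ZeRO-3 when
# the deepspeed config requests stage 3.
(training_args,) = parser.parse_args_into_dataclasses()

# Because ZeRO-3 is already enabled at this point, this load happens
# inside zero.Init and the parameters are partitioned immediately.
model = AutoModelForCausalLM.from_pretrained("gpt2")
```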

This is the most minimal reproducing code I could make. Running this code with the command at the bottom reproduces the issue I encountered :) # deepspeed_init.py ```python from...
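(Since the excerpt cuts off here, purely for readers of this page: below is a generic skeleton of what a standalone ZeRO-3 repro script usually looks like. This is NOT the original deepspeed_init.py, just a hypothetical illustration.)

```python
# Hypothetical skeleton of a standalone DeepSpeed ZeRO-3 repro script
# (not the original, truncated deepspeed_init.py).
import deepspeed
import torch

model = torch.nn.Linear(8, 8)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config={
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {"stage": 3},
        "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    },
)
loss = engine(torch.randn(1, 8)).sum()
engine.backward(loss)
engine.step()
# Typical launch command (illustrative): deepspeed --num_gpus=2 deepspeed_init.py
```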