Doohae Jung

Results: 6 issues by Doohae Jung

Thanks for opening so many awesome repos publicly! While I was studying how electra works, I found `HiddenLayerExtractor` in electra_pytorch.py, which appears to manipulate the output of specific...
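For readers skimming this issue, here is a minimal sketch of the usual hidden-layer-extraction pattern in PyTorch: wrap the network and capture one named submodule's output with a forward hook. The class body below is an illustration under that assumption, not the actual code from electra_pytorch.py.

```python
import torch
from torch import nn


class HiddenLayerExtractor(nn.Module):
    """Illustrative wrapper that captures the output of one named submodule
    via a forward hook (a sketch of the general pattern, not the repo's code)."""

    def __init__(self, net: nn.Module, layer_name: str):
        super().__init__()
        self.net = net
        self.hidden = None
        # Look up the submodule by its dotted name and attach a hook to it.
        layer = dict(net.named_modules())[layer_name]
        layer.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        # Store the intermediate activation produced during the forward pass.
        self.hidden = output

    def forward(self, *args, **kwargs):
        self.hidden = None
        _ = self.net(*args, **kwargs)
        return self.hidden
```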

Thanks for posting a really nice repo! While I was studying the code, I found the following at lines 669 and 691 of 'train_dense_encoder.py': ``` surrogate = surrogate * (trainer.distributed_factor...
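For context, the quoted line follows a common pattern: rescaling a surrogate loss by a world-size-dependent factor before the backward pass, so that gradient averaging across ranks keeps the effective loss scale consistent. The sketch below is illustrative only; the `base_world_size` argument and the use of `torch.distributed` to derive the factor are assumptions, not DPR's actual implementation.

```python
import torch
import torch.distributed as dist


def scale_surrogate(surrogate: torch.Tensor, base_world_size: float = 1.0) -> torch.Tensor:
    """Illustrative only: rescale a surrogate loss by the number of ranks.

    When gradients are averaged across processes, multiplying the loss by
    world_size / base_world_size keeps the effective update magnitude
    comparable to a reference run on base_world_size GPUs (an assumption
    about the intent of the quoted line, not a statement of DPR's design).
    """
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    distributed_factor = float(world_size)  # hypothetical analogue of trainer.distributed_factor
    return surrogate * (distributed_factor / base_world_size)
```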

Hi, I was trying to run multi-node training on SLURM nodes, but I have no idea how to configure the `composer` arguments and commands. Is there any example script to run...

enhancement

# What does this PR do? The Qwen2 and Qwen2-MoE models are forced to add bias to the query, key, and value linear projections. However, following the trend with other recent...
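To make the proposal concrete, here is a sketch of gating the query/key/value bias behind a config flag, in the style other recent models in transformers use. The flag name `attention_bias` and its default are assumptions about this PR, not its confirmed contents.

```python
from torch import nn


def build_qkv(config):
    """Sketch: create query/key/value projections whose bias is controlled by
    a config flag instead of being hard-coded to True (illustrative only)."""
    use_bias = getattr(config, "attention_bias", True)  # assumed flag name; True preserves current Qwen2 behaviour
    head_dim = config.hidden_size // config.num_attention_heads
    q_proj = nn.Linear(config.hidden_size, config.num_attention_heads * head_dim, bias=use_bias)
    k_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * head_dim, bias=use_bias)
    v_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * head_dim, bias=use_bias)
    return q_proj, k_proj, v_proj
```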

### Feature request https://github.com/huggingface/transformers/blob/816f4424964c1a1631e303b663fc3d68f731e923/src/transformers/models/mixtral/modeling_mixtral.py#L284 `head_dim` in the `mixtral` model is forced to have the value of `hidden_size // num_heads`. However, this is not the case in the [`llama` model](https://github.com/huggingface/transformers/blob/e95ea479eebb6e01679907db910b5dc5eb64b3c7/src/transformers/models/llama/modeling_llama.py#L290) or even in...
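Here is a sketch of the requested behaviour, mirroring the linked llama code: prefer an explicit `head_dim` from the config and fall back to `hidden_size // num_heads` only when it is absent. Treating this as the eventual fix for mixtral is an assumption.

```python
def resolve_head_dim(config) -> int:
    """Prefer an explicit head_dim from the config, falling back to the
    conventional hidden_size // num_attention_heads (illustrative sketch)."""
    return getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
```

With this fallback, the attention projections are sized as `num_heads * head_dim`, which may legitimately differ from `hidden_size`.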

Feature request

Thank you for releasing the excellent model and work. The paper appears to state that the 8B model was pre-trained natively with a 32K sequence length. I would like to...
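One way to check what the released checkpoint itself advertises is to read its config. The model id below is a placeholder assumption, not the actual checkpoint name.

```python
from transformers import AutoConfig

# Placeholder model id for illustration only; substitute the released 8B checkpoint.
config = AutoConfig.from_pretrained("org/model-8b")
# Most causal LMs expose the native context window here; 32768 would match a 32K claim.
print(config.max_position_embeddings)
```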