Batch_input and elapsed time per iteration slow down during model training
Arguments
data_impl ....................... mmap........................updated
deepspeed_extra_args ............ {'bf16': {'enabled': True}}.updated
dynamic_loss_scale .............. True........................updated
eval_interval ................... 40000.......................updated
eval_iters ...................... 10..........................updated
fp32_allreduce .................. True........................updated
global_num_gpus ................. 4...........................updated
gpt_j_residual .................. True........................updated
hidden_size ..................... 768.........................updated
init_method ..................... small_init..................updated
is_pipe_parallel ................ True........................updated
launcher ........................ slurm.......................updated
log_interval .................... 10..........................updated
lr .............................. 0.0006......................updated
lr_decay_iters .................. 143000......................updated
lr_decay_style .................. cosine......................updated
max_position_embeddings ......... 2048........................updated
min_lr .......................... 6e-05.......................updated
no_weight_tying ................. True........................updated
num_attention_heads ............. 12..........................updated
num_layers ...................... 12..........................updated
num_workers ..................... 32..........................updated
optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.0006, 'betas': [0.9, 0.95], 'eps': 1e-08}}updated
optimizer_type .................. Adam........................updated
output_layer_init_method ........ wang_init...................updated
partition_activations ........... True........................updated
pipe_parallel_size .............. 1...........................updated
pos_emb ......................... rotary......................updated
precision ....................... bfloat16....................updated
rotary_pct ...................... 0.25........................updated
save ............................ /pythia/checkpoints/test_1updated
save_iters ...................... [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000]updated
seq_length ...................... 2048........................updated
sparsity_config ................. {}..........................updated
synchronize_each_layer .......... True........................updated
test_data_paths ................. ['/pile_0.87_deduped_text_document/pile_0.87_deduped_text_document']updated
test_data_weights ............... [1.0].......................updated
text_gen_type ................... unconditional...............updated
tokenizer_type .................. HFTokenizer.................updated
train_batch_size ................ 128.........................updated
train_data_paths ................ ['/pile_0.87_deduped_text_document/pile_0.87_deduped_text_document']updated
train_data_weights .............. [1.0].......................updated
train_iters ..................... 143000......................updated
train_micro_batch_size_per_gpu .. 32..........................updated
user_script ..................... train.py....................updated
valid_data_paths ................ ['pile_0.87_deduped_text_document/pile_0.87_deduped_text_document']updated
valid_data_weights .............. [1.0].......................updated
vocab_file ....................../pythia/utils/20B_tokenizer.jsonupdated
wall_clock_breakdown ............ True........................updated
zero_allgather_bucket_size ...... 500000000...................updated
zero_contiguous_gradients ....... True........................updated
zero_optimization ............... {'stage': 0, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': False, 'load_from_fp32_weights': False}updated
zero_reduce_bucket_size ......... 500000000...................updated
zero_reduce_scatter ............. True........................updated
zero_stage ...................... 0...........................updated
account ......................... None........................default
activation ...................... gelu........................default
activation_checkpointing ........ None........................default
adlr_autoresume ................. False.......................default
adlr_autoresume_interval ........ 1000........................default
amp ............................. None........................default
apply_query_key_layer_scaling ... False.......................default
attention_dropout ............... 0...........................default
attention_softmax_in_fp32 ....... False.......................default
autotuning ...................... None........................default
autotuning_run .................. None........................default
base_shapes_file ................ None........................default
bf16 ............................ None........................default
bias_dropout_fusion ............. False.......................default
bias_gelu_fusion ................ False.......................default
char_level_ppl .................. False.......................default
checkpoint ...................... None........................default
checkpoint_in_cpu ............... False.......................default
checkpoint_num_layers ........... 1...........................default
checkpoint_scale ................ linear......................default
checkpoint_validation_with_forward_pass False................default
clip_grad ....................... 1.0.........................default
comment ......................... None........................default
comms_logger .................... None........................default
communication_data_type ......... None........................default
compression_training ............ None........................default
contiguous_checkpointing ........ False.......................default
coord_check ..................... False.......................default
create_moe_param_group .......... True........................default
csv_monitor ..................... None........................default
curriculum_learning ............. None........................default
curriculum_seqlen ............... 0...........................default
data_efficiency ................. None........................default
data_path ....................... None........................default
data_types ...................... None........................default
deepscale ....................... False.......................default
deepscale_config ................ None........................default
deepspeed ....................... True........................default
deepspeed_activation_checkpointing True......................default
deepspeed_mpi ................... False.......................default
deepspeed_slurm ................. False.......................default
detect_nvlink_pairs ............. False.......................default
distributed_backend ............. nccl........................default
do_test ......................... None........................default
do_train ........................ None........................default
do_valid ........................ None........................default
dump_state ...................... False.......................default
elasticity ...................... None........................default
enable_expert_tensor_parallelism False.......................default
eod_mask_loss ................... False.......................default
eval_results_prefix ............. ............................default
eval_tasks ...................... None........................default
exclude ......................... None........................default
exit_interval ................... None........................default
expert_interval ................. 2...........................default
extra_save_iters ................ None........................default
finetune ........................ False.......................default
flops_profiler .................. None........................default
force_multi ..................... False.......................default
fp16 ............................ None........................default
fp16_lm_cross_entropy ........... False.......................default
git_hash ........................ 4c426da.....................default
gmlp_attn_dim ................... 64..........................default
gpt_j_tied ...................... False.......................default
gradient_accumulation_steps ..... 1...........................default
gradient_clipping ............... 1.0.........................default
gradient_noise_scale_cpu_offload False.......................default
gradient_noise_scale_n_batches .. 5...........................default
gradient_predivide_factor ....... 1.0.........................default
hidden_dropout .................. 0...........................default
hostfile ........................ None........................default
hysteresis ...................... 2...........................default
include ......................... None........................default
init_method_std ................. 0.02........................default
intermediate_size ............... None........................default
iteration ....................... None........................default
keep_last_n_checkpoints ......... None........................default
label_data_paths ................ None........................default
layernorm_epsilon ............... 1e-05.......................default
layernorm_fusion ................ False.......................default
lazy_mpu_init ................... False.......................default
load ............................ None........................default
local_rank ...................... None........................default
log_dir ......................... None........................default
log_grad_norm ................... False.......................default
log_grad_pct_zeros .............. False.......................default
log_gradient_noise_scale ........ False.......................default
log_optimizer_states ............ False.......................default
log_param_norm .................. False.......................default
loss_scale ...................... None........................default
loss_scale_window ............... 1000.0......................default
make_vocab_size_divisible_by .... 128.........................default
mamba_causal_conv_fusion ........ False.......................default
mamba_inner_func_fusion ......... False.......................default
mamba_selective_fp32_params ..... True........................default
mamba_selective_scan_fusion ..... False.......................default
mamba_use_bias_in_conv .......... True........................default
mamba_use_bias_in_linears ....... False.......................default
master_addr ..................... None........................default
master_port ..................... 29500.......................default
maximum_tokens .................. 64..........................default
memory_profiling ................ False.......................default
memory_profiling_path ........... None........................default
merge_file ...................... None........................default
min_scale ....................... 1.0.........................default
mlp_type ........................ regular.....................default
mmap_warmup ..................... False.......................default
model_parallel_size ............. 1...........................default
moe_eval_capacity_factor ........ 1.0.........................default
moe_expert_parallel_size ........ 1...........................default
moe_glu ......................... False.......................default
moe_jitter_eps .................. None........................default
moe_lbl_in_fp32 ................. False.......................default
moe_loss_coeff .................. 0.1.........................default
moe_min_capacity ................ 4...........................default
moe_num_experts ................. 1...........................default
moe_token_dropping .............. False.......................default
moe_top_k ....................... 1...........................default
moe_train_capacity_factor ....... 1.0.........................default
moe_type ........................ megablocks..................default
moe_use_residual ................ True........................default
mup_attn_temp ................... 1.0.........................default
mup_embedding_mult .............. 1.0.........................default
mup_init_scale .................. 1.0.........................default
mup_output_temp ................. 1.0.........................default
mup_rp_embedding_mult ........... 1.0.........................default
mup_width_scale ................. 2...........................default
no_load_optim ................... False.......................default
no_load_rng ..................... False.......................default
no_save_optim ................... False.......................default
no_save_rng ..................... False.......................default
no_ssh_check .................... False.......................default
norm ............................ layernorm...................default
num_gpus ........................ None........................default
num_kv_heads .................... None........................default
num_nodes ....................... -1..........................default
num_samples ..................... 1...........................default
num_unique_layers ............... None........................default
onnx_safe ....................... False.......................default
opt_pos_emb_offset .............. 0...........................default
output_layer_parallelism ........ column......................default
override_lr_scheduler ........... False.......................default
padded_vocab_size ............... None........................default
param_sharing_style ............. grouped.....................default
pipe_partition_method ........... type:transformer|mlp........default
prescale_gradients .............. False.......................default
profile ......................... False.......................default
profile_backward ................ False.......................default
profile_step_start .............. 10..........................default
profile_step_stop ............... 12..........................default
prompt_end ...................... ...........................default
rank ............................ None........................default
recompute ....................... False.......................default
return_logits ................... False.......................default
rms_norm_epsilon ................ 1e-08.......................default
rope_fusion ..................... False.......................default
rotary_emb_base ................. 10000.......................default
rotary_save_freqs_buffer ........ False.......................default
rpe_max_distance ................ 128.........................default
rpe_num_buckets ................. 32..........................default
s3_chunk_size ................... 104857600...................default
s3_path ......................... None........................default
sample_input_file ............... None........................default
sample_output_file .............. samples.txt.................default
save_base_shapes ................ False.......................default
scaled_masked_softmax_fusion .... False.......................default
scaled_upper_triang_masked_softmax_fusion False..............default
scalenorm_epsilon ............... 1e-08.......................default
scheduler ....................... None........................default
seed ............................ 1234........................default
short_seq_prob .................. 0.1.........................default
sliding_window_width ............ None........................default
soft_prompt_tuning .............. None........................default
sparse_attention ................ None........................default
sparse_gradients ................ False.......................default
split ........................... 969, 30, 1..................default
steps_per_print ................. 10..........................default
temperature ..................... 0.0.........................default
tensorboard ..................... None........................default
tensorboard_dir ................. None........................default
top_k ........................... 0...........................default
top_p ........................... 0.0.........................default
use_bias_in_attn_linear ......... True........................default
use_bias_in_norms ............... True........................default
use_bnb_optimizer ............... False.......................default
use_checkpoint_lr_scheduler ..... False.......................default
use_cpu_initialization .......... False.......................default
use_mup ......................... False.......................default
use_qk_layernorm ................ False.......................default
use_shared_fs ................... True........................default
use_tutel ....................... False.......................default
use_wandb ....................... None........................default
wandb ........................... None........................default
wandb_group ..................... None........................default
wandb_host ...................... https://api.wandb.ai........default
wandb_init_all_ranks ............ False.......................default
wandb_project ................... neox........................default
wandb_team ...................... None........................default
warmup .......................... 0.01........................default
weight_by_num_documents ......... False.......................default
weight_decay .................... 0.1.........................default
weighted_sampler_alpha .......... 1.0.........................default
world_size ...................... None........................default
Environment:
- PyTorch version: 2.3.1
- CUDA version: 12.2
- NCCL version: 2.20.5
Hardware:
- GPU: A100-SXM4-40GB
- CPU: AMD EPYC 7543 32-Core Processor
- Memory: 263793632 kB (total), 195607748 kB (free)
Can you try running without deepspeed? Thanks
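To check whether the data pipeline alone gets slower over time, independent of DeepSpeed, it may help to time how long each batch fetch takes outside the training loop. The following is only a rough sketch: `time_batches` is a hypothetical helper, not part of GPT-NeoX or DeepSpeed, and it works on any Python iterable (for example a PyTorch `DataLoader` built over the same dataset). The window of 10 is chosen to mirror the `log_interval` of 10 in the arguments above.

```python
import time
from typing import Iterable, List


def time_batches(loader: Iterable, num_batches: int, window: int = 10) -> List[float]:
    """Return the mean wait time (seconds) per batch, averaged over each window.

    Hypothetical helper for diagnosis only: it measures how long `next()` blocks
    on the loader and does no model computation.
    """
    it = iter(loader)
    window_means: List[float] = []
    waits: List[float] = []
    for _ in range(num_batches):
        start = time.perf_counter()
        try:
            next(it)  # time spent waiting for the next batch
        except StopIteration:
            break  # loader exhausted before num_batches was reached
        waits.append(time.perf_counter() - start)
        if len(waits) == window:
            window_means.append(sum(waits) / window)
            waits = []
    if waits:  # partial final window
        window_means.append(sum(waits) / len(waits))
    return window_means


if __name__ == "__main__":
    # Stand-in loader whose fetch time grows every 10 batches (assumption for the demo).
    def fake_loader():
        for i in range(50):
            time.sleep(0.001 * (i // 10))
            yield i

    for idx, mean in enumerate(time_batches(fake_loader(), 50)):
        print(f"batches {idx * 10}-{idx * 10 + 9}: mean wait {mean * 1000:.2f} ms")
```

If the per-window mean wait grows here as well, the slowdown is likely in data loading (for example the mmap dataset or the worker processes) rather than in DeepSpeed; if it stays flat, the training step itself is the more likely place to look.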
Marking as stale. No activity in 60 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.