
Model doesn't learn with fp16.enabled = true: [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1

visionscaper opened this issue on Jan 26 '21 · 4 comments

Hello,

I'm trying to train a sequence-to-sequence model based on RNNs. When FP16 training is not used (fp16.enabled = false), training works: my loss decreases. However, with FP16 enabled (fp16.enabled = true), I continuously get gradient overflows. I have also trained the same model without DeepSpeed but with PyTorch native AMP enabled, and that works like a charm.

Note that without DeepSpeed, when training in FP16 mode using PyTorch native AMP, I also didn't use any gradient clipping. So my expectation is that training with DeepSpeed in FP16 mode would also not require gradient clipping.
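For reference, the native-AMP training loop I mean looks roughly like the sketch below (model, optimiser and data loader are placeholders, not my actual code), with no gradient clipping anywhere:

import torch

scaler = torch.cuda.amp.GradScaler()          # dynamic loss scaling, same idea as DeepSpeed's fp16 optimizer

for inputs, targets in data_loader:           # placeholder data loader
    optimizer.zero_grad()                     # placeholder optimizer (e.g. torch.optim.Adam)
    with torch.cuda.amp.autocast():           # forward pass in mixed precision
        loss = model(inputs, targets)         # placeholder seq2seq model returning the loss
    scaler.scale(loss).backward()             # scale the loss before backward
    scaler.step(optimizer)                    # unscales grads; skips the step on inf/nan
    scaler.update()                           # adjusts the loss scale
    # note: no torch.nn.utils.clip_grad_norm_ anywhere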

As a reference, after making some small changes, I was also able to train the CIFAR-10 example with DeepSpeed in FP16 mode; this works just fine and the loss decreases.

I would basically like to know what could be going wrong here, and whether there is a specific setting I could use to resolve this issue.

Further, I have tried:

  • multiple variations of optimiser settings (Adam, the original torch.Adam, eps=1e-16, etc.)
  • using different types of RNNs: torch.nn.GRU and Simple Recurrent Units (SRUs)
  • turning off dropout

Unfortunately, nothing seemed to help; I keep getting gradient overflows with fp16.enabled = true in the DeepSpeed config.
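For reference, the fp16 section of the config I'm using looks like this (loss_scale: 0 means dynamic scaling; as far as I understand, a value > 0 would pin a static loss scale, which is one more thing I could try; I already lowered initial_scale_power to 6, i.e. an initial scale of 64, as the second log below shows):

"fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 6,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
}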

The one thing I haven't tried yet is the Nvidia APEX AMP option, because the SRUs I want to work with are not compatible with that library.

Below I have added two logs showing the training results of the first 15 full-batch training steps (600 micro-batch steps with train_micro_batch_size_per_gpu=4, using 3x 1080 Ti GPUs).

The first log is with fp16.enabled = false, where everything works fine. The second is with fp16.enabled = true, where you can see that training doesn't work. I print the loss every 200 iterations.
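(To spell out the numbers: train_micro_batch_size_per_gpu = 4, times 3 GPUs, times gradient_accumulation_steps = 40 gives the train_batch_size of 480 shown in the config, and 600 logged micro-batch iterations / 40 accumulation steps = 15 actual optimizer steps.)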

fp16.enabled = false

[2021-01-26 21:56:21,925] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 3, parameter_parallel_size: 3
[2021-01-26 21:56:21,927] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 3, parameter_parallel_size: 3
[2021-01-26 21:56:21,933] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 3, parameter_parallel_size: 3
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/freddy/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.45142388343811035 seconds
[2021-01-26 21:56:22,527] [INFO] [engine.py:518:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2021-01-26 21:56:22,527] [INFO] [engine.py:521:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.9, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 0.0002
    weight_decay: 0.0
)
[2021-01-26 21:56:22,527] [INFO] [engine.py:551:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.9, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 0.0002
    weight_decay: 0.0
)
[2021-01-26 21:56:22,527] [INFO] [engine.py:386:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2021-01-26 21:56:22,528] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2021-01-26 21:56:22,528] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0002], mom=[[0.9, 0.999]]
[2021-01-26 21:56:22,528] [INFO] [config.py:705:print] DeepSpeedEngine configuration:
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7f310d079310>
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print]   allreduce_always_fp32 ........ False
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print]   amp_enabled .................. False
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print]   amp_params ................... False
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print]   disable_allgather ............ False
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print]   dump_state ................... False
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print]   dynamic_loss_scale_args ...... None
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print]   elasticity_enabled ........... False
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print]   fp16_enabled ................. False
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print]   global_rank .................. 0
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print]   gradient_accumulation_steps .. 40
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print]   gradient_clipping ............ 0.0
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   gradient_predivide_factor .... 1.0
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   initial_dynamic_scale ........ 4294967296
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   loss_scale ................... 0
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   memory_breakdown ............. False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   optimizer_legacy_fusion ...... False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   optimizer_name ............... adam
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   optimizer_params ............. {'lr': 0.0002, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0, 'adam_w_mode': True}
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   pld_enabled .................. False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   pld_params ................... False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   prescale_gradients ........... False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   scheduler_name ............... None
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   scheduler_params ............. None
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   sparse_attention ............. None
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   sparse_gradients_enabled ..... False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   steps_per_print .............. 10
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   tensorboard_enabled .......... False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   tensorboard_output_path ...... 
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   train_batch_size ............. 480
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   train_micro_batch_size_per_gpu  4
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   wall_clock_breakdown ......... False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   world_size ................... 3
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   zero_allow_untested_optimizer  False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print]   zero_config .................. {
    "allgather_bucket_size": 500000000,
    "allgather_partitions": true,
    "contiguous_gradients": false,
    "cpu_offload": false,
    "elastic_checkpoint": true,
    "load_from_fp32_weights": true,
    "overlap_comm": false,
    "reduce_bucket_size": 500000000,
    "reduce_scatter": true,
    "stage": 0
}
[2021-01-26 21:56:22,530] [INFO] [config.py:709:print]   zero_enabled ................. False
[2021-01-26 21:56:22,530] [INFO] [config.py:709:print]   zero_optimization_stage ...... 0
[2021-01-26 21:56:22,530] [INFO] [config.py:715:print]   json = {
    "fp16":{
        "enabled":false,
        "hysteresis":2,
        "initial_scale_power":32,
        "loss_scale":0,
        "loss_scale_window":1000,
        "min_loss_scale":1
    },
    "optimizer":{
        "params":{
            "adam_w_mode":true,
            "betas":[
                0.9,
                0.999
            ],
            "eps":1e-08,
            "lr":0.0002,
            "weight_decay":0.0
        },
        "type":"Adam"
    },
    "steps_per_print":10,
    "train_batch_size":480,
    "train_micro_batch_size_per_gpu":4,
    "wall_clock_breakdown":false
}
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Loading extension module fused_adam...
Time to load fused_adam op: 0.502511739730835 seconds
[2021-01-26 21:56:22,579] [INFO] [engine.py:551:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.9, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 0.0002
    weight_decay: 0.0
)
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Loading extension module fused_adam...
Time to load fused_adam op: 0.5021593570709229 seconds
[2021-01-26 21:56:22,582] [INFO] [engine.py:551:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.9, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 0.0002
    weight_decay: 0.0
)
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/freddy/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.46016359329223633 seconds
Loading extension module utils...
Time to load utils op: 0.5020427703857422 seconds
Loading extension module utils...
Time to load utils op: 0.5018594264984131 seconds
[0] Loss : 8.999998092651367
[0] Loss : 9.15445327758789
[0] Loss : 9.038656234741211
[2021-01-26 21:56:27,111] [INFO] [timer.py:166:stop] 0/10, SamplesPerSec=41.825688446011114
[2021-01-26 21:56:29,632] [INFO] [timer.py:166:stop] 0/20, SamplesPerSec=44.92055533780422
[2021-01-26 21:56:32,695] [INFO] [timer.py:166:stop] 0/30, SamplesPerSec=42.71682476866671
[2021-01-26 21:56:36,806] [INFO] [timer.py:166:stop] 0/40, SamplesPerSec=38.09388969748721
[2021-01-26 21:56:39,871] [INFO] [timer.py:166:stop] 0/50, SamplesPerSec=38.32450114890898
[2021-01-26 21:56:43,123] [INFO] [timer.py:166:stop] 0/60, SamplesPerSec=38.08162107383462
[2021-01-26 21:56:46,116] [INFO] [timer.py:166:stop] 0/70, SamplesPerSec=38.37604235762581
[2021-01-26 21:56:50,138] [INFO] [timer.py:166:stop] 0/80, SamplesPerSec=37.02720955340785
[2021-01-26 21:56:53,651] [INFO] [timer.py:166:stop] 0/90, SamplesPerSec=36.68486886253886
[2021-01-26 21:56:56,666] [INFO] [timer.py:166:stop] 0/100, SamplesPerSec=36.98690788337102
[2021-01-26 21:56:59,704] [INFO] [timer.py:166:stop] 0/110, SamplesPerSec=37.2128951188848
[2021-01-26 21:57:03,404] [INFO] [timer.py:166:stop] 0/120, SamplesPerSec=36.75958540038136
[2021-01-26 21:57:06,613] [INFO] [timer.py:166:stop] 0/130, SamplesPerSec=36.813475069881996
[2021-01-26 21:57:09,089] [INFO] [timer.py:166:stop] 0/140, SamplesPerSec=37.47199109015046
[2021-01-26 21:57:12,203] [INFO] [timer.py:166:stop] 0/150, SamplesPerSec=37.547663000349296
[2021-01-26 21:57:15,901] [INFO] [timer.py:166:stop] 0/160, SamplesPerSec=37.18199497187463
[2021-01-26 21:57:19,326] [INFO] [timer.py:166:stop] 0/170, SamplesPerSec=37.050755020731636
[2021-01-26 21:57:22,658] [INFO] [timer.py:166:stop] 0/180, SamplesPerSec=36.99534049157816
[2021-01-26 21:57:26,399] [INFO] [timer.py:166:stop] 0/190, SamplesPerSec=36.699293495681665
[2021-01-26 21:57:30,106] [INFO] [timer.py:166:stop] 0/200, SamplesPerSec=36.45669125272261
[200] Loss : 6.800840377807617
[200] Loss : 6.5157470703125
[200] Loss : 6.383593559265137
[2021-01-26 21:57:33,361] [INFO] [timer.py:166:stop] 0/210, SamplesPerSec=36.47919520372467
[2021-01-26 21:57:36,581] [INFO] [timer.py:166:stop] 0/220, SamplesPerSec=36.51791265257378
[2021-01-26 21:57:39,454] [INFO] [timer.py:166:stop] 0/230, SamplesPerSec=36.72364568413719
[2021-01-26 21:57:43,891] [INFO] [timer.py:166:stop] 0/240, SamplesPerSec=36.18249218460241
[2021-01-26 21:57:47,133] [INFO] [timer.py:166:stop] 0/250, SamplesPerSec=36.217873588699675
[2021-01-26 21:57:49,918] [INFO] [timer.py:166:stop] 0/260, SamplesPerSec=36.44586303725491
[2021-01-26 21:57:53,301] [INFO] [timer.py:166:stop] 0/270, SamplesPerSec=36.41131573204769
[2021-01-26 21:57:56,770] [INFO] [timer.py:166:stop] 0/280, SamplesPerSec=36.34533813259211
[2021-01-26 21:57:59,588] [INFO] [timer.py:166:stop] 0/290, SamplesPerSec=36.53348492078434
[2021-01-26 21:58:02,475] [INFO] [timer.py:166:stop] 0/300, SamplesPerSec=36.68489320878963
[2021-01-26 21:58:05,754] [INFO] [timer.py:166:stop] 0/310, SamplesPerSec=36.68418674629942
[2021-01-26 21:58:11,375] [INFO] [timer.py:166:stop] 0/320, SamplesPerSec=35.87610928521094
[2021-01-26 21:58:14,369] [INFO] [timer.py:166:stop] 0/330, SamplesPerSec=35.99301471144916
[2021-01-26 21:58:17,638] [INFO] [timer.py:166:stop] 0/340, SamplesPerSec=36.015777056006975
[2021-01-26 21:58:19,991] [INFO] [timer.py:166:stop] 0/350, SamplesPerSec=36.32426130453177
[2021-01-26 21:58:24,502] [INFO] [timer.py:166:stop] 0/360, SamplesPerSec=35.95910636974815
[2021-01-26 21:58:27,445] [INFO] [timer.py:166:stop] 0/370, SamplesPerSec=36.07671240791601
[2021-01-26 21:58:30,814] [INFO] [timer.py:166:stop] 0/380, SamplesPerSec=36.06615135814506
[2021-01-26 21:58:33,900] [INFO] [timer.py:166:stop] 0/390, SamplesPerSec=36.13561707101366
[2021-01-26 21:58:37,243] [INFO] [logging.py:60:log_dist] [Rank 0] step=10, skipped=0, lr=[0.0002], mom=[[0.9, 0.999]]
[2021-01-26 21:58:37,745] [INFO] [timer.py:166:stop] 0/400, SamplesPerSec=35.994666477704214
[400] Loss : 5.665283203125
[400] Loss : 5.826916694641113
[400] Loss : 6.230928897857666
[2021-01-26 21:58:41,046] [INFO] [timer.py:166:stop] 0/410, SamplesPerSec=36.0048993901217
[2021-01-26 21:58:44,993] [INFO] [timer.py:166:stop] 0/420, SamplesPerSec=35.848783717439545
[2021-01-26 21:58:47,882] [INFO] [timer.py:166:stop] 0/430, SamplesPerSec=35.96537794509143
[2021-01-26 21:58:50,543] [INFO] [timer.py:166:stop] 0/440, SamplesPerSec=36.13387347202367
[2021-01-26 21:58:54,031] [INFO] [timer.py:166:stop] 0/450, SamplesPerSec=36.094787379021554
[2021-01-26 21:58:57,270] [INFO] [timer.py:166:stop] 0/460, SamplesPerSec=36.11669077796843
[2021-01-26 21:59:01,135] [INFO] [timer.py:166:stop] 0/470, SamplesPerSec=35.99268381574188
[2021-01-26 21:59:04,756] [INFO] [timer.py:166:stop] 0/480, SamplesPerSec=35.929526864717516
[2021-01-26 21:59:08,056] [INFO] [timer.py:166:stop] 0/490, SamplesPerSec=35.93959961536687
[2021-01-26 21:59:11,585] [INFO] [timer.py:166:stop] 0/500, SamplesPerSec=35.90001015391583
[2021-01-26 21:59:14,563] [INFO] [timer.py:166:stop] 0/510, SamplesPerSec=35.97882584862603
[2021-01-26 21:59:18,774] [INFO] [timer.py:166:stop] 0/520, SamplesPerSec=35.798695039910946
[2021-01-26 21:59:21,891] [INFO] [timer.py:166:stop] 0/530, SamplesPerSec=35.847454337742384
[2021-01-26 21:59:25,483] [INFO] [timer.py:166:stop] 0/540, SamplesPerSec=35.80024361048666
[2021-01-26 21:59:28,410] [INFO] [timer.py:166:stop] 0/550, SamplesPerSec=35.88450805809668
[2021-01-26 21:59:31,547] [INFO] [timer.py:166:stop] 0/560, SamplesPerSec=35.92561504682371
[2021-01-26 21:59:34,450] [INFO] [timer.py:166:stop] 0/570, SamplesPerSec=36.00956924962096
[2021-01-26 21:59:37,708] [INFO] [timer.py:166:stop] 0/580, SamplesPerSec=36.02444682541163
[2021-01-26 21:59:40,886] [INFO] [timer.py:166:stop] 0/590, SamplesPerSec=36.0539303626204
[2021-01-26 21:59:45,092] [INFO] [timer.py:166:stop] 0/600, SamplesPerSec=35.89674910436057
[600] Loss : 4.872910976409912
[600] Loss : 5.67504358291626
[600] Loss : 4.949432373046875

fp16.enabled = true

[2021-01-26 22:03:00,961] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 3, parameter_parallel_size: 3
[2021-01-26 22:03:00,974] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 3, parameter_parallel_size: 3
[2021-01-26 22:03:00,983] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 3, parameter_parallel_size: 3
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/freddy/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.4628634452819824 seconds
[2021-01-26 22:03:01,585] [INFO] [engine.py:603:_configure_fp16_optimizer] Creating fp16 optimizer with dynamic loss scale
[2021-01-26 22:03:01,595] [INFO] [engine.py:551:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.9, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 0.0002
    step: 1
    weight_decay: 0.0
)
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Loading extension module fused_adam...
Time to load fused_adam op: 0.5023303031921387 seconds
[2021-01-26 22:03:01,624] [INFO] [engine.py:518:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2021-01-26 22:03:01,625] [INFO] [engine.py:521:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.9, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 0.0002
    weight_decay: 0.0
)
[2021-01-26 22:03:01,625] [INFO] [engine.py:603:_configure_fp16_optimizer] Creating fp16 optimizer with dynamic loss scale
Loading extension module fused_adam...
Time to load fused_adam op: 0.5022878646850586 seconds
[2021-01-26 22:03:01,630] [INFO] [engine.py:603:_configure_fp16_optimizer] Creating fp16 optimizer with dynamic loss scale
[2021-01-26 22:03:01,630] [INFO] [engine.py:551:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.9, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 0.0002
    step: 1
    weight_decay: 0.0
)
[2021-01-26 22:03:01,631] [INFO] [engine.py:386:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2021-01-26 22:03:01,631] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2021-01-26 22:03:01,631] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0002], mom=[[0.9, 0.999]]
[2021-01-26 22:03:01,631] [INFO] [config.py:705:print] DeepSpeedEngine configuration:
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7f5d487266d0>
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   allreduce_always_fp32 ........ False
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   amp_enabled .................. False
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   amp_params ................... False
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   disable_allgather ............ False
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   dump_state ................... False
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   dynamic_loss_scale_args ...... {'init_scale': 64, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   elasticity_enabled ........... False
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   fp16_enabled ................. True
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   global_rank .................. 0
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   gradient_accumulation_steps .. 40
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   gradient_clipping ............ 0.0
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   gradient_predivide_factor .... 1.0
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   initial_dynamic_scale ........ 64
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   loss_scale ................... 0
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   memory_breakdown ............. False
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print]   optimizer_legacy_fusion ...... False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   optimizer_name ............... adam
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   optimizer_params ............. {'lr': 0.0002, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0, 'adam_w_mode': True}
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   pld_enabled .................. False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   pld_params ................... False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   prescale_gradients ........... False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   scheduler_name ............... None
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   scheduler_params ............. None
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   sparse_attention ............. None
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   sparse_gradients_enabled ..... False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   steps_per_print .............. 10
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   tensorboard_enabled .......... False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   tensorboard_output_path ...... 
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   train_batch_size ............. 480
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   train_micro_batch_size_per_gpu  4
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   wall_clock_breakdown ......... False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   world_size ................... 3
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   zero_allow_untested_optimizer  False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   zero_config .................. {
    "allgather_bucket_size": 500000000,
    "allgather_partitions": true,
    "contiguous_gradients": false,
    "cpu_offload": false,
    "elastic_checkpoint": true,
    "load_from_fp32_weights": true,
    "overlap_comm": false,
    "reduce_bucket_size": 500000000,
    "reduce_scatter": true,
    "stage": 0
}
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   zero_enabled ................. False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print]   zero_optimization_stage ...... 0
[2021-01-26 22:03:01,633] [INFO] [config.py:715:print]   json = {
    "fp16":{
        "enabled":true,
        "hysteresis":2,
        "initial_scale_power":6,
        "loss_scale":0,
        "loss_scale_window":1000,
        "min_loss_scale":1
    },
    "optimizer":{
        "params":{
            "adam_w_mode":true,
            "betas":[
                0.9,
                0.999
            ],
            "eps":1e-08,
            "lr":0.0002,
            "weight_decay":0.0
        },
        "type":"Adam"
    },
    "steps_per_print":10,
    "train_batch_size":480,
    "train_micro_batch_size_per_gpu":4,
    "wall_clock_breakdown":false
}
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
[2021-01-26 22:03:01,636] [INFO] [engine.py:551:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.9, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 0.0002
    step: 1
    weight_decay: 0.0
)
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/freddy/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.46137571334838867 seconds
Loading extension module utils...
Time to load utils op: 0.5019605159759521 seconds
Loading extension module utils...
Time to load utils op: 0.5017917156219482 seconds
[0] Loss : 9.222114562988281
[0] Loss : 9.164878845214844
[0] Loss : 9.16104507446289
[2021-01-26 22:03:06,380] [INFO] [timer.py:166:stop] 0/10, SamplesPerSec=33.97458149325142
[2021-01-26 22:03:10,020] [INFO] [timer.py:166:stop] 0/20, SamplesPerSec=33.436643374152254
[2021-01-26 22:03:14,097] [INFO] [timer.py:166:stop] 0/30, SamplesPerSec=31.906658939683734
[2021-01-26 22:03:18,711] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 0
[2021-01-26 22:03:18,711] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 0
[2021-01-26 22:03:18,711] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 64 to 32.0
[2021-01-26 22:03:18,711] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 64 to 32.0
[2021-01-26 22:03:18,711] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 64, reducing to 32.0
[2021-01-26 22:03:18,711] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 0
[2021-01-26 22:03:18,711] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 64 to 32.0
[2021-01-26 22:03:18,712] [INFO] [timer.py:166:stop] 0/40, SamplesPerSec=30.12004606159127
[2021-01-26 22:03:21,767] [INFO] [timer.py:166:stop] 0/50, SamplesPerSec=31.665849924273356
[2021-01-26 22:03:25,613] [INFO] [timer.py:166:stop] 0/60, SamplesPerSec=31.59344673408403
[2021-01-26 22:03:29,348] [INFO] [timer.py:166:stop] 0/70, SamplesPerSec=31.677591534884105
[2021-01-26 22:03:35,362] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 1
[2021-01-26 22:03:35,362] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 32.0 to 16.0
[2021-01-26 22:03:35,362] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 1
[2021-01-26 22:03:35,362] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 1
[2021-01-26 22:03:35,362] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 32.0 to 16.0
[2021-01-26 22:03:35,362] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 32.0 to 16.0
[2021-01-26 22:03:35,362] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32.0, reducing to 16.0
[2021-01-26 22:03:35,363] [INFO] [timer.py:166:stop] 0/80, SamplesPerSec=29.463939879601927
[2021-01-26 22:03:39,386] [INFO] [timer.py:166:stop] 0/90, SamplesPerSec=29.50914591740365
[2021-01-26 22:03:42,581] [INFO] [timer.py:166:stop] 0/100, SamplesPerSec=30.174574428668237
[2021-01-26 22:03:46,681] [INFO] [timer.py:166:stop] 0/110, SamplesPerSec=30.092707542682465
[2021-01-26 22:03:52,492] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 2
[2021-01-26 22:03:52,492] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 16.0 to 8.0
[2021-01-26 22:03:52,492] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 2
[2021-01-26 22:03:52,492] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 16.0 to 8.0
[2021-01-26 22:03:52,492] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16.0, reducing to 8.0
[2021-01-26 22:03:52,492] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 2
[2021-01-26 22:03:52,492] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 16.0 to 8.0
[2021-01-26 22:03:52,493] [INFO] [timer.py:166:stop] 0/120, SamplesPerSec=28.97346976173917
[2021-01-26 22:03:55,650] [INFO] [timer.py:166:stop] 0/130, SamplesPerSec=29.524899679768
[2021-01-26 22:03:59,609] [INFO] [timer.py:166:stop] 0/140, SamplesPerSec=29.58340685172515
[2021-01-26 22:04:03,811] [INFO] [timer.py:166:stop] 0/150, SamplesPerSec=29.515393227502443
[2021-01-26 22:04:09,190] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 3
[2021-01-26 22:04:09,190] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 8.0 to 4.0
[2021-01-26 22:04:09,190] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 3
[2021-01-26 22:04:09,190] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 3
[2021-01-26 22:04:09,191] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 8.0 to 4.0
[2021-01-26 22:04:09,191] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 8.0 to 4.0
[2021-01-26 22:04:09,191] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8.0, reducing to 4.0
[2021-01-26 22:04:09,191] [INFO] [timer.py:166:stop] 0/160, SamplesPerSec=28.925970601141177
[2021-01-26 22:04:13,267] [INFO] [timer.py:166:stop] 0/170, SamplesPerSec=28.958373700812057
[2021-01-26 22:04:16,777] [INFO] [timer.py:166:stop] 0/180, SamplesPerSec=29.21181410836682
[2021-01-26 22:04:20,485] [INFO] [timer.py:166:stop] 0/190, SamplesPerSec=29.366318279309837
[2021-01-26 22:04:26,297] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 4
[2021-01-26 22:04:26,297] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 4.0 to 2.0
[2021-01-26 22:04:26,297] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 4
[2021-01-26 22:04:26,297] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 4
[2021-01-26 22:04:26,297] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 4.0 to 2.0
[2021-01-26 22:04:26,297] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 4.0 to 2.0
[2021-01-26 22:04:26,297] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 4.0, reducing to 2.0
[2021-01-26 22:04:26,297] [INFO] [timer.py:166:stop] 0/200, SamplesPerSec=28.755044316105973
[200] Loss : 9.228988647460938
[200] Loss : 9.030767440795898
[200] Loss : 9.148602485656738
[2021-01-26 22:04:30,651] [INFO] [timer.py:166:stop] 0/210, SamplesPerSec=28.697343434387207
[2021-01-26 22:04:34,646] [INFO] [timer.py:166:stop] 0/220, SamplesPerSec=28.758270700378173
[2021-01-26 22:04:39,122] [INFO] [timer.py:166:stop] 0/230, SamplesPerSec=28.668776412175756
[2021-01-26 22:04:42,789] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 5
[2021-01-26 22:04:42,789] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 5
[2021-01-26 22:04:42,789] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 2.0 to 1.0
[2021-01-26 22:04:42,790] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 2.0 to 1.0
[2021-01-26 22:04:42,789] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 5
[2021-01-26 22:04:42,790] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 2.0, reducing to 1.0
[2021-01-26 22:04:42,790] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 2.0 to 1.0
[2021-01-26 22:04:42,790] [INFO] [timer.py:166:stop] 0/240, SamplesPerSec=28.820362728559836
[2021-01-26 22:04:47,568] [INFO] [timer.py:166:stop] 0/250, SamplesPerSec=28.65161559940446
[2021-01-26 22:04:50,860] [INFO] [timer.py:166:stop] 0/260, SamplesPerSec=28.892728584551964
[2021-01-26 22:04:55,465] [INFO] [timer.py:166:stop] 0/270, SamplesPerSec=28.77793751011365
[2021-01-26 22:04:59,154] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 6
[2021-01-26 22:04:59,154] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 6
[2021-01-26 22:04:59,154] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1.0 to 1
[2021-01-26 22:04:59,154] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1.0 to 1
[2021-01-26 22:04:59,154] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1.0, reducing to 1
[2021-01-26 22:04:59,154] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 6
[2021-01-26 22:04:59,155] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1.0 to 1
[2021-01-26 22:04:59,155] [INFO] [timer.py:166:stop] 0/280, SamplesPerSec=28.8990514991402
[2021-01-26 22:05:02,569] [INFO] [timer.py:166:stop] 0/290, SamplesPerSec=29.07988592190228
[2021-01-26 22:05:06,048] [INFO] [timer.py:166:stop] 0/300, SamplesPerSec=29.235383667331682
[2021-01-26 22:05:10,700] [INFO] [timer.py:166:stop] 0/310, SamplesPerSec=29.110768684030617
[2021-01-26 22:05:16,472] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 7
[2021-01-26 22:05:16,472] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:16,472] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 7
[2021-01-26 22:05:16,472] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:16,472] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 7
[2021-01-26 22:05:16,472] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:05:16,472] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:16,473] [INFO] [timer.py:166:stop] 0/320, SamplesPerSec=28.750032872073373
[2021-01-26 22:05:20,190] [INFO] [timer.py:166:stop] 0/330, SamplesPerSec=28.84761905966786
[2021-01-26 22:05:23,689] [INFO] [timer.py:166:stop] 0/340, SamplesPerSec=28.985128476811887
[2021-01-26 22:05:27,258] [INFO] [timer.py:166:stop] 0/350, SamplesPerSec=29.10193786516919
[2021-01-26 22:05:35,108] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 8
[2021-01-26 22:05:35,108] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 8
[2021-01-26 22:05:35,108] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:35,108] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:35,108] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 8
[2021-01-26 22:05:35,108] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:05:35,108] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:35,108] [INFO] [timer.py:166:stop] 0/360, SamplesPerSec=28.38650866746092
[2021-01-26 22:05:38,939] [INFO] [timer.py:166:stop] 0/370, SamplesPerSec=28.459984449535465
[2021-01-26 22:05:43,923] [INFO] [timer.py:166:stop] 0/380, SamplesPerSec=28.324652784159387
[2021-01-26 22:05:48,260] [INFO] [timer.py:166:stop] 0/390, SamplesPerSec=28.308406447212985
[2021-01-26 22:05:53,015] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 9
[2021-01-26 22:05:53,015] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 9
[2021-01-26 22:05:53,015] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:53,015] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:53,015] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 9
[2021-01-26 22:05:53,015] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:05:53,016] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:53,016] [INFO] [timer.py:166:stop] 0/400, SamplesPerSec=28.223310463627396
[400] Loss : 9.132528305053711
[400] Loss : 9.075061798095703
[400] Loss : 9.095661163330078
[2021-01-26 22:05:56,560] [INFO] [timer.py:166:stop] 0/410, SamplesPerSec=28.33996788409402
[2021-01-26 22:06:00,297] [INFO] [timer.py:166:stop] 0/420, SamplesPerSec=28.42059120597901
[2021-01-26 22:06:04,424] [INFO] [timer.py:166:stop] 0/430, SamplesPerSec=28.436581949396476
[2021-01-26 22:06:09,885] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 10
[2021-01-26 22:06:09,885] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:09,885] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 10
[2021-01-26 22:06:09,885] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 10
[2021-01-26 22:06:09,885] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:09,885] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:09,885] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:06:09,886] [INFO] [timer.py:166:stop] 0/440, SamplesPerSec=28.247754277670833
[2021-01-26 22:06:15,400] [INFO] [timer.py:166:stop] 0/450, SamplesPerSec=28.061949094040134
[2021-01-26 22:06:18,958] [INFO] [timer.py:166:stop] 0/460, SamplesPerSec=28.16612323346577
[2021-01-26 22:06:21,528] [INFO] [timer.py:166:stop] 0/470, SamplesPerSec=28.407862746855436
[2021-01-26 22:06:26,441] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 11
[2021-01-26 22:06:26,441] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 11
[2021-01-26 22:06:26,441] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:26,441] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:26,441] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 11
[2021-01-26 22:06:26,441] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:06:26,441] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:26,442] [INFO] [timer.py:166:stop] 0/480, SamplesPerSec=28.311940167509174
[2021-01-26 22:06:31,251] [INFO] [timer.py:166:stop] 0/490, SamplesPerSec=28.23473580319897
[2021-01-26 22:06:36,029] [INFO] [timer.py:166:stop] 0/500, SamplesPerSec=28.16535333902539
[2021-01-26 22:06:39,636] [INFO] [timer.py:166:stop] 0/510, SamplesPerSec=28.251514878827145
[2021-01-26 22:06:43,568] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 12
[2021-01-26 22:06:43,568] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:43,568] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 12
[2021-01-26 22:06:43,568] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:43,568] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:06:43,568] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 12
[2021-01-26 22:06:43,569] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:43,569] [INFO] [timer.py:166:stop] 0/520, SamplesPerSec=28.292804016474427
[2021-01-26 22:06:46,941] [INFO] [timer.py:166:stop] 0/530, SamplesPerSec=28.403807546738648
[2021-01-26 22:06:51,414] [INFO] [timer.py:166:stop] 0/540, SamplesPerSec=28.37377119408753
[2021-01-26 22:06:56,109] [INFO] [timer.py:166:stop] 0/550, SamplesPerSec=28.317642544218728
[2021-01-26 22:07:01,872] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 13
[2021-01-26 22:07:01,872] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:07:01,872] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 13
[2021-01-26 22:07:01,872] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 13
[2021-01-26 22:07:01,872] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:07:01,872] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:07:01,872] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:07:01,872] [INFO] [timer.py:166:stop] 0/560, SamplesPerSec=28.136855934257408
[2021-01-26 22:07:05,928] [INFO] [timer.py:166:stop] 0/570, SamplesPerSec=28.161827644606188
[2021-01-26 22:07:10,455] [INFO] [timer.py:166:stop] 0/580, SamplesPerSec=28.132263663068954
[2021-01-26 22:07:14,645] [INFO] [timer.py:166:stop] 0/590, SamplesPerSec=28.141332099515775
[2021-01-26 22:07:19,402] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 14
[2021-01-26 22:07:19,402] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:07:19,402] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 14
[2021-01-26 22:07:19,402] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:07:19,402] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:07:19,402] [INFO] [fused_optimizer.py:314:_update_scale] 
Grad overflow on iteration 14
[2021-01-26 22:07:19,403] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:07:19,403] [INFO] [timer.py:166:stop] 0/600, SamplesPerSec=28.087681518154927
[600] Loss : 9.065657615661621
[600] Loss : 9.009527206420898
[600] Loss : 9.22071361541748

visionscaper · Jan 26 '21

Update

I have been able to confirm that training my seq2seq model (based on GRUs) with DeepSpeed and Nvidia APEX AMP does work. In this experiment I used the following settings:

[2021-01-26 23:27:46,088] [INFO] [config.py:709:print]   amp_enabled .................. True
[2021-01-26 23:27:46,088] [INFO] [config.py:709:print]   amp_params ................... {'opt_level': 'O1'}
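In the DeepSpeed config JSON this corresponds to roughly the following (as far as I understand, extra keys in the amp section are passed through to apex.amp.initialize):

"amp": {
    "enabled": true,
    "opt_level": "O1"
}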

So: training in FP32 mode works, and training with amp.enabled = true works, but training with fp16.enabled = true fails; the loss doesn't decrease and the gradients constantly overflow.

Any help getting this to work with fp16.enabled = true is appreciated.

visionscaper · Jan 26 '21

I encountered the same situation with a customized model, and it leaves me stuck, because when you turn on Apex AMP as the fp16 backend you can't use ZeRO.

Soonhwan-Kwon · Apr 05 '21

I encountered roughly the same thing, but the problem was NaN gradients caused by layer norm; more specifically, something similar to https://github.com/pytorch/pytorch/issues/41527.

And what fairseq works around here:
https://github.com/pytorch/fairseq/blob/fcca32258c8e8bcc9f9890bf4714fa2f96b6b3e1/fairseq/modules/layer_norm.py#L38
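In other words, the workaround is to run the layer norm itself in fp32 and cast the result back, roughly like this sketch (my own illustration, not the exact fairseq code):

import torch
import torch.nn.functional as F

class Fp32LayerNorm(torch.nn.LayerNorm):
    # Compute the normalisation in fp32 to avoid inf/nan statistics under fp16,
    # then cast the result back to the input dtype.
    def forward(self, x):
        out = F.layer_norm(
            x.float(),
            self.normalized_shape,
            self.weight.float() if self.weight is not None else None,
            self.bias.float() if self.bias is not None else None,
            self.eps,
        )
        return out.type_as(x)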

talmago · Nov 07 '21

Same with Longformer.

djaym7 · Jun 16 '23