Model doesn't learn with fp16.enabled = true: [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
Hello,
I'm trying to train a sequence-to-sequence model based on RNNs. Without FP16 training (fp16.enabled = false), training works: my loss decreases. However, with FP16 training enabled (fp16.enabled = true), I continuously get gradient overflows. I have also trained the same model without DeepSpeed but with PyTorch native AMP enabled, and that works like a charm.
Note that when training in FP16 mode with PyTorch native AMP (i.e. without DeepSpeed), I also didn't use any gradient clipping, so my expectation is that training with DeepSpeed in FP16 mode should not require gradient clipping either.
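For reference, the native-AMP setup I'm comparing against looks roughly like the following. This is only a minimal, self-contained sketch: the tiny GRU model and random data are stand-ins for my actual seq2seq code.

```python
import torch
import torch.nn as nn

# Minimal stand-in for my seq2seq model; only meant to illustrate the AMP loop.
class TinyGRU(nn.Module):
    def __init__(self, vocab=100, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

device = torch.device("cuda")
model = TinyGRU().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, eps=1e-8)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling, analogous to DeepSpeed's fp16 optimizer

for step in range(10):
    x = torch.randint(0, 100, (4, 20), device=device)  # (batch, seq_len) dummy tokens
    y = torch.randint(0, 100, (4, 20), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                     # forward/loss in mixed precision
        logits = model(x)
        loss = criterion(logits.reshape(-1, 100), y.reshape(-1))
    scaler.scale(loss).backward()                       # scaled backward pass
    scaler.step(optimizer)                              # skips the step on inf/nan gradients
    scaler.update()                                     # adjusts the loss scale dynamically
    # note: no gradient clipping anywhere in this loop
```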
As a reference, after making some small changes I was also able to train the CIFAR-10 example with DeepSpeed in FP16 mode; that works just fine and the loss decreases.
I would basically like to know what could be going wrong here, and whether there is a specific setting I could use to resolve this issue.
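For context, on the DeepSpeed side I follow the standard engine pattern. Again just a sketch with a placeholder model; the config file name is an example, and the script is launched with the DeepSpeed launcher (e.g. `deepspeed train.py --deepspeed --deepspeed_config ds_config.json`):

```python
import argparse
import deepspeed
import torch
import torch.nn as nn

# Placeholder model; my real code plugs in the RNN seq2seq model instead.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))

parser = argparse.ArgumentParser()
parser = deepspeed.add_config_arguments(parser)  # adds --deepspeed, --deepspeed_config, ...
args = parser.parse_args()

engine, _, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters()
)

dtype = torch.half if engine.fp16_enabled() else torch.float
criterion = nn.MSELoss()
for step in range(10):
    x = torch.randn(4, 32, device=engine.device, dtype=dtype)  # dummy batch
    y = torch.randn(4, 32, device=engine.device, dtype=dtype)
    loss = criterion(engine(x), y)
    engine.backward(loss)  # DeepSpeed applies loss scaling here when fp16 is enabled
    engine.step()          # steps the optimizer, or skips the step on gradient overflow
```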
Furthermore, I have tried:
- multiple variations of optimizer settings (Adam, the original torch.optim.Adam, eps=1e-16, etc.; see the config sketch below)
- using different types of RNNs: torch.nn.GRU and Simple Recurrent Units (SRUs)
- turning off dropout
Unfortunately nothing seemed to help; I keep getting gradient overflows with fp16.enabled = true in the DeepSpeed config.
The one thing I haven't tried yet is the NVIDIA APEX AMP option, because the SRUs I want to work with are not compatible with that library.
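For completeness, this is roughly how I vary the settings between runs. The values mirror the config dumped in the logs below, and the comments mark the knobs I have been experimenting with (the specific alternatives are examples, not a known fix):

```python
import json

# Sketch of my DeepSpeed config; values match the JSON printed in the logs below.
ds_config = {
    "train_batch_size": 480,
    "train_micro_batch_size_per_gpu": 4,
    "steps_per_print": 10,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 2e-4,
            "betas": [0.9, 0.999],
            "eps": 1e-8,            # also tried e.g. 1e-16
            "weight_decay": 0.0,
            "adam_w_mode": True,
        },
    },
    "fp16": {
        "enabled": True,            # the switch that breaks training for me
        "loss_scale": 0,            # 0 = dynamic loss scaling
        "initial_scale_power": 6,   # 32 in the fp16-disabled run below
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "wall_clock_breakdown": False,
}

with open("ds_config.json", "w") as f:  # hypothetical file name used by the launcher
    json.dump(ds_config, f, indent=2)
```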
Below I have added two logs showing the first 15 full optimizer steps (600 micro-batch iterations with train_micro_batch_size_per_gpu=4 on 3x 1080 Ti GPUs and gradient_accumulation_steps=40, i.e. 4 × 3 × 40 = 480 samples per optimizer step).
The first log is with fp16.enabled = false, where everything works fine. The second is with fp16.enabled = true, where you can see that training doesn't work. I print the loss every 200 iterations.
fp16.enabled = false
[2021-01-26 21:56:21,925] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 3, parameter_parallel_size: 3
[2021-01-26 21:56:21,927] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 3, parameter_parallel_size: 3
[2021-01-26 21:56:21,933] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 3, parameter_parallel_size: 3
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/freddy/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.45142388343811035 seconds
[2021-01-26 21:56:22,527] [INFO] [engine.py:518:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2021-01-26 21:56:22,527] [INFO] [engine.py:521:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam (
Parameter Group 0
betas: [0.9, 0.999]
bias_correction: True
eps: 1e-08
lr: 0.0002
weight_decay: 0.0
)
[2021-01-26 21:56:22,527] [INFO] [engine.py:551:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
betas: [0.9, 0.999]
bias_correction: True
eps: 1e-08
lr: 0.0002
weight_decay: 0.0
)
[2021-01-26 21:56:22,527] [INFO] [engine.py:386:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2021-01-26 21:56:22,528] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2021-01-26 21:56:22,528] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0002], mom=[[0.9, 0.999]]
[2021-01-26 21:56:22,528] [INFO] [config.py:705:print] DeepSpeedEngine configuration:
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print] activation_checkpointing_config <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7f310d079310>
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print] allreduce_always_fp32 ........ False
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print] amp_enabled .................. False
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print] amp_params ................... False
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print] disable_allgather ............ False
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print] dump_state ................... False
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print] dynamic_loss_scale_args ...... None
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print] elasticity_enabled ........... False
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print] fp16_enabled ................. False
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print] global_rank .................. 0
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print] gradient_accumulation_steps .. 40
[2021-01-26 21:56:22,528] [INFO] [config.py:709:print] gradient_clipping ............ 0.0
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] gradient_predivide_factor .... 1.0
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] initial_dynamic_scale ........ 4294967296
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] loss_scale ................... 0
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] memory_breakdown ............. False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] optimizer_legacy_fusion ...... False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] optimizer_name ............... adam
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] optimizer_params ............. {'lr': 0.0002, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0, 'adam_w_mode': True}
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] pld_enabled .................. False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] pld_params ................... False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] prescale_gradients ........... False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] scheduler_name ............... None
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] scheduler_params ............. None
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] sparse_attention ............. None
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] sparse_gradients_enabled ..... False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] steps_per_print .............. 10
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] tensorboard_enabled .......... False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] tensorboard_job_name ......... DeepSpeedJobName
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] tensorboard_output_path ......
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] train_batch_size ............. 480
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] train_micro_batch_size_per_gpu 4
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] wall_clock_breakdown ......... False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] world_size ................... 3
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] zero_allow_untested_optimizer False
[2021-01-26 21:56:22,529] [INFO] [config.py:709:print] zero_config .................. {
"allgather_bucket_size": 500000000,
"allgather_partitions": true,
"contiguous_gradients": false,
"cpu_offload": false,
"elastic_checkpoint": true,
"load_from_fp32_weights": true,
"overlap_comm": false,
"reduce_bucket_size": 500000000,
"reduce_scatter": true,
"stage": 0
}
[2021-01-26 21:56:22,530] [INFO] [config.py:709:print] zero_enabled ................. False
[2021-01-26 21:56:22,530] [INFO] [config.py:709:print] zero_optimization_stage ...... 0
[2021-01-26 21:56:22,530] [INFO] [config.py:715:print] json = {
"fp16":{
"enabled":false,
"hysteresis":2,
"initial_scale_power":32,
"loss_scale":0,
"loss_scale_window":1000,
"min_loss_scale":1
},
"optimizer":{
"params":{
"adam_w_mode":true,
"betas":[
0.9,
0.999
],
"eps":1e-08,
"lr":0.0002,
"weight_decay":0.0
},
"type":"Adam"
},
"steps_per_print":10,
"train_batch_size":480,
"train_micro_batch_size_per_gpu":4,
"wall_clock_breakdown":false
}
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Loading extension module fused_adam...
Time to load fused_adam op: 0.502511739730835 seconds
[2021-01-26 21:56:22,579] [INFO] [engine.py:551:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
betas: [0.9, 0.999]
bias_correction: True
eps: 1e-08
lr: 0.0002
weight_decay: 0.0
)
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Loading extension module fused_adam...
Time to load fused_adam op: 0.5021593570709229 seconds
[2021-01-26 21:56:22,582] [INFO] [engine.py:551:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
betas: [0.9, 0.999]
bias_correction: True
eps: 1e-08
lr: 0.0002
weight_decay: 0.0
)
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/freddy/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.46016359329223633 seconds
Loading extension module utils...
Time to load utils op: 0.5020427703857422 seconds
Loading extension module utils...
Time to load utils op: 0.5018594264984131 seconds
[0] Loss : 8.999998092651367
[0] Loss : 9.15445327758789
[0] Loss : 9.038656234741211
[2021-01-26 21:56:27,111] [INFO] [timer.py:166:stop] 0/10, SamplesPerSec=41.825688446011114
[2021-01-26 21:56:29,632] [INFO] [timer.py:166:stop] 0/20, SamplesPerSec=44.92055533780422
[2021-01-26 21:56:32,695] [INFO] [timer.py:166:stop] 0/30, SamplesPerSec=42.71682476866671
[2021-01-26 21:56:36,806] [INFO] [timer.py:166:stop] 0/40, SamplesPerSec=38.09388969748721
[2021-01-26 21:56:39,871] [INFO] [timer.py:166:stop] 0/50, SamplesPerSec=38.32450114890898
[2021-01-26 21:56:43,123] [INFO] [timer.py:166:stop] 0/60, SamplesPerSec=38.08162107383462
[2021-01-26 21:56:46,116] [INFO] [timer.py:166:stop] 0/70, SamplesPerSec=38.37604235762581
[2021-01-26 21:56:50,138] [INFO] [timer.py:166:stop] 0/80, SamplesPerSec=37.02720955340785
[2021-01-26 21:56:53,651] [INFO] [timer.py:166:stop] 0/90, SamplesPerSec=36.68486886253886
[2021-01-26 21:56:56,666] [INFO] [timer.py:166:stop] 0/100, SamplesPerSec=36.98690788337102
[2021-01-26 21:56:59,704] [INFO] [timer.py:166:stop] 0/110, SamplesPerSec=37.2128951188848
[2021-01-26 21:57:03,404] [INFO] [timer.py:166:stop] 0/120, SamplesPerSec=36.75958540038136
[2021-01-26 21:57:06,613] [INFO] [timer.py:166:stop] 0/130, SamplesPerSec=36.813475069881996
[2021-01-26 21:57:09,089] [INFO] [timer.py:166:stop] 0/140, SamplesPerSec=37.47199109015046
[2021-01-26 21:57:12,203] [INFO] [timer.py:166:stop] 0/150, SamplesPerSec=37.547663000349296
[2021-01-26 21:57:15,901] [INFO] [timer.py:166:stop] 0/160, SamplesPerSec=37.18199497187463
[2021-01-26 21:57:19,326] [INFO] [timer.py:166:stop] 0/170, SamplesPerSec=37.050755020731636
[2021-01-26 21:57:22,658] [INFO] [timer.py:166:stop] 0/180, SamplesPerSec=36.99534049157816
[2021-01-26 21:57:26,399] [INFO] [timer.py:166:stop] 0/190, SamplesPerSec=36.699293495681665
[2021-01-26 21:57:30,106] [INFO] [timer.py:166:stop] 0/200, SamplesPerSec=36.45669125272261
[200] Loss : 6.800840377807617
[200] Loss : 6.5157470703125
[200] Loss : 6.383593559265137
[2021-01-26 21:57:33,361] [INFO] [timer.py:166:stop] 0/210, SamplesPerSec=36.47919520372467
[2021-01-26 21:57:36,581] [INFO] [timer.py:166:stop] 0/220, SamplesPerSec=36.51791265257378
[2021-01-26 21:57:39,454] [INFO] [timer.py:166:stop] 0/230, SamplesPerSec=36.72364568413719
[2021-01-26 21:57:43,891] [INFO] [timer.py:166:stop] 0/240, SamplesPerSec=36.18249218460241
[2021-01-26 21:57:47,133] [INFO] [timer.py:166:stop] 0/250, SamplesPerSec=36.217873588699675
[2021-01-26 21:57:49,918] [INFO] [timer.py:166:stop] 0/260, SamplesPerSec=36.44586303725491
[2021-01-26 21:57:53,301] [INFO] [timer.py:166:stop] 0/270, SamplesPerSec=36.41131573204769
[2021-01-26 21:57:56,770] [INFO] [timer.py:166:stop] 0/280, SamplesPerSec=36.34533813259211
[2021-01-26 21:57:59,588] [INFO] [timer.py:166:stop] 0/290, SamplesPerSec=36.53348492078434
[2021-01-26 21:58:02,475] [INFO] [timer.py:166:stop] 0/300, SamplesPerSec=36.68489320878963
[2021-01-26 21:58:05,754] [INFO] [timer.py:166:stop] 0/310, SamplesPerSec=36.68418674629942
[2021-01-26 21:58:11,375] [INFO] [timer.py:166:stop] 0/320, SamplesPerSec=35.87610928521094
[2021-01-26 21:58:14,369] [INFO] [timer.py:166:stop] 0/330, SamplesPerSec=35.99301471144916
[2021-01-26 21:58:17,638] [INFO] [timer.py:166:stop] 0/340, SamplesPerSec=36.015777056006975
[2021-01-26 21:58:19,991] [INFO] [timer.py:166:stop] 0/350, SamplesPerSec=36.32426130453177
[2021-01-26 21:58:24,502] [INFO] [timer.py:166:stop] 0/360, SamplesPerSec=35.95910636974815
[2021-01-26 21:58:27,445] [INFO] [timer.py:166:stop] 0/370, SamplesPerSec=36.07671240791601
[2021-01-26 21:58:30,814] [INFO] [timer.py:166:stop] 0/380, SamplesPerSec=36.06615135814506
[2021-01-26 21:58:33,900] [INFO] [timer.py:166:stop] 0/390, SamplesPerSec=36.13561707101366
[2021-01-26 21:58:37,243] [INFO] [logging.py:60:log_dist] [Rank 0] step=10, skipped=0, lr=[0.0002], mom=[[0.9, 0.999]]
[2021-01-26 21:58:37,745] [INFO] [timer.py:166:stop] 0/400, SamplesPerSec=35.994666477704214
[400] Loss : 5.665283203125
[400] Loss : 5.826916694641113
[400] Loss : 6.230928897857666
[2021-01-26 21:58:41,046] [INFO] [timer.py:166:stop] 0/410, SamplesPerSec=36.0048993901217
[2021-01-26 21:58:44,993] [INFO] [timer.py:166:stop] 0/420, SamplesPerSec=35.848783717439545
[2021-01-26 21:58:47,882] [INFO] [timer.py:166:stop] 0/430, SamplesPerSec=35.96537794509143
[2021-01-26 21:58:50,543] [INFO] [timer.py:166:stop] 0/440, SamplesPerSec=36.13387347202367
[2021-01-26 21:58:54,031] [INFO] [timer.py:166:stop] 0/450, SamplesPerSec=36.094787379021554
[2021-01-26 21:58:57,270] [INFO] [timer.py:166:stop] 0/460, SamplesPerSec=36.11669077796843
[2021-01-26 21:59:01,135] [INFO] [timer.py:166:stop] 0/470, SamplesPerSec=35.99268381574188
[2021-01-26 21:59:04,756] [INFO] [timer.py:166:stop] 0/480, SamplesPerSec=35.929526864717516
[2021-01-26 21:59:08,056] [INFO] [timer.py:166:stop] 0/490, SamplesPerSec=35.93959961536687
[2021-01-26 21:59:11,585] [INFO] [timer.py:166:stop] 0/500, SamplesPerSec=35.90001015391583
[2021-01-26 21:59:14,563] [INFO] [timer.py:166:stop] 0/510, SamplesPerSec=35.97882584862603
[2021-01-26 21:59:18,774] [INFO] [timer.py:166:stop] 0/520, SamplesPerSec=35.798695039910946
[2021-01-26 21:59:21,891] [INFO] [timer.py:166:stop] 0/530, SamplesPerSec=35.847454337742384
[2021-01-26 21:59:25,483] [INFO] [timer.py:166:stop] 0/540, SamplesPerSec=35.80024361048666
[2021-01-26 21:59:28,410] [INFO] [timer.py:166:stop] 0/550, SamplesPerSec=35.88450805809668
[2021-01-26 21:59:31,547] [INFO] [timer.py:166:stop] 0/560, SamplesPerSec=35.92561504682371
[2021-01-26 21:59:34,450] [INFO] [timer.py:166:stop] 0/570, SamplesPerSec=36.00956924962096
[2021-01-26 21:59:37,708] [INFO] [timer.py:166:stop] 0/580, SamplesPerSec=36.02444682541163
[2021-01-26 21:59:40,886] [INFO] [timer.py:166:stop] 0/590, SamplesPerSec=36.0539303626204
[2021-01-26 21:59:45,092] [INFO] [timer.py:166:stop] 0/600, SamplesPerSec=35.89674910436057
[600] Loss : 4.872910976409912
[600] Loss : 5.67504358291626
[600] Loss : 4.949432373046875
fp16.enabled = true
[2021-01-26 22:03:00,961] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 3, parameter_parallel_size: 3
[2021-01-26 22:03:00,974] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 3, parameter_parallel_size: 3
[2021-01-26 22:03:00,983] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 3, parameter_parallel_size: 3
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/freddy/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.4628634452819824 seconds
[2021-01-26 22:03:01,585] [INFO] [engine.py:603:_configure_fp16_optimizer] Creating fp16 optimizer with dynamic loss scale
[2021-01-26 22:03:01,595] [INFO] [engine.py:551:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
betas: [0.9, 0.999]
bias_correction: True
eps: 1e-08
lr: 0.0002
step: 1
weight_decay: 0.0
)
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Loading extension module fused_adam...
Time to load fused_adam op: 0.5023303031921387 seconds
[2021-01-26 22:03:01,624] [INFO] [engine.py:518:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2021-01-26 22:03:01,625] [INFO] [engine.py:521:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam (
Parameter Group 0
betas: [0.9, 0.999]
bias_correction: True
eps: 1e-08
lr: 0.0002
weight_decay: 0.0
)
[2021-01-26 22:03:01,625] [INFO] [engine.py:603:_configure_fp16_optimizer] Creating fp16 optimizer with dynamic loss scale
Loading extension module fused_adam...
Time to load fused_adam op: 0.5022878646850586 seconds
[2021-01-26 22:03:01,630] [INFO] [engine.py:603:_configure_fp16_optimizer] Creating fp16 optimizer with dynamic loss scale
[2021-01-26 22:03:01,630] [INFO] [engine.py:551:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
betas: [0.9, 0.999]
bias_correction: True
eps: 1e-08
lr: 0.0002
step: 1
weight_decay: 0.0
)
[2021-01-26 22:03:01,631] [INFO] [engine.py:386:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2021-01-26 22:03:01,631] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2021-01-26 22:03:01,631] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0002], mom=[[0.9, 0.999]]
[2021-01-26 22:03:01,631] [INFO] [config.py:705:print] DeepSpeedEngine configuration:
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] activation_checkpointing_config <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7f5d487266d0>
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] allreduce_always_fp32 ........ False
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] amp_enabled .................. False
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] amp_params ................... False
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] disable_allgather ............ False
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] dump_state ................... False
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] dynamic_loss_scale_args ...... {'init_scale': 64, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] elasticity_enabled ........... False
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] fp16_enabled ................. True
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] global_rank .................. 0
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] gradient_accumulation_steps .. 40
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] gradient_clipping ............ 0.0
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] gradient_predivide_factor .... 1.0
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] initial_dynamic_scale ........ 64
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] loss_scale ................... 0
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] memory_breakdown ............. False
[2021-01-26 22:03:01,631] [INFO] [config.py:709:print] optimizer_legacy_fusion ...... False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] optimizer_name ............... adam
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] optimizer_params ............. {'lr': 0.0002, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0, 'adam_w_mode': True}
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] pld_enabled .................. False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] pld_params ................... False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] prescale_gradients ........... False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] scheduler_name ............... None
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] scheduler_params ............. None
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] sparse_attention ............. None
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] sparse_gradients_enabled ..... False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] steps_per_print .............. 10
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] tensorboard_enabled .......... False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] tensorboard_job_name ......... DeepSpeedJobName
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] tensorboard_output_path ......
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] train_batch_size ............. 480
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] train_micro_batch_size_per_gpu 4
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] wall_clock_breakdown ......... False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] world_size ................... 3
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] zero_allow_untested_optimizer False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] zero_config .................. {
"allgather_bucket_size": 500000000,
"allgather_partitions": true,
"contiguous_gradients": false,
"cpu_offload": false,
"elastic_checkpoint": true,
"load_from_fp32_weights": true,
"overlap_comm": false,
"reduce_bucket_size": 500000000,
"reduce_scatter": true,
"stage": 0
}
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] zero_enabled ................. False
[2021-01-26 22:03:01,632] [INFO] [config.py:709:print] zero_optimization_stage ...... 0
[2021-01-26 22:03:01,633] [INFO] [config.py:715:print] json = {
"fp16":{
"enabled":true,
"hysteresis":2,
"initial_scale_power":6,
"loss_scale":0,
"loss_scale_window":1000,
"min_loss_scale":1
},
"optimizer":{
"params":{
"adam_w_mode":true,
"betas":[
0.9,
0.999
],
"eps":1e-08,
"lr":0.0002,
"weight_decay":0.0
},
"type":"Adam"
},
"steps_per_print":10,
"train_batch_size":480,
"train_micro_batch_size_per_gpu":4,
"wall_clock_breakdown":false
}
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
[2021-01-26 22:03:01,636] [INFO] [engine.py:551:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
betas: [0.9, 0.999]
bias_correction: True
eps: 1e-08
lr: 0.0002
step: 1
weight_decay: 0.0
)
Using /home/freddy/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/freddy/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.46137571334838867 seconds
Loading extension module utils...
Time to load utils op: 0.5019605159759521 seconds
Loading extension module utils...
Time to load utils op: 0.5017917156219482 seconds
[0] Loss : 9.222114562988281
[0] Loss : 9.164878845214844
[0] Loss : 9.16104507446289
[2021-01-26 22:03:06,380] [INFO] [timer.py:166:stop] 0/10, SamplesPerSec=33.97458149325142
[2021-01-26 22:03:10,020] [INFO] [timer.py:166:stop] 0/20, SamplesPerSec=33.436643374152254
[2021-01-26 22:03:14,097] [INFO] [timer.py:166:stop] 0/30, SamplesPerSec=31.906658939683734
[2021-01-26 22:03:18,711] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 0
[2021-01-26 22:03:18,711] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 0
[2021-01-26 22:03:18,711] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 64 to 32.0
[2021-01-26 22:03:18,711] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 64 to 32.0
[2021-01-26 22:03:18,711] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 64, reducing to 32.0
[2021-01-26 22:03:18,711] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 0
[2021-01-26 22:03:18,711] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 64 to 32.0
[2021-01-26 22:03:18,712] [INFO] [timer.py:166:stop] 0/40, SamplesPerSec=30.12004606159127
[2021-01-26 22:03:21,767] [INFO] [timer.py:166:stop] 0/50, SamplesPerSec=31.665849924273356
[2021-01-26 22:03:25,613] [INFO] [timer.py:166:stop] 0/60, SamplesPerSec=31.59344673408403
[2021-01-26 22:03:29,348] [INFO] [timer.py:166:stop] 0/70, SamplesPerSec=31.677591534884105
[2021-01-26 22:03:35,362] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 1
[2021-01-26 22:03:35,362] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 32.0 to 16.0
[2021-01-26 22:03:35,362] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 1
[2021-01-26 22:03:35,362] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 1
[2021-01-26 22:03:35,362] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 32.0 to 16.0
[2021-01-26 22:03:35,362] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 32.0 to 16.0
[2021-01-26 22:03:35,362] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32.0, reducing to 16.0
[2021-01-26 22:03:35,363] [INFO] [timer.py:166:stop] 0/80, SamplesPerSec=29.463939879601927
[2021-01-26 22:03:39,386] [INFO] [timer.py:166:stop] 0/90, SamplesPerSec=29.50914591740365
[2021-01-26 22:03:42,581] [INFO] [timer.py:166:stop] 0/100, SamplesPerSec=30.174574428668237
[2021-01-26 22:03:46,681] [INFO] [timer.py:166:stop] 0/110, SamplesPerSec=30.092707542682465
[2021-01-26 22:03:52,492] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 2
[2021-01-26 22:03:52,492] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 16.0 to 8.0
[2021-01-26 22:03:52,492] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 2
[2021-01-26 22:03:52,492] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 16.0 to 8.0
[2021-01-26 22:03:52,492] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16.0, reducing to 8.0
[2021-01-26 22:03:52,492] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 2
[2021-01-26 22:03:52,492] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 16.0 to 8.0
[2021-01-26 22:03:52,493] [INFO] [timer.py:166:stop] 0/120, SamplesPerSec=28.97346976173917
[2021-01-26 22:03:55,650] [INFO] [timer.py:166:stop] 0/130, SamplesPerSec=29.524899679768
[2021-01-26 22:03:59,609] [INFO] [timer.py:166:stop] 0/140, SamplesPerSec=29.58340685172515
[2021-01-26 22:04:03,811] [INFO] [timer.py:166:stop] 0/150, SamplesPerSec=29.515393227502443
[2021-01-26 22:04:09,190] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 3
[2021-01-26 22:04:09,190] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 8.0 to 4.0
[2021-01-26 22:04:09,190] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 3
[2021-01-26 22:04:09,190] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 3
[2021-01-26 22:04:09,191] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 8.0 to 4.0
[2021-01-26 22:04:09,191] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 8.0 to 4.0
[2021-01-26 22:04:09,191] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8.0, reducing to 4.0
[2021-01-26 22:04:09,191] [INFO] [timer.py:166:stop] 0/160, SamplesPerSec=28.925970601141177
[2021-01-26 22:04:13,267] [INFO] [timer.py:166:stop] 0/170, SamplesPerSec=28.958373700812057
[2021-01-26 22:04:16,777] [INFO] [timer.py:166:stop] 0/180, SamplesPerSec=29.21181410836682
[2021-01-26 22:04:20,485] [INFO] [timer.py:166:stop] 0/190, SamplesPerSec=29.366318279309837
[2021-01-26 22:04:26,297] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 4
[2021-01-26 22:04:26,297] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 4.0 to 2.0
[2021-01-26 22:04:26,297] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 4
[2021-01-26 22:04:26,297] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 4
[2021-01-26 22:04:26,297] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 4.0 to 2.0
[2021-01-26 22:04:26,297] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 4.0 to 2.0
[2021-01-26 22:04:26,297] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 4.0, reducing to 2.0
[2021-01-26 22:04:26,297] [INFO] [timer.py:166:stop] 0/200, SamplesPerSec=28.755044316105973
[200] Loss : 9.228988647460938
[200] Loss : 9.030767440795898
[200] Loss : 9.148602485656738
[2021-01-26 22:04:30,651] [INFO] [timer.py:166:stop] 0/210, SamplesPerSec=28.697343434387207
[2021-01-26 22:04:34,646] [INFO] [timer.py:166:stop] 0/220, SamplesPerSec=28.758270700378173
[2021-01-26 22:04:39,122] [INFO] [timer.py:166:stop] 0/230, SamplesPerSec=28.668776412175756
[2021-01-26 22:04:42,789] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 5
[2021-01-26 22:04:42,789] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 5
[2021-01-26 22:04:42,789] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 2.0 to 1.0
[2021-01-26 22:04:42,790] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 2.0 to 1.0
[2021-01-26 22:04:42,789] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 5
[2021-01-26 22:04:42,790] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 2.0, reducing to 1.0
[2021-01-26 22:04:42,790] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 2.0 to 1.0
[2021-01-26 22:04:42,790] [INFO] [timer.py:166:stop] 0/240, SamplesPerSec=28.820362728559836
[2021-01-26 22:04:47,568] [INFO] [timer.py:166:stop] 0/250, SamplesPerSec=28.65161559940446
[2021-01-26 22:04:50,860] [INFO] [timer.py:166:stop] 0/260, SamplesPerSec=28.892728584551964
[2021-01-26 22:04:55,465] [INFO] [timer.py:166:stop] 0/270, SamplesPerSec=28.77793751011365
[2021-01-26 22:04:59,154] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 6
[2021-01-26 22:04:59,154] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 6
[2021-01-26 22:04:59,154] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1.0 to 1
[2021-01-26 22:04:59,154] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1.0 to 1
[2021-01-26 22:04:59,154] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1.0, reducing to 1
[2021-01-26 22:04:59,154] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 6
[2021-01-26 22:04:59,155] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1.0 to 1
[2021-01-26 22:04:59,155] [INFO] [timer.py:166:stop] 0/280, SamplesPerSec=28.8990514991402
[2021-01-26 22:05:02,569] [INFO] [timer.py:166:stop] 0/290, SamplesPerSec=29.07988592190228
[2021-01-26 22:05:06,048] [INFO] [timer.py:166:stop] 0/300, SamplesPerSec=29.235383667331682
[2021-01-26 22:05:10,700] [INFO] [timer.py:166:stop] 0/310, SamplesPerSec=29.110768684030617
[2021-01-26 22:05:16,472] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 7
[2021-01-26 22:05:16,472] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:16,472] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 7
[2021-01-26 22:05:16,472] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:16,472] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 7
[2021-01-26 22:05:16,472] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:05:16,472] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:16,473] [INFO] [timer.py:166:stop] 0/320, SamplesPerSec=28.750032872073373
[2021-01-26 22:05:20,190] [INFO] [timer.py:166:stop] 0/330, SamplesPerSec=28.84761905966786
[2021-01-26 22:05:23,689] [INFO] [timer.py:166:stop] 0/340, SamplesPerSec=28.985128476811887
[2021-01-26 22:05:27,258] [INFO] [timer.py:166:stop] 0/350, SamplesPerSec=29.10193786516919
[2021-01-26 22:05:35,108] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 8
[2021-01-26 22:05:35,108] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 8
[2021-01-26 22:05:35,108] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:35,108] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:35,108] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 8
[2021-01-26 22:05:35,108] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:05:35,108] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:35,108] [INFO] [timer.py:166:stop] 0/360, SamplesPerSec=28.38650866746092
[2021-01-26 22:05:38,939] [INFO] [timer.py:166:stop] 0/370, SamplesPerSec=28.459984449535465
[2021-01-26 22:05:43,923] [INFO] [timer.py:166:stop] 0/380, SamplesPerSec=28.324652784159387
[2021-01-26 22:05:48,260] [INFO] [timer.py:166:stop] 0/390, SamplesPerSec=28.308406447212985
[2021-01-26 22:05:53,015] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 9
[2021-01-26 22:05:53,015] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 9
[2021-01-26 22:05:53,015] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:53,015] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:53,015] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 9
[2021-01-26 22:05:53,015] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:05:53,016] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:05:53,016] [INFO] [timer.py:166:stop] 0/400, SamplesPerSec=28.223310463627396
[400] Loss : 9.132528305053711
[400] Loss : 9.075061798095703
[400] Loss : 9.095661163330078
[2021-01-26 22:05:56,560] [INFO] [timer.py:166:stop] 0/410, SamplesPerSec=28.33996788409402
[2021-01-26 22:06:00,297] [INFO] [timer.py:166:stop] 0/420, SamplesPerSec=28.42059120597901
[2021-01-26 22:06:04,424] [INFO] [timer.py:166:stop] 0/430, SamplesPerSec=28.436581949396476
[2021-01-26 22:06:09,885] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 10
[2021-01-26 22:06:09,885] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:09,885] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 10
[2021-01-26 22:06:09,885] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 10
[2021-01-26 22:06:09,885] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:09,885] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:09,885] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:06:09,886] [INFO] [timer.py:166:stop] 0/440, SamplesPerSec=28.247754277670833
[2021-01-26 22:06:15,400] [INFO] [timer.py:166:stop] 0/450, SamplesPerSec=28.061949094040134
[2021-01-26 22:06:18,958] [INFO] [timer.py:166:stop] 0/460, SamplesPerSec=28.16612323346577
[2021-01-26 22:06:21,528] [INFO] [timer.py:166:stop] 0/470, SamplesPerSec=28.407862746855436
[2021-01-26 22:06:26,441] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 11
[2021-01-26 22:06:26,441] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 11
[2021-01-26 22:06:26,441] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:26,441] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:26,441] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 11
[2021-01-26 22:06:26,441] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:06:26,441] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:26,442] [INFO] [timer.py:166:stop] 0/480, SamplesPerSec=28.311940167509174
[2021-01-26 22:06:31,251] [INFO] [timer.py:166:stop] 0/490, SamplesPerSec=28.23473580319897
[2021-01-26 22:06:36,029] [INFO] [timer.py:166:stop] 0/500, SamplesPerSec=28.16535333902539
[2021-01-26 22:06:39,636] [INFO] [timer.py:166:stop] 0/510, SamplesPerSec=28.251514878827145
[2021-01-26 22:06:43,568] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 12
[2021-01-26 22:06:43,568] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:43,568] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 12
[2021-01-26 22:06:43,568] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:43,568] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:06:43,568] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 12
[2021-01-26 22:06:43,569] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:06:43,569] [INFO] [timer.py:166:stop] 0/520, SamplesPerSec=28.292804016474427
[2021-01-26 22:06:46,941] [INFO] [timer.py:166:stop] 0/530, SamplesPerSec=28.403807546738648
[2021-01-26 22:06:51,414] [INFO] [timer.py:166:stop] 0/540, SamplesPerSec=28.37377119408753
[2021-01-26 22:06:56,109] [INFO] [timer.py:166:stop] 0/550, SamplesPerSec=28.317642544218728
[2021-01-26 22:07:01,872] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 13
[2021-01-26 22:07:01,872] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:07:01,872] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 13
[2021-01-26 22:07:01,872] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 13
[2021-01-26 22:07:01,872] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:07:01,872] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:07:01,872] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:07:01,872] [INFO] [timer.py:166:stop] 0/560, SamplesPerSec=28.136855934257408
[2021-01-26 22:07:05,928] [INFO] [timer.py:166:stop] 0/570, SamplesPerSec=28.161827644606188
[2021-01-26 22:07:10,455] [INFO] [timer.py:166:stop] 0/580, SamplesPerSec=28.132263663068954
[2021-01-26 22:07:14,645] [INFO] [timer.py:166:stop] 0/590, SamplesPerSec=28.141332099515775
[2021-01-26 22:07:19,402] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 14
[2021-01-26 22:07:19,402] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:07:19,402] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 14
[2021-01-26 22:07:19,402] [INFO] [logging.py:60:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2021-01-26 22:07:19,402] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:07:19,402] [INFO] [fused_optimizer.py:314:_update_scale]
Grad overflow on iteration 14
[2021-01-26 22:07:19,403] [INFO] [fused_optimizer.py:316:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-01-26 22:07:19,403] [INFO] [timer.py:166:stop] 0/600, SamplesPerSec=28.087681518154927
[600] Loss : 9.065657615661621
[600] Loss : 9.009527206420898
[600] Loss : 9.22071361541748
Update
I have been able to confirm that training my seq2seq model (based on GRUs) with DeepSpeed and NVIDIA APEX AMP does work. In this experiment I used the following settings:
[2021-01-26 23:27:46,088] [INFO] [config.py:709:print] amp_enabled .................. True
[2021-01-26 23:27:46,088] [INFO] [config.py:709:print] amp_params ................... {'opt_level': 'O1'}
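The only change relative to the fp16 runs above was swapping the fp16 block for an amp block in the config, roughly like this (a sketch of the relevant fragment only):

```python
import json

# Config fragment for the APEX AMP run; everything else stays as logged above.
amp_patch = {
    "amp": {
        "enabled": True,     # use NVIDIA APEX AMP instead of DeepSpeed's fp16 optimizer
        "opt_level": "O1",   # passed through to apex.amp.initialize
    }
}
print(json.dumps(amp_patch, indent=2))
```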
Thus training in FP32 mode works, training with amp.enabled = true works, but training with fp16.enabled = true fails; the loss doesn't decrease and gradients overflow constantly.
Any help is appreciated to get this to work with fp16.enabled = true.
I encountered the same situation with a customized model, and it leaves me stuck, because when you turn on APEX AMP as the FP16 backend you can't use ZeRO.
I encountered roughly the same, but the problem was NaN gradients caused by layer norm, or more specifically something similar to https://github.com/pytorch/pytorch/issues/41527,
and with fairseq:
https://github.com/pytorch/fairseq/blob/fcca32258c8e8bcc9f9890bf4714fa2f96b6b3e1/fairseq/modules/layer_norm.py#L38
Same with Longformer.
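If it helps, the linked fairseq module works around this by keeping LayerNorm in fp32 and casting back afterwards; a minimal sketch of that pattern (for reference, not a confirmed fix for the RNN issue above) looks like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Run LayerNorm in fp32 even when the surrounding model is fp16, then cast back.
# This mirrors the pattern used in the fairseq layer_norm module linked above.
class Fp32LayerNorm(nn.LayerNorm):
    def forward(self, x):
        out = F.layer_norm(
            x.float(),                                             # upcast activations to fp32
            self.normalized_shape,
            self.weight.float() if self.weight is not None else None,
            self.bias.float() if self.bias is not None else None,
            self.eps,
        )
        return out.type_as(x)                                      # cast back to the input dtype
```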