plasma-python icon indicating copy to clipboard operation
plasma-python copied to clipboard

ETA calculation is inaccurate

Open felker opened this issue 6 years ago • 0 comments

Example of the current per-step (iteration) diagnostic output provided by FRNN around epoch 22 of the D3D 0D model (run on 4 V100 GPUs of Traverse):

[0] step: 0 [ETA: 468568011.02s] [0.00/1789], loss: 1.05701 [1.05701] | walltime: 5.7374 | 8.47E+02 Examples/sec | 6.04E-01 sec/batch [92.3% calc., 7.7% sync.][batch = 512 = 128*4] [lr = 7.30E-05 = 1.83E-05*4]

The ETA provided in this example is clearly inaccurate (each epoch takes around 60s). Specifically, there are two types of issues:

  1. The ETA computed in the first step of any epoch is always inaccurate.
  2. For later epochs within a session, the ETA increases nearly monotonically for many steps before starting to decrease nearly monotonically.

First step

For the first epoch in a given session, it gives a huge ETA since MPI_Model.num_so_far is zero, resulting in work_so_far of 0 being passed to: https://github.com/PPPLDeepLearning/plasma-python/blob/c82ba61e339882a5af10b1052edc0348e16119f4/plasma/models/mpi_runner.py#L613-L616 causing total_time to explode.

  • [ ] Probably should just refuse to give an ETA for the first step (or steps) of the first epoch

For later epochs within a session, it gives a minuscule ETA:

step: 0 [ETA: 0.55s] [1819.00/1789], loss: 0.98688 [0.98688] | walltime: 174.4240 | 8.93E+02 Examples/sec | 5.73E-01 sec/batch [96.1% calc., 3.9% sync.][batch = 512 = 128*4] [lr = 7.08E-05 = 1.77E-05*4]
  • [ ] I think an error was introduced when I changed the 0-based indexing of the epochs 1-2 months ago.

Later steps in later epochs

E.g. here are the ETAs for some later epoch:


ETA: 0.55s
ETA: 22.14
ETA: 27.98
ETA: 31.63
ETA: 35.88
ETA: 38.45
ETA: 34.89
ETA: 36.21
ETA: 35.35
ETA: 35.56
ETA: 36.04
ETA: 35.88
ETA: 35.33
ETA: 34.49
ETA: 34.73
ETA: 34.29
ETA: 34.13
ETA: 33.51
ETA: 33.16
…
ETA: 1.35s
ETA: 1.06s
ETA: 0.67s
ETA: 0.11s
ETA: -0.45
  • [ ] Consider using the measured runtimes of the previous epochs within this session to inform the ETA in later epochs.

felker avatar Jan 07 '20 02:01 felker