Alex Sergeev

Results: 21 comments by Alex Sergeev

@khanx169, this may be MPI specific. Can you run `horovodrun --gloo ...` to see if it helps?

@jiajinyu, can you share the exact commands you're using for NCCL and Horovod, the versions of TensorFlow and Horovod, and the git sha of the benchmarks you're using?

Thanks! I notice you use resnet50_v2 in one case, and resnet50 in another - is that intentional? To get the git sha, please use `git rev-parse HEAD` in your benchmarks...

I am running a repro now. The first issue I encountered is that with the exact flags that you specified, the benchmark fails with: `ValueError: Could not identify name of dataset....

@jiajinyu, I'm looking more into it. It's possible that https://github.com/tensorflow/benchmarks/commit/d7b68b146c82ee9b936bd196c9f1ed6d54f4a1c7#diff-eae1728a56f07ec0458d8cc14f288807R2370 should be reverted and, instead, the learning rate should be increased. I did a quick test and got much better convergence,...

Submitted PR #200, which fixes the learning rate adjustment for models that have a custom learning rate schedule, e.g. ResNet-50. ![image](https://user-images.githubusercontent.com/16640218/40872516-47fbdf88-6604-11e8-8ad7-98f5132c9456.png) cc @reedwm

I think the right way is to compute the learning rate based on `# examples / total batch size` and #200 fixes that. That said, having to sum gradients instead of...
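To illustrate the `# examples / total batch size` idea, here is a minimal sketch of an epoch-based piecewise learning rate schedule. It is not the code from PR #200 or from tf_cnn_benchmarks; the function names, the boundaries, and the assumption that total batch size equals per-worker batch size times the number of workers are all hypothetical.

```python
# Hypothetical sketch: names and values are illustrative, not taken from
# tf_cnn_benchmarks or PR #200.

def batches_per_epoch(num_examples, per_worker_batch_size, num_workers):
    """Steps needed to see every example once across all workers."""
    # Assumption: each worker processes its own batch every step, so the
    # effective (total) batch size grows with the number of workers.
    total_batch_size = per_worker_batch_size * num_workers
    return num_examples / total_batch_size


def piecewise_lr(step, num_batches_per_epoch, boundary_epochs, values):
    """Return the learning rate for `step`, with boundaries given in epochs.

    `values` must have one more entry than `boundary_epochs`.
    """
    epoch = step / num_batches_per_epoch
    for boundary, value in zip(boundary_epochs, values):
        if epoch < boundary:
            return value
    return values[-1]


if __name__ == "__main__":
    # 1000 examples, batch size 10 per worker, 4 workers: 25 steps per epoch.
    n = batches_per_epoch(1000, 10, 4)
    for step in (0, 49, 50, 100):
        print(step, piecewise_lr(step, n, [2, 4], [0.1, 0.01, 0.001]))
```

Because the boundaries are expressed in epochs, adding workers shortens the schedule in steps but leaves it unchanged in terms of examples seen, which is the property the comment above is after.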

Indeed. @reedwm, what do you think?

@lcytzk, I think it's because they all define a different learning rate schedule that may depend on the batch size. ResNet defines [learning rate warmup](https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/models/resnet_model.py#L227) which is affected by batch...
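As a rough illustration of why a warmup can depend on the batch size, the sketch below implements a generic linear warmup whose length is defined in epochs. It is an assumption-laden example and not the warmup from the linked `resnet_model.py`; `warmup_lr`, `warmup_epochs`, and the numbers used are hypothetical.

```python
# Hypothetical sketch of a linear warmup; not the tf_cnn_benchmarks code.

def warmup_lr(step, base_lr, warmup_epochs, num_examples, total_batch_size):
    """Ramp linearly from 0 to base_lr over `warmup_epochs` epochs."""
    # The warmup length in steps shrinks as the total batch size grows.
    warmup_steps = warmup_epochs * num_examples / total_batch_size
    if step >= warmup_steps:
        return base_lr
    return base_lr * step / warmup_steps


if __name__ == "__main__":
    # Same step, different total batch sizes: the warmup ends at step 50
    # with a total batch of 100, but at step 25 with a total batch of 200.
    print(warmup_lr(25, 0.1, 5, 1000, 100))
    print(warmup_lr(25, 0.1, 5, 1000, 200))
```

If a schedule like this is written in terms of steps for one specific batch size, changing the number of workers changes how many examples the warmup covers, which is one way the models can end up behaving differently.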

@reedwm, sorry for the delay. Makes sense, I will update the PR.