Choosing the Batch Size for Single-GPU and Multi-GPU Training
Issue summary
I am training GoogleNet (v1) on the ImageNet dataset with Caffe. For single-GPU (MI25) training, I use a batch size of 128. I then moved the training to multiple MI25 cards on hipCaffe; since the total GPU memory capacity is 4x larger (16GB x4), a batch size of 512 images/batch (128 images/batch/card) should fit. In my tests, however, the batch size cannot be increased: even 192 (a multiple of 64) fails with "error: 'hipErrorMemoryAllocation' (1002)".
Since the batch size is stuck at 128, by rough math the four-card training time will be about 3~3.5x longer than on a 4x P100 system (batch_size=512).
Are there any environment parameters I should set before training that would help enlarge the batch size for multi-GPU training?
I cross-checked with one of my NVIDIA P100 x4 servers, where the batch size can be increased as more cards are used. The batch numbers mentioned above are based on my experience running the same dataset and the same network on NVIDIA P100 (16GB) and V100 (16GB) training jobs.
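(For clarity, a quick sketch of the batch-size arithmetic I describe above; the variable names are just illustrative:)
per_card_batch=128    # fits on a single 16GB MI25
num_cards=4
echo "expected total batch: $(( per_card_batch * num_cards ))"    # 512 images/batch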
Steps to reproduce
Use the bvlc_googlenet training network under the hipCaffe installation path, with the ImageNet dataset from the official ImageNet website.
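(One way to prepare the LMDBs is the stock Caffe ImageNet flow; a sketch, assuming hipCaffe keeps the BVLC helper scripts and the raw JPEGs are already downloaded:)
cd /path/to/hipCaffe                  # hipCaffe installation path
./data/ilsvrc12/get_ilsvrc_aux.sh     # fetch the ImageNet auxiliary files (labels, mean)
# Point TRAIN_DATA_ROOT / VAL_DATA_ROOT inside the script at the raw JPEGs, then build the LMDBs:
./examples/imagenet/create_imagenet.sh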
Your system configuration
Operating system: Ubuntu 16.04.3
Compiler: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
CUDA version (if applicable):
CUDNN version (if applicable):
BLAS: USE_ROCBLAS := 1
Python or MATLAB version (for pycaffe and matcaffe respectively): 2.7.12
Other: miopen-hip 1.1.4, miopengemm 1.1.5, rocm-libs 1.6.180
Server: Inventec P47
GPU: AMD MI25 x4
CPU: AMD EPYC 7601 x2
Memory: 512GB
Hi @dhzhd1,
Thanks for the feedback. If I'm understanding your comments correctly, I believe I just reproduced your setup, but I didn't hit OOM errors.
First, reboot and try re-running your workload.
If that doesn't work, can you please send the results of hipInfo? See this directory: /opt/rocm/hip/samples/1_Utils/hipInfo. Also, can you show how you are running this 4-GPU workload?
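For reference, hipInfo can be built and run directly from that samples directory, along these lines (assuming the Makefile shipped with the HIP samples is present):
cd /opt/rocm/hip/samples/1_Utils/hipInfo
make          # compiles hipInfo with hipcc
./hipInfo     # prints the properties (name, total global memory, etc.) of each visible GPU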
Thanks,
Jeff
PS - Here's an example of how you might accomplish a 4-GPU run. You'll have to point the prototxt files to wherever you have ImageNet data located.
Prepare GoogleNet
Params to be set by the user:
gpuids="0,1,2,3"
batchsize_per_gpu=128
iterations=500
model_path=models/bvlc_googlenet
Update the train_val prototxt's batch size:
train_val_prototxt=${model_path}/train_val_batch${batchsize_per_gpu}.prototxt
cp ${model_path}/train_val.prototxt ${model_path}/train_val_batch${batchsize_per_gpu}.prototxt
sed -i "s|batch_size: 32|batch_size: ${batchsize_per_gpu}|g" ./${train_val_prototxt}
Update the solver prototxt's max_iter, snapshot and train_val prototxt path:
solver_prototxt=${model_path}/solver_short.prototxt
cp ${model_path}/solver.prototxt ${solver_prototxt}
sed -i "s|max_iter: 10000000|max_iter: ${iterations}|g" ${solver_prototxt}
sed -i "s|snapshot: 40000|snapshot_after_train: 0|g" ${solver_prototxt}
sed -i "s|${model_path}/train_val.prototxt|${train_val_prototxt}|g" ${solver_prototxt}
Train with ImageNet data
Using the parameters set above, run it:
ngpus=$(( 1 + $(grep -o "," <<< "$gpuids" | wc -l) ))
train_log=./hipCaffe_nGPUs${ngpus}_batchsizePerGpu${batchsize_per_gpu}.log
train_log_sec=./hipCaffe_nGPUs${ngpus}_batchsizePerGpu${batchsize_per_gpu}_sec.log
./build/tools/caffe train --solver=${solver_prototxt} --gpu ${gpuids} 2>&1 | tee ${train_log}
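The ${train_log_sec} file defined above can then hold just the per-iteration progress lines; one possible way to fill it (the exact log wording may vary by Caffe version):
grep "Iteration" ${train_log} > ${train_log_sec}    # keep only the iteration timing/loss lines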
Hi @parallelo, thanks for your feedback. The system is currently shipped out to the SC17 show together with the MI25 cards. I will provide an update when I get the system back.