Choosing the Batch Size for Single-GPU and Multi-GPU Training
Issue summary
I am training GoogleNet (v1) on the ImageNet dataset with Caffe. For single-GPU (MI25) training, I use a batch size of 128. I then moved the training to multiple MI25 cards on hipCaffe; since the total GPU memory capacity is 4x larger (16GB x4), a batch size of 512 images/batch (128 images/batch/card) should fit. In my tests, however, the batch size cannot be increased: even 192 (a multiple of 64) fails with "error: 'hipErrorMemoryAllocation' (1002)".
Since the batch size is stuck at 128, by rough math the four-card training time will be about 3~3.5x longer than on a 4x P100 system (batch_size=512).
Are there any environment parameters I should set before training that would help enlarge the batch size for multi-GPU training?
I cross-checked with one of my NVIDIA P100 x4 servers, where the batch size can be increased as more cards are used. The batch numbers mentioned above are based on my experience running the same dataset and the same network on NVIDIA P100 (16GB) and V100 (16GB) training jobs.
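(For clarity, a quick sketch of the batch-size arithmetic I describe above; the variable names are just illustrative:)
per_card_batch=128    # fits on a single 16GB MI25
num_cards=4
echo "expected total batch: $(( per_card_batch * num_cards ))"    # 512 images/batch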
Steps to reproduce
Use the bvlc_googlenet training network under the hipCaffe installation path, with the ImageNet dataset from the official ImageNet website.
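(One way to prepare the LMDBs is the stock Caffe ImageNet flow; a sketch, assuming hipCaffe keeps the BVLC helper scripts and the raw JPEGs are already downloaded:)
cd /path/to/hipCaffe                  # hipCaffe installation path
./data/ilsvrc12/get_ilsvrc_aux.sh     # fetch the ImageNet auxiliary files (labels, mean)
# Point TRAIN_DATA_ROOT / VAL_DATA_ROOT inside the script at the raw JPEGs, then build the LMDBs:
./examples/imagenet/create_imagenet.sh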
Your system configuration
Operating system: Ubuntu 16.04.3
Compiler: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
CUDA version (if applicable):
CUDNN version (if applicable):
BLAS: USE_ROCBLAS := 1
Python or MATLAB version (for pycaffe and matcaffe respectively): 2.7.12
Other: miopen-hip 1.1.4, miopengemm 1.1.5, rocm-libs 1.6.180
Server: Inventec P47
GPU: AMD MI25 x4
CPU: AMD EPYC 7601 x2
Memory: 512GB
Hi @dhzhd1,
Thanks for the feedback. If I'm understanding your comments correctly, I believe I just reproduced your setup, but I didn't hit OOM errors.
First, reboot and try re-running your workload.
If that doesn't work, can you please send the results of hipInfo? See this directory: /opt/rocm/hip/samples/1_Utils/hipInfo. Also, can you show how you are running this 4-GPU workload?
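For reference, hipInfo can be built and run directly from that samples directory, along these lines (assuming the Makefile shipped with the HIP samples is present):
cd /opt/rocm/hip/samples/1_Utils/hipInfo
make          # compiles hipInfo with hipcc
./hipInfo     # prints the properties (name, total global memory, etc.) of each visible GPU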
Thanks,
Jeff
PS - Here's an example of how you might accomplish a 4-GPU run. You'll have to point the prototxt files to wherever you have ImageNet data located.
Prepare GoogleNet
Params to be set by the user:
gpuids="0,1,2,3"
batchsize_per_gpu=128
iterations=500
model_path=models/bvlc_googlenet
Update the train_val prototxt's batch size:
train_val_prototxt=${model_path}/train_val_batch${batchsize_per_gpu}.prototxt
cp ${model_path}/train_val.prototxt ${model_path}/train_val_batch${batchsize_per_gpu}.prototxt
sed -i "s|batch_size: 32|batch_size: ${batchsize_per_gpu}|g" ./${train_val_prototxt}
Update the solver prototxt's max_iter, snapshot and train_val prototxt path:
solver_prototxt=${model_path}/solver_short.prototxt
cp ${model_path}/solver.prototxt ${solver_prototxt}
sed -i "s|max_iter: 10000000|max_iter: ${iterations}|g" ${solver_prototxt}
sed -i "s|snapshot: 40000|snapshot_after_train: 0|g" ${solver_prototxt}
sed -i "s|${model_path}/train_val.prototxt|${train_val_prototxt}|g" ${solver_prototxt}
Train with ImageNet data
Using the parameters set above, run it:
ngpus=$(( 1 + $(grep -o "," <<< "$gpuids" | wc -l) ))
train_log=./hipCaffe_nGPUs${ngpus}_batchsizePerGpu${batchsize_per_gpu}.log
train_log_sec=./hipCaffe_nGPUs${ngpus}_batchsizePerGpu${batchsize_per_gpu}_sec.log
./build/tools/caffe train --solver=${solver_prototxt} --gpu ${gpuids} 2>&1 | tee ${train_log}
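The ${train_log_sec} file defined above can then hold just the per-iteration progress lines; one possible way to fill it (the exact log wording may vary by Caffe version):
grep "Iteration" ${train_log} > ${train_log_sec}    # keep only the iteration timing/loss lines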
Hi @parallelo, thanks for your feedback. The system is currently shipped out to the SC17 show together with the MI25 cards. I will provide an update when I get the system back.