BERT4doc-Classification

OOM when batchSize=1

Open chen3082 opened this issue 5 years ago • 3 comments

Hi, thanks for your great work. While running run_pretraining.py, I keep getting OOM errors whatever the size of the matrix. I already reduced the batch size to 1, but that didn't help. I'm using a 960M GPU, tensorflow-gpu 1.10, and CUDA toolkit 9.0. Which version of TensorFlow are you using? Any thoughts on this issue? Thanks in advance.

chen3082 • Jan 30 '21 11:01

Hi, I tried run_pretraining.py recently and it works fine for me. I'm using tensorflow-gpu=1.15.0 and cudatoolkit=10.0. First, the 960M has very limited VRAM, which could be causing your issue. Second, make sure that you use the same settings when running create_pretraining_data.py and run_pretraining.py. I once set max_seq_length=512 in create_pretraining_data.py but max_seq_length=128 in run_pretraining.py. That also breaks the code, though not with an OOM error, I think.
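
To make the second point concrete, here is a minimal sketch of a pre-flight check that compares the flags passed to the two scripts. The helper names `parse_flags` and `mismatched_flags` and the example command lines are hypothetical, written only for illustration; they are not part of this repo.

```python
def parse_flags(argv):
    """Collect --name=value style flags from a command line into a dict."""
    flags = {}
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            name, value = arg[2:].split("=", 1)
            flags[name] = value
    return flags


def mismatched_flags(data_argv, train_argv, shared=("max_seq_length",)):
    """Return {flag: (data_value, train_value)} for shared flags that differ."""
    data_flags = parse_flags(data_argv)
    train_flags = parse_flags(train_argv)
    return {
        name: (data_flags.get(name), train_flags.get(name))
        for name in shared
        if data_flags.get(name) != train_flags.get(name)
    }


# Hypothetical command lines reproducing the mismatch described above.
data_cmd = ["create_pretraining_data.py", "--max_seq_length=512"]
train_cmd = ["run_pretraining.py", "--max_seq_length=128", "--train_batch_size=1"]

print(mismatched_flags(data_cmd, train_cmd))
# {'max_seq_length': ('512', '128')}
```

An empty dict means the shared settings agree. Other flags that must match can be added to `shared`.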

addiu • Feb 03 '21 10:02

Sorry for the late answer.

As noted above, the 960M may have very limited memory. A GPU with 12 GB of memory can only fit batch size=6 when max_seq_len=512, so please reduce your max sequence length or upgrade your GPU. Thank you!
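
As a rough illustration of why a small GPU can run out of memory even at batch size 1, and why shortening the sequence helps, the sketch below does some back-of-the-envelope arithmetic. It assumes a BERT-base-sized model (about 110M parameters), float32 values, and an Adam-style optimizer that keeps two extra slots per weight; none of these numbers were measured on this repo, and the function names are made up for the example.

```python
BYTES_PER_FLOAT32 = 4


def fixed_training_bytes(num_params, copies=4):
    """Memory that does not depend on batch size.

    copies=4 assumes float32 weights, gradients, and two Adam slots.
    """
    return num_params * BYTES_PER_FLOAT32 * copies


def activation_scale(batch_size, seq_len):
    """Relative size of the two activation families in a Transformer layer.

    tokens:    activations that grow with batch_size * seq_len
    attention: attention score entries per head, batch_size * seq_len ** 2
    """
    return {
        "tokens": batch_size * seq_len,
        "attention": batch_size * seq_len ** 2,
    }


# Assumption: roughly 110M parameters, the published size of BERT-base.
fixed_gib = fixed_training_bytes(110_000_000) / 2**30
print(f"fixed cost: about {fixed_gib:.2f} GiB at any batch size")

long_seq = activation_scale(batch_size=6, seq_len=512)
short_seq = activation_scale(batch_size=6, seq_len=128)
print(long_seq["tokens"] / short_seq["tokens"])        # 4.0
print(long_seq["attention"] / short_seq["attention"])  # 16.0
```

Under these assumptions the fixed cost alone is a large share of a laptop GPU's memory, and going from 512 to 128 tokens shrinks the attention score tensors by a factor of 16. Real usage also depends on the framework's allocator and workspace buffers, so treat this only as an order-of-magnitude guide.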

xuyige • Feb 19 '21 20:02

Thank you for your issue.

Could you please share more detail about your error? Also, I forgot which version of TensorFlow we used, but following the official BERT repo, I suggest changing your TensorFlow version (the official repo lists tensorflow-gpu >= 1.11.0, so 1.11 or 1.12 may solve your problem).
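
For reference, a minimal sketch of checking an installed version string against that 1.11.0 minimum. The helpers `version_tuple` and `meets_minimum` are hypothetical names written for this example; in practice the installed version string could be read from `tf.__version__`.

```python
def version_tuple(version):
    """Turn '1.15.0' into (1, 15, 0), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def meets_minimum(installed, minimum="1.11.0"):
    """True if the installed version is at least the stated minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("1.10.0"))  # False
print(meets_minimum("1.15.0"))  # True
```

Comparing tuples of integers avoids the string-comparison trap where "1.9" would sort after "1.11".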

xuyige • Feb 19 '21 20:02