
Establish a consistent memory allocation strategy for TensorFlow

Open EvenOldridge opened this issue 3 years ago • 13 comments

Currently, memory allocation for TF in our examples is inconsistent and causes issues.

  • [ ] Update NVTabular TF examples
  • [ ] Update Merlin models config_tensorflow() function

Let's figure out best practices and make it consistent.

EvenOldridge avatar Apr 21 '22 16:04 EvenOldridge

I think we discussed that TensorFlow 2.8 will set cuda_malloc_async by default.

@rnyak shared that if we remove os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async", TensorFlow 2.8 will consume the full GPU memory.

I reproduced the behavior in our merlin-tensorflow-training:22.04 container.

  • Set nothing: 30/40GB (I think it should be 38 GB out of 40 GB, need to double check)
  • Set cuda_malloc_async: 0.5/40GB
  • Set TF_MEMORY_ALLOCATION=0.5: 38/40GB
  • Use configure_tensorflow: 21/40GB (default behavior is 50%)
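
For context on what the TF_MEMORY_ALLOCATION numbers above mean, here is a small sketch of how we understand the NVTabular/Merlin loader utilities to interpret that variable (values <= 1 as a fraction of total GPU memory, larger values as an absolute amount in MB). The helper name and the exact parsing rules are assumptions for illustration, not the actual library code:

```python
import os

def parse_tf_memory_allocation(total_bytes, default=0.5):
    # Hypothetical mirror of the loader's parsing: a value <= 1 is
    # treated as a fraction of total GPU memory, anything larger as MB.
    raw = float(os.environ.get("TF_MEMORY_ALLOCATION", default))
    if raw <= 1:
        return int(raw * total_bytes)   # fraction of the GPU
    return int(raw * 1024 ** 2)         # absolute size in MB

os.environ["TF_MEMORY_ALLOCATION"] = "0.5"
print(parse_tf_memory_allocation(40 * 1024 ** 3))  # half of a 40 GiB GPU
```

So "Set TF_MEMORY_ALLOCATION=0.5" asks for half the card up front, which is why the observed usage differs from the cuda_malloc_async case, where memory is only allocated as needed.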

@jperez999 I think you mentioned that TF2.8 will set it by default. Do you have any reference? Do you observe the same behavior?

What should we use for memory allocation in our examples? I think the best user experience is cuda_malloc_async. Should we add it explicitly to all examples, with a note that it is only available for TF 2.8, and add a reference to Troubleshooting?

In Troubleshooting, we could add a section for older TF versions.

Nothing:

import tensorflow as tf
print(tf.__version__)
tf.constant([0,1,2])

cuda_malloc_async:

import os
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
import tensorflow as tf
print(tf.__version__)
tf.constant([0,1,2])

TF_MEMORY_ALLOCATION:

import os
os.environ["TF_MEMORY_ALLOCATION"] = "0.5"
import tensorflow as tf
print(tf.__version__)
tf.constant([0,1,2])

configure_tensorflow:

from merlin.models.loader.tf_utils import configure_tensorflow
configure_tensorflow()
import tensorflow as tf
print(tf.__version__)
tf.constant([0,1,2])
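
Note that all four snippets set the environment variable (or call configure_tensorflow) before the first import tensorflow, since TF reads these settings when it initializes; setting them afterwards may silently have no effect. A small defensive sketch of that constraint (the function name is hypothetical, not a library API):

```python
import os
import sys

def set_tf_allocator(name="cuda_malloc_async"):
    # Guard: if tensorflow is already imported, the allocator env var
    # may be ignored, so fail loudly instead of silently doing nothing.
    if "tensorflow" in sys.modules:
        raise RuntimeError("set TF_GPU_ALLOCATOR before importing tensorflow")
    os.environ["TF_GPU_ALLOCATOR"] = name

set_tf_allocator()
print(os.environ["TF_GPU_ALLOCATOR"])  # cuda_malloc_async
```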

bschifferer avatar Apr 27 '22 09:04 bschifferer

Let's test whether cuda_malloc_async works with 2.7.

bschifferer avatar Apr 27 '22 16:04 bschifferer

Rename configure_tensorflow to allocate_tensorflow_memory and add a kwarg type=dynamic | fixed | None:

  • If None (the default), pick the best strategy based on the TF version.
  • If fixed, force the use of TF_MEMORY_ALLOCATION.
  • If dynamic, try to use cuda_malloc_async if the TF version is >= 2.8.0.
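
The proposed helper could be sketched as follows. The name and the type keyword come from this thread; the body is an assumption for illustration, not the actual merlin-models implementation:

```python
import os

def allocate_tensorflow_memory(type=None, tf_version="2.8.0", fraction=0.5):
    # Hypothetical sketch of the proposed helper; call before
    # importing tensorflow so the env vars take effect.
    major, minor = (int(p) for p in tf_version.split(".")[:2])
    if type is None:
        # default: pick the best strategy for this TF version
        type = "dynamic" if (major, minor) >= (2, 8) else "fixed"
    if type == "dynamic":
        os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
    elif type == "fixed":
        os.environ["TF_MEMORY_ALLOCATION"] = str(fraction)
    else:
        raise ValueError("type must be 'dynamic', 'fixed', or None")
    return type
```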

jperez999 avatar Apr 27 '22 16:04 jperez999

Tensorflow 2.6 behavior (21.12 container)

  • Set nothing: 31/32GB
  • Set cuda_malloc_async: 0.5/32GB
  • Set TF_MEMORY_ALLOCATION=0.5: 31/32GB
  • Use configure_tensorflow: not available in the Docker container

bschifferer avatar Apr 28 '22 14:04 bschifferer

Tensorflow 2.7 behavior (22.02 container)

  • Set nothing: 31/32GB
  • Set cuda_malloc_async: 31/32GB
  • Set TF_MEMORY_ALLOCATION=0.5: 31/32GB
  • Use configure_tensorflow: not available in the Docker container

bschifferer avatar Apr 28 '22 14:04 bschifferer

@rnyak @jperez999 I am not sure how we should continue with the TensorFlow allocation logic :) . I do not understand why it works for 2.6 but does not work for 2.7.

bschifferer avatar Apr 28 '22 14:04 bschifferer

@bschifferer can we add the details about TF memory allocation behavior with respect to different TF versions to a README, and in the example notebooks just say something like "this notebook was developed with TF 2.8; for TF 2.6 and 2.7, please visit the README for tips about TF memory allocation"? What do you think?

rnyak avatar Apr 28 '22 14:04 rnyak

Yes, we can do that. But I wonder why cuda_malloc_async works for TF 2.6 but does not work for TF 2.7.

bschifferer avatar Apr 28 '22 14:04 bschifferer

I reran the tests with TensorFlow installed natively via pip:

TF2.6 (pip)

  • Nothing: 31/32 GB
  • Set cuda_malloc_async: kernel dies
  • Set TF_MEMORY_ALLOCATION=0.5: 31/32GB

TF2.7 (pip)

  • Nothing: 31/32 GB
  • Set cuda_malloc_async: 0.5/32GB
  • Set TF_MEMORY_ALLOCATION=0.5: 31/32GB

TF2.8 (pip)

  • Nothing: 31/32 GB
  • Set cuda_malloc_async: 0.5/32GB
  • Set TF_MEMORY_ALLOCATION=0.5: 31/32GB

bschifferer avatar May 03 '22 08:05 bschifferer

Rename configure_tensorflow to allocate_tensorflow_memory and add a kwarg type=dynamic | fixed | None: if None (the default), pick the best strategy based on the TF version; if fixed, force the use of TF_MEMORY_ALLOCATION; if dynamic, try to use cuda_malloc_async if the TF version is >= 2.8.0.

@jperez999 I think that behavior is correct. In theory, cuda_malloc_async can work with TF 2.7.0, but it depends on the environment. It did not work in our own container, but it did work when TF was installed from pip.

bschifferer avatar May 03 '22 08:05 bschifferer

@bschifferer one note to this thread:

run_ensemble_on_tritonserver raises an error if we set cuda_malloc_async.

rnyak avatar May 04 '22 13:05 rnyak

@jperez999 have you had a chance to update configure_tensorflow to allocate_tensorflow_memory?

bschifferer avatar May 12 '22 09:05 bschifferer

@EvenOldridge, should this be added to the 22.08 scope? It's not clear how this maps to the roadmap.

viswa-nvidia avatar Jul 18 '22 21:07 viswa-nvidia

@viswa-nvidia I closed the ticket as there was no progress for a long time. Please reopen it if we should work on it.

bschifferer avatar Apr 03 '23 11:04 bschifferer