Failure when converting Google Gemma 2B models to TFLite
I tried converting the Google Gemma 2B model to TFLite, but the conversion ends in failure.
1. System information
- Ubuntu 22.04
- TensorFlow installation (installed with keras-nlp):
- TensorFlow library (installed with keras-nlp):
2. Code
import os
import time

import numpy as np
import keras
import keras_nlp
import tensorflow as tf
import tensorflow_text as tf_text
from tensorflow.lite.python import interpreter

os.environ["KAGGLE_USERNAME"] = "rag"
os.environ["KAGGLE_KEY"] = 'e7c'
os.environ["KERAS_BACKEND"] = "tensorflow"  # Or "jax" or "torch".

preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    'gemma_2b_en', sequence_length=4096, add_end_token=True
)
generator = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

def run_inference(input, generate_tflite):
    interp = interpreter.InterpreterWithCustomOps(
        model_content=generate_tflite,
        custom_op_registerers=tf_text.tflite_registrar.SELECT_TFTEXT_OPS)
    interp.get_signature_list()
    preprocessor_output = preprocessor.generate_preprocess(
        input, sequence_length=preprocessor.sequence_length
    )
    generator = interp.get_signature_runner('serving_default')
    output = generator(preprocessor_output)
    output = preprocessor.generate_postprocess(output["output_0"])
    print("\nGenerated with TFLite:\n", output)

generate_function = generator.make_generate_function()
concrete_func = generate_function.get_concrete_function({
    "token_ids": tf.TensorSpec([None, 4096]),
    "padding_mask": tf.TensorSpec([None, 4096])
})
converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func],
                                                            generator)
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # enable TensorFlow Lite ops.
    tf.lite.OpsSet.SELECT_TF_OPS     # enable TensorFlow ops.
]
converter.allow_custom_ops = True
converter.target_spec.experimental_select_user_tf_ops = ["UnsortedSegmentJoin", "UpperBound"]
converter._experimental_guarantee_all_funcs_one_use = True
generate_tflite = converter.convert()
run_inference("I'm enjoying a", generate_tflite)

with open('unquantized_mistral.tflite', 'wb') as f:
    f.write(generate_tflite)
3. Failure after conversion
I am getting this error:
tensorflow/core.py":65:1))))))))))))))))))))))))))]): error: missing attribute 'value' LLVM ERROR: Failed to infer result type(s). Aborted (core dumped)
4. (optional) Any other info / logs
2024-02-22 06:34:41.094712: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:378] Ignored output_format.
2024-02-22 06:34:41.094742: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:381] Ignored drop_control_dependency.
2024-02-22 06:34:41.095691: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: /tmp/tmp58p378bn
2024-02-22 06:34:41.140303: I tensorflow/cc/saved_model/reader.cc:51] Reading meta graph with tags { serve }
2024-02-22 06:34:41.140329: I tensorflow/cc/saved_model/reader.cc:146] Reading SavedModel debug info (if present) from: /tmp/tmp58p378bn
2024-02-22 06:34:41.233389: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-02-22 06:34:41.264724: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
2024-02-22 06:34:43.697440: I tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: /tmp/tmp58p378bn
2024-02-22 06:34:44.189111: I tensorflow/cc/saved_model/loader.cc:316] SavedModel load for tags { serve }; Status: success: OK. Took 3093423 microseconds.
2024-02-22 06:34:45.009212: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
loc(fused["ReadVariableOp:", callsite("decoder_block_0_1/attention_1/attention_output_1/Cast/ReadVariableOp@__inference_generate_step_12229"("/workspace/gem.py":38:1) at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_causal_lm.py":258:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_causal_lm.py":235:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_causal_lm.py":212:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_causal_lm.py":214:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/layers/layer.py":816:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/ops/operation.py":42:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":157:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_decoder_block.py":147:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/layers/layer.py":816:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/ops/operation.py":42:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":157:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_attention.py":193:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/layers/layer.py":816:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/ops/operation.py":42:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":157:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/layers/core/einsum_dense.py":218:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/ops/numpy.py":2414:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/numpy.py":90:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/numpy.py":91:1 at "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/core.py":65:1))))))))))))))))))))))))))]): error: missing attribute 'value'
LLVM ERROR: Failed to infer result type(s).
Aborted (core dumped)
Hi @RageshAntonyHM,
I am trying to reproduce the issue, but I ran into another error: ModuleNotFoundError: No module named 'keras_nlp.backend'. Could you please confirm the version you are using?
Thank You
@LakshmiKalaKadali
It is Keras 3.0.5, and I installed keras-nlp via pip install git+https://github.com/keras-team/keras-nlp (0.8.1).
@LakshmiKalaKadali
First install keras-nlp with pip install git+https://github.com/keras-team/keras-nlp and then update Keras (pip install -U keras).
Then also install tensorflow-datasets, @LakshmiKalaKadali.
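Putting those steps together, the install sequence looks roughly like this (Colab-style; versions as of this thread, adjust as needed):
!pip install git+https://github.com/keras-team/keras-nlp
!pip install -U keras
!pip install tensorflow-datasets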
Also crashing in Colab with or without quantization.
@farmaker47
This conversion pipeline needs a lot of VRAM. At least 24 GB.
@LakshmiKalaKadali any updates on this please?
@RageshAntonyHM I got the same crash in Colab on an A100 (40 GB GPU RAM).
@urim85
Yeah. Actually, it is still crashing for me on a 48 GB RTX 6000.
(What I told @farmaker47 was that it will crash prematurely if VRAM is low. But it also crashes in the final step even if you have enough VRAM.)
I saw that training works OK after first installing the TensorFlow nightly version (2.17.0-dev20240223). @RageshAntonyHM can you try with the nightly version and check the conversion again?
@farmaker47
How do I install the TensorFlow nightly version? I tried pip install tf-nightly, but I am getting an error:
File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/core.py", line 5, in
Name: tf-nightly
Version: 2.17.0.dev20240223
I work with Colab. So it is:
!pip install tf-nightly
!pip install -q --upgrade keras-nlp
!pip install -q -U keras>=3
@farmaker47
Now I am again getting the error I first mentioned.
Could you please share your notebook link?
The Colab is from this example:
https://ai.google.dev/gemma/docs/lora_tuning
I have changed nothing.
So the idea is that if you install tf-nightly the conversion error disappears? I don't understand from your previous answer whether the error happens during tf-nightly installation or during conversion.
@farmaker47
I suspect some package conflicts, like some packages reinstalling the 'stable' version of TensorFlow. Let me check.
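A quick way to check which TensorFlow actually ended up active (just a sanity check):
import tensorflow as tf
print(tf.__version__)  # should report the 2.17.0-dev... build if tf-nightly is the active install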
@farmaker47
I am able to run inference already. My problem is that I need to create a TFLite model for Gemma 2B. I think there is still some problem in the conversion.
I am very new to AI and even to Python.
Then we have to wait a little bit until the TF team solves this and provides a tf-nightly version we can use to convert it.
@LakshmiKalaKadali
from keras_nlp.backend import ops
is not needed. Sorry
But when using all the nightly versions, I got a "GraphDef" issue.
A minimal script to reproduce the issue:
import os
import keras
import keras_nlp
import tensorflow as tf
os.environ["KAGGLE_USERNAME"] = '....'
os.environ["KAGGLE_KEY"] = '...'
os.environ["KERAS_BACKEND"] = "tensorflow"
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
tf.saved_model.save(gemma_lm.backbone, '/tmp/gemma_saved_model/')
f = tf.lite.TFLiteConverter.from_saved_model('/tmp/gemma_saved_model/').convert()
I tested with tf-2.15, 2.16, and 2.17 nightly and their corresponding packages. None of them works.
Hi @pkgoogle,
I have reproduced the issue in Colab with TF 2.15; the session crashed at the step generator = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en"). Please take a look.
Thank You
Adding @advaitjain and @paulinesho for visibility.
I believe Colab is running out of memory in @LakshmiKalaKadali's case.
In attempting to replicate the script below, I am running into tensorflow-text installation issues (apparently Gemma uses it for its tokenizer); this may be because of the new 2.16 release.
import os
import keras
import keras_nlp
import tensorflow as tf
os.environ["KAGGLE_USERNAME"] = '....'
os.environ["KAGGLE_KEY"] = '...'
os.environ["KERAS_BACKEND"] = "tensorflow"
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
tf.saved_model.save(gemma_lm.backbone, '/tmp/gemma_saved_model/')
f = tf.lite.TFLiteConverter.from_saved_model('/tmp/gemma_saved_model/').convert()
my error:
TypeError: <class 'keras_nlp.src.models.gemma.gemma_tokenizer.GemmaTokenizer'> could not be deserialized properly. Please ensure that components that are Python object instances (layers, models, etc.) returned by `get_config()` are explicitly deserialized in the model's `from_config()` method.
config={'module': 'keras_nlp.src.models.gemma.gemma_tokenizer', 'class_name': 'GemmaTokenizer', 'config': {'name': 'gemma_tokenizer', 'trainable': True, 'dtype': 'int32', 'proto': None, 'sequence_length': None}, 'registered_name': 'keras_nlp>GemmaTokenizer', 'assets': ['assets/tokenizer/vocabulary.spm'], 'weights': None}.
Exception encountered: Error when deserializing class 'GemmaTokenizer' using config={'name': 'gemma_tokenizer', 'trainable': True, 'dtype': 'int32', 'proto': None, 'sequence_length': None}.
I think there is an answer here showing that it works: https://github.com/keras-team/keras/issues/19108. It is based on this comment: https://github.com/keras-team/keras/issues/19108#issuecomment-1913421572
So my code now is:
model.export("test", "tf_saved_model")
converter = tf.lite.TFLiteConverter.from_saved_model("test")
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
f.write(tflite_model)
With the above, the conversion finishes and the .tflite model runs on Android. I have not used quantization since it fails on Android.
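For a quick desktop sanity check of the converted file (just a sketch, assuming the model.tflite written above; not something from the original comment), the standard TFLite Python interpreter should at least load it and list its signatures:
import tensorflow as tf

# Sketch: load the converted model and confirm it parses and exposes a signature.
interp = tf.lite.Interpreter(model_path="model.tflite")
print(interp.get_signature_list())  # the exported serving signature(s)
interp.allocate_tensors()           # note: needs several GB of RAM for an unquantized 2B model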
@farmaker47
I ran like this:
import os
import time

import numpy as np
import keras
import keras_nlp
import tensorflow as tf
import tensorflow_text as tf_text
from tensorflow.lite.python import interpreter

os.environ["KAGGLE_USERNAME"] = "rag"
os.environ["KAGGLE_KEY"] = 'e7c'
os.environ["KERAS_BACKEND"] = "tensorflow"  # Or "jax" or "torch".

preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    'gemma_2b_en', sequence_length=4096, add_end_token=True
)
model = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

print("exporting")
model.export("test", "tf_saved_model")

print("converting")
converter = tf.lite.TFLiteConverter.from_saved_model("test")
tflite_model = converter.convert()

print("writing")
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
It fails at "converting" with this error: "GPU:0 in order to run Identity: Dst tensor is not initialized. [Op:Identity] name:"
2024-02-28 17:40:21.522813: I external/local_tsl/tsl/framework/bfc_allocator.cc:1114] Stats:
Limit: 23553966080
InUse: 23553959680
MaxInUse: 23553959680
NumAllocs: 1629
MaxAllocSize: 2097152000
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2024-02-28 17:40:21.522846: W external/local_tsl/tsl/framework/bfc_allocator.cc:497] ****************************************************************************************************
Traceback (most recent call last):
File "/workspace/gem.py", line 22, in <module>
converter = tf.lite.TFLiteConverter.from_saved_model("test")
File "/usr/local/lib/python3.10/dist-packages/tensorflow/lite/python/lite.py", line 2087, in from_saved_model
saved_model = _load(saved_model_dir, tags)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/saved_model/load.py", line 912, in load
result = load_partial(export_dir, None, tags, options)["root"]
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/saved_model/load.py", line 1043, in load_partial
loader = Loader(object_graph_proto, saved_model_proto, export_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/saved_model/load.py", line 226, in __init__
self._restore_checkpoint()
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/saved_model/load.py", line 561, in _restore_checkpoint
load_status = saver.restore(variables_path, self._checkpoint_options)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/checkpoint/checkpoint.py", line 1479, in restore
checkpoint=checkpoint, proto_id=0).restore(self._graph_view.root,
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/checkpoint/restore.py", line 62, in restore
restore_ops = self._restore_descendants(reader)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/checkpoint/restore.py", line 463, in _restore_descendants
current_position.checkpoint.restore_saveables(
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/checkpoint/checkpoint.py", line 379, in restore_saveables
registered_savers).restore(self.save_path_tensor, self.options)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/checkpoint/functional_saver.py", line 499, in restore
restore_ops = restore_fn()
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/checkpoint/functional_saver.py", line 467, in restore_fn
ret = restore_fn(restored_tensors)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/training/saving/saveable_object_util.py", line 747, in _restore_from_tensors
return saveable_object_to_restore_fn(self.saveables)(restored_tensors)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/training/saving/saveable_object_util.py", line 784, in _restore_from_tensors
restore_ops[saveable.name] = saveable.restore(
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/training/saving/saveable_object_util.py", line 602, in restore
ret = restore_fn(restored_tensor_dict)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 779, in _restore_from_tensors
restored_tensor = array_ops.identity(
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/weak_tensor_ops.py", line 88, in wrapper
return op(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py", line 5883, in raise_from_not_ok_status
raise core._status_to_exception(e) from None # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from
/job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run Identity: Dst tensor is not initialized. [Op:Identity] name:
Am I doing something wrong?
You can skip the Kaggle_key...😀
I think it's a memory error
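One thing that might be worth trying (untested here, just a sketch): hide the GPU during export/conversion so the SavedModel restore and the converter use host RAM instead of VRAM.
# Sketch of a possible workaround: force the export/conversion to run on CPU + host RAM.
# Must be set before TensorFlow is imported / the model is loaded.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"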
@farmaker47
I rented a 48 GB GPU; now I get another error:
converting
2024-02-28 17:51:39.439053: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:378] Ignored output_format.
2024-02-28 17:51:39.439136: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:381] Ignored drop_control_dependency.
2024-02-28 17:51:39.440216: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: test
2024-02-28 17:51:39.459869: I tensorflow/cc/saved_model/reader.cc:51] Reading meta graph with tags { serve }
2024-02-28 17:51:39.459902: I tensorflow/cc/saved_model/reader.cc:146] Reading SavedModel debug info (if present) from: test
2024-02-28 17:51:39.760067: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-02-28 17:51:39.808678: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
2024-02-28 17:51:44.034223: I tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: test
2024-02-28 17:51:44.392812: I tensorflow/cc/saved_model/loader.cc:316] SavedModel load for tags { serve }; Status: success: OK. Took 4952596 microseconds.
2024-02-28 17:51:44.754857: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
Summary on the non-converted ops:
---------------------------------
* Accepted dialects: tfl, builtin, func
* Non-Converted Ops: 220, Total Ops 2416, % non-converted = 9.11 %
* 184 ARITH ops, 36 TF ops
- arith.constant: 184 occurrences (f32: 153, i32: 31)
- tf.StridedSlice: 36 occurrences (i1: 18, i32: 18)
(f32: 127)
(f32: 36)
(i1: 18)
(f32: 36, i32: 20)
(i32: 72)
(f32: 18)
(f32: 37)
(i1: 18)
(f32: 19, i1: 18)
(f32: 127)
(f32: 1, i32: 90)
(f32: 18)
(i1: 18)
(i32: 1)
(f32: 37)
(i32: 19)
(f32: 272)
(f32: 36, i32: 90)
(f32: 18)
(i1: 18)
(f32: 252)
(f32: 18)
(i32: 180)
(f32: 18)
(f32: 18)
(f32: 36)
(f32: 37)
(f32: 37)
(f32: 36, i32: 180)
(f32: 54)
(f32: 90)
(i32: 54)
(f32: 18)
Killed
The process terminates with a "Killed" message. It never reached the "writing" step!
@RageshAntonyHM this looks like the OS terminated the process, maybe due to memory consumption/cpu time limitation?
@RageshAntonyHM looks like it is still a memory and compute issue.
I think there is an answer here that it is working: keras-team/keras#19108 It is based on this comment: keras-team/keras#19108 (comment)
So my code now is:
model.export("test", "tf_saved_model") converter = tf.lite.TFLiteConverter.from_saved_model("test") tflite_model = converter.convert() with open("model.tflite", "wb") as f: f.write(tflite_model)With the above the conversion finishes and the .tflite model is running into android. I have not used quantization since it is failing into android.
I can confirm that I could get a tflite model by using:
gemma_lm.backbone.export('/tmp/gemma_saved_model')
instead of
tf.saved_model.save(gemma_lm.backbone, '/tmp/gemma_saved_model/')
Note that the converter does not seem memory efficient; I observed that more than 90 GiB of virtual memory was needed on my desktop machine.
@freedomtan can you share your working Colab here?
nope, because I tested it with a simple script on my local machine; didn't try to deal with memory issues in Colab :-)
import os
import keras
import keras_nlp
import tensorflow as tf
os.environ["KAGGLE_USERNAME"] = '....'
os.environ["KAGGLE_KEY"] = '...'
os.environ["KERAS_BACKEND"] = "tensorflow"
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
gemma_lm.backbone.export('/tmp/gemma_saved_model')
tflite_model = tf.lite.TFLiteConverter.from_saved_model('/tmp/gemma_saved_model/').convert()
with open("model.tflite", "wb") as f:
f.write(tflite_model)
For the test, I modified keras_nlp to use fixed tensor dimensions. That's it.
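A rough way to double-check what the export actually produced (a sketch, not from the original comment, assuming the paths used in the script above):
import tensorflow as tf

# Sketch: inspect the exported SavedModel and the converted file.
saved = tf.saved_model.load('/tmp/gemma_saved_model/')
print(saved.signatures)  # signatures captured by backbone.export()

interp = tf.lite.Interpreter(model_path="model.tflite")
for detail in interp.get_input_details():
    print(detail["name"], detail["shape"])  # shows whether the input dimensions are fixed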