
Different Model Outputs on Android vs. iOS with tfjs

Open ertan95 opened this issue 1 year ago • 8 comments

I am experiencing a significant discrepancy in the output logits and probabilities when running the same TensorFlow.js model on Android and iOS. The model, a self-trained MobileNetV3 (large) from PyTorch, performs as expected on iOS and in a PyTorch Jupyter Notebook but produces different results on Android. I convert the model from PyTorch to ONNX and then to TensorFlow.js.

To troubleshoot, I saved the preprocessed tensor from Android and used it on iOS, where it worked correctly. Conversely, the iOS tensor failed on Android. This rules out preprocessing issues, suggesting either improper weight handling on Android or an issue with the predict function.
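To quantify the mismatch, the saved dumps can be compared offline; a minimal sketch (run with Node on a desktop; it assumes the JSON layout written by savePredictions below and the attached file names):

import * as fs from 'fs';

// Load the prediction dumps written by savePredictions on each platform.
const ios = JSON.parse(fs.readFileSync('prediction_ios.json', 'utf8'));
const android = JSON.parse(fs.readFileSync('prediction_android.json', 'utf8'));

// Flatten the nested logits arrays and find the largest absolute difference.
const flatten = (a: unknown): number[] =>
  Array.isArray(a) ? a.flatMap(flatten) : [a as number];
const iosLogits = flatten(ios.logits);
const androidLogits = flatten(android.logits);

const maxDiff = iosLogits.reduce(
  (max, v, i) => Math.max(max, Math.abs(v - androidLogits[i])),
  0
);
// Identical models and inputs should agree to within float32 noise (~1e-5).
console.log('max |iOS - Android| logit difference:', maxDiff);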

System information

  • I am using an iPhone 15 Pro and, for Android, a Samsung Galaxy S20 (SM-G980F)

Snippet from my package.json:

"@tensorflow/tfjs": "^4.15.0",
"@tensorflow/tfjs-core": "^4.15.0",
"@tensorflow/tfjs-react-native": "^1.0.0",

Describe the current behavior
The model outputs consistent and expected results on iOS and in the Jupyter Notebook but produces different and incorrect results on Android.

Describe the expected behavior
The model should produce consistent logits and probabilities across all platforms, including Android, as it does on iOS and in the PyTorch Jupyter Notebook.

Standalone code to reproduce the issue

const savePredictions = async (logits, probabilities, fileName, variants, processedImage) => {
  try {
    const logitsData = await logits.array();
    const probabilitiesData = await probabilities.array();
    const processedImageData = await processedImage.array();

    const predictionsJSON = {
      variants: variants,
      processedImage: processedImageData,
      logits: logitsData,
      probabilities: probabilitiesData,
    };
    const tensorJSON = JSON.stringify(predictionsJSON);
    await FileSystem.writeAsStringAsync(fileName, tensorJSON);
    console.log('Predictions saved:', fileName, 'in', FileSystem.documentDirectory);
  } catch (error) {
    console.error('Error:', error);
  }
};

const processImage = async (uri: string): Promise<tf.Tensor> => {
  try {
    // rescale picture to model trained picture size
    const resizedImg = await manipulateAsync(
      uri,
      [{ resize: { width: trainingSizes.img_width, height: trainingSizes.img_height } }],
      { compress: 0.6, format: SaveFormat.JPEG, base64: true }
    );

    const imageTensor = tf.tidy(() => {
      // resizedImg.base64 is already a bare base64 string; decode it to raw JPEG bytes
      const jpegBytes = tf.util.encodeString(resizedImg.base64, 'base64');
      let tensor = decodeJpeg(jpegBytes);
      tensor = tf.image.resizeBilinear(tensor, [
        trainingSizesEN.img_height,
        trainingSizesEN.img_width,
      ]);
      // scale to [0, 1] and apply ImageNet per-channel mean/std normalization
      tensor = tensor.div(255.0);
      tensor = tensor
        .sub(tf.tensor1d([0.485, 0.456, 0.406]))
        .div(tf.tensor1d([0.229, 0.224, 0.225]));
      // HWC -> CHW and add the batch dimension, as the PyTorch-exported model expects
      tensor = tensor.transpose([2, 0, 1]).expandDims(0);
      return tensor;
    });

    //console.log('processImage memory:', tf.memory());
    return imageTensor;
  } catch (error) {
    console.error('Error on preprocessing image:', error);
    throw error;
  }
};


const predictImage = async (
  model: tf.GraphModel | null,
  processedImage: tf.Tensor,
  variants: string[],
): Promise<string[]> => {
  try {
    if (!model) {
      throw new Error('Model not loaded');
    }
    //Overwrite processedImage with test data
    /*
    const testTensorData: number[][][][] = predictionJSON_Android[
      'processedImage'
    ] as number[][][][];
    const testTensor = tf.tensor4d(testTensorData);
    processedImage = testTensor;
    */

    // Mask out non-relevant classes
    const maskArray = Object.values(classLabels).map((label) =>
      variants.includes(label) ? 1 : 0
    );
    const maskTensor = tf.tensor(maskArray, [1, maskArray.length]);
    const modelInput = { input: processedImage, mask: maskTensor };
    const tidyResult = tf.tidy(() => {
      const logits = model.predict(modelInput) as tf.Tensor;
      const probabilities = tf.softmax(logits);
      return { logits, probabilities };
    });

    await savePredictions(
      tidyResult.logits,
      tidyResult.probabilities,
      FileSystem.documentDirectory + 'prediction.json',
      variants,
      processedImage
    );

    tidyResult.logits.dispose();
    maskTensor.dispose();
    tf.dispose(processedImage);

    const predictionArrayBuffer = await tidyResult.probabilities.data();
    tidyResult.probabilities.dispose();

    const predictionArray = Array.from(predictionArrayBuffer);
    const classLabelsArray = Object.values(classLabels);

    const variantPredictions = predictionArray
      .map((probability, index) => ({ label: classLabelsArray[index], probability }))
      .filter((prediction) => cardVariants.includes(prediction.label))
      .sort((a, b) => b.probability - a.probability);

    variantPredictions.forEach((variant) => {
      console.log(`Probability for ${variant.label}: ${variant.probability}`);
    });

    const sortedLabels = variantPredictions.map((prediction) => prediction.label);
    return sortedLabels;
  } catch (error) {
    console.error('Error on prediction:', error);
    throw error;
  }
};
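For context, a minimal sketch of how these helpers could be chained (the driver function is hypothetical; uri and variants come from the app UI):

// Hypothetical driver wiring loadModel, processImage, and predictImage together.
const runPrediction = async (uri: string, variants: string[]) => {
  await tf.ready(); // make sure the tfjs backend is initialized
  const model = await loadModel(); // defined below
  const processedImage = await processImage(uri);
  const sortedLabels = await predictImage(model, processedImage, variants);
  console.log('Top prediction:', sortedLabels[0]);
};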

....
//Loading model
const loadModel = async () => {
  try {
    const ioHandler = bundleResourceIO(modelJson as tf.io.ModelJSON, [
      modelWeights1,
      modelWeights2,
      modelWeights3,
      modelWeights4,
    ]);
    const model = await tf.loadGraphModel(ioHandler);
    return model;
  } catch (error) {
    console.error('Error on loading model:', error);
    return null;
  }
};
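For completeness, bundleResourceIO expects the model JSON and weight shards to be bundled through static require() calls, roughly like this (asset paths are placeholders):

// Metro only bundles statically required assets, so each weight shard
// must be required explicitly; the paths below are placeholders.
const modelJson = require('../assets/model/model.json');
const modelWeights1 = require('../assets/model/group1-shard1of4.bin');
const modelWeights2 = require('../assets/model/group1-shard2of4.bin');
const modelWeights3 = require('../assets/model/group1-shard3of4.bin');
const modelWeights4 = require('../assets/model/group1-shard4of4.bin');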

Other info / logs
Attached prediction dumps: prediction_ios.json, prediction_ios_with_android_tensor.json, prediction_android.json, prediction_android_with_ios_tensor.json

ertan95 avatar Jul 24 '24 10:07 ertan95

I have the same problem on all devices based on Samsung Exynos chipsets, and I can reproduce it on a POCO C40 as well; on Snapdragon or iOS devices it works as expected. The project runs on Angular with converted models on tfjs 4.17.0...

try {
  this.model = await tfconv.loadGraphModel(path, {
    requestInit,
    onProgress: (value) => {
      console.log('[MODEL] loading: ' + getModelNameFromPath(path) + ': ' + (value * 100) + '%');
      this.updateLoadingProgress(value);
    },
  });
} catch (e) {
  console.error('[MODEL] error loading model', e);
}

const res = await modelLoad.executeAsync({[config.inputsName]: inputTensor}, config.outputs);

oleksandr-ravin avatar Jul 25 '24 13:07 oleksandr-ravin

I downgraded the libs to version 3.3.0 and it works! On version 3.11.0 the problem is still there. I didn't check the versions between 3.3.0 and 3.11.0.
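For anyone trying the same downgrade: pin the exact version with no caret range, so the package manager cannot resolve a newer release, e.g. in package.json (apply the same to any other @tensorflow/* packages in use):

"@tensorflow/tfjs": "3.3.0",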

oleksandr-ravin avatar Jul 26 '24 04:07 oleksandr-ravin

Hi, @ertan95, @oleksandr-ravin

I apologize for the delayed response, and thank you for bringing this issue to our attention with valuable analysis and insights. If possible, could you please share your GitHub repo along with comprehensive steps to reproduce this behavior, so we can investigate it further from our end?

Thank you for your cooperation and patience.

gaikwadrahul8 avatar Aug 01 '24 15:08 gaikwadrahul8

Hi @gaikwadrahul8, for my models:

"outputs": [
  "StatefulPartitionedCall/model_1/zoomin_type/Softmax",
  "StatefulPartitionedCall/model_1/sectors_quality/Sigmoid",
  "StatefulPartitionedCall/model_1/body_type/Softmax",
  "StatefulPartitionedCall/model_1/out_of_distribution/Sigmoid",
  "StatefulPartitionedCall/model_1/spheric_sectors_onehot_encoded/Softmax"
],

I finished testing all versions, and 3.3.0 is the latest version that gives me correct values; we need to find what changed from 3.3.0 to 3.4.0. If I understand correctly, it's a problem with hardware-specific translation and calculation; other phones don't have this trouble.

oleksandr-ravin avatar Aug 02 '24 05:08 oleksandr-ravin

I will have a look at this one. A downgrade is not the best option for me due to other dependencies, but I will give it a try thanks for the solution!

ertan95 avatar Aug 25 '24 15:08 ertan95

Well, I've described the steps to reproduce in my initial post. Basically: train a MobileNetV3 and test it on iOS and Android with the same image; you will get different outputs.

ertan95 avatar Aug 25 '24 15:08 ertan95

Hi, @ertan95, @oleksandr-ravin, @gaikwadrahul8. I also have the same problem, and here are some test results I gathered. I hope they help with fixing this problem.

  1. tfjs version test: in tfjs 3.3.0 with the webgl backend, I get the right result; in tfjs 3.4.0 with the webgl backend, I get the wrong result
    -> something that changed from 3.3.0 to 3.4.0 seems to cause this problem

  2. backend test: in tfjs 4.20 with the webgl backend, I get the wrong result;
    in tfjs 4.20 with the cpu backend, I get the right result

  3. CPU chipset test: on some mobile phones with Exynos (not all), I found this problem; I couldn't find it with Snapdragon so far

Considering these test results, I guess there is some incompatibility between the webgl backend and some Exynos versions, caused by changes applied when updating tfjs from 3.3.0 to 3.4.0.
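Based on these observations, one possible mitigation until the webgl issue is found: fall back to the cpu backend on affected devices. A minimal sketch; the device check is a hypothetical placeholder:

import * as tf from '@tensorflow/tfjs';

// isAffectedDevice is a hypothetical check, e.g. matching known Exynos models.
const initBackend = async (isAffectedDevice: boolean) => {
  if (isAffectedDevice) {
    // The cpu backend gave correct results in the tests above, but is much slower.
    await tf.setBackend('cpu');
  } else {
    await tf.setBackend('webgl');
  }
  await tf.ready();
  console.log('Active backend:', tf.getBackend());
};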

keunhyunkim avatar Sep 26 '24 23:09 keunhyunkim

Had similar issues on some Samsung devices. Exynos or not did not seem to be the deciding factor: some Exynos processors work, some don't. The A55 does not work; the A54 does work. Both use Exynos-branded chips, so it depends on the exact chip. Changing the version of tfjs did not help at all.

Ended up converting the model to ONNX and using ONNX Runtime Web instead of tfjs, and that works out great on all devices so far. I would hope that even if that runtime fails on some devices too (has not happened yet), tfjs would work there, so that one of the two runtimes covers all devices.
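For reference, a minimal sketch of the onnxruntime-web path (model URL, input name, and dims are placeholders, not the exact code used here):

import * as ort from 'onnxruntime-web';

// Run one inference with ONNX Runtime Web; 'input' must match the model's input name.
const runOnnx = async (inputData: Float32Array) => {
  const session = await ort.InferenceSession.create('/models/model.onnx');
  const feeds = { input: new ort.Tensor('float32', inputData, [1, 3, 224, 224]) };
  const results = await session.run(feeds);
  return results; // map of output name -> ort.Tensor
};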

shiomax avatar Oct 03 '24 13:10 shiomax