Java TensorFlow, Maven artifact org.tensorflow:libtensorflow:1.15.0. After calling session.runner().run() many times, memory keeps growing until the process hits an OOM.
My test code is in Scala. This is a single prediction; we serve around 100 QPS per Docker container.
import java.nio.{Buffer, FloatBuffer}
import java.util

import org.tensorflow.{SavedModelBundle, Tensor}
import org.tensorflow.framework.{ConfigProto, RunOptions}

import scala.collection.JavaConverters._

// Session config: CPU device count, thread pools and a per-operation timeout.
val config = ConfigProto.newBuilder
  .putDeviceCount("CPU", Runtime.getRuntime.availableProcessors)
  .setInterOpParallelismThreads(8)
  .setIntraOpParallelismThreads(8)
  .setOperationTimeoutInMs(3000)
  .build
val options = RunOptions.newBuilder
  .setTimeoutInMs(5000)
  .build

// The SavedModel is loaded once at startup.
val modelBundle = SavedModelBundle
  .loader(s"$path")
  .withTags("serve")
  .withConfigProto(config.toByteArray)
  .withRunOptions(options.toByteArray)
  .load
val kernel = modelBundle.session

// One prediction: build the input tensors, feed them, run, then read the scores.
val data = Map("tensor1" -> Seq(0.1f, 0.122f) /* , ... more input tensors ... */)
val runner = kernel.runner()
val inputTensorList: util.ArrayList[Tensor[java.lang.Float]] = new util.ArrayList[Tensor[java.lang.Float]]()
data.foreach {
  case (tensorName, featureId) =>
    val dataInput: FloatBuffer = FloatBuffer.allocate(featureId.size)
    featureId.foreach(featureValue => dataInput.put(featureValue))
    dataInput.asInstanceOf[Buffer].flip()
    val tensorShape: Array[Long] = Array(1, featureId.size)
    val tensor = Tensor.create(tensorShape, dataInput)
    runner.feed(tensorName, tensor)
    inputTensorList.add(tensor)
}
for (i <- 0 until 2) {
  runner.fetch("StatefulPartitionedCall", i)
}
val output = runner.run().asScala
val scores: Array[Float] = output.map(ten => {
  val tensorData: Array[Array[Float]] = ten.copyTo(Array.ofDim[Float](ten.shape()(0).toInt, ten.shape()(1).toInt))
  tensorData(0).head
}).toArray

// Release the native memory behind every input and output tensor.
inputTensorList.asScala.foreach(_.close())
output.foreach(_.close())
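One thing I'm not sure about in this pattern: if run() throws (for example on the 3000 ms operation timeout), the close() calls at the end are skipped, so the input tensors would leak. A rough sketch of the same predict path against the Java 1.15 API, with every close moved into a finally block (features here is a hypothetical stand-in for the data map above; needs java.nio.FloatBuffer, java.util.*, org.tensorflow.{Session, Tensor}):

// Same feed/fetch/run pattern, but every tensor is closed in finally so nothing
// leaks when run() throws. `features` (Map<String, float[]>) is a stand-in for `data`.
List<Tensor<?>> inputs = new ArrayList<>();
List<Tensor<?>> outputs = null;
try {
  Session.Runner runner = modelBundle.session().runner();
  for (Map.Entry<String, float[]> entry : features.entrySet()) {
    float[] values = entry.getValue();
    Tensor<Float> t = Tensor.create(new long[] {1, values.length}, FloatBuffer.wrap(values));
    inputs.add(t);                      // track it so it can always be closed
    runner.feed(entry.getKey(), t);
  }
  for (int i = 0; i < 2; i++) {
    runner.fetch("StatefulPartitionedCall", i);
  }
  outputs = runner.run();
  // ... copy the scores out of `outputs` here, before anything is closed ...
} finally {
  inputs.forEach(Tensor::close);
  if (outputs != null) {
    outputs.forEach(Tensor::close);
  }
}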
Hi @hanfengatonline, it looks like you are still using TensorFlow 1.x and an older version of TF Java. That version is no longer supported; please take a look at the new version based on TensorFlow 2.x in this repo instead.
@karllessard OK, I will try to migrate my project to TensorFlow 2.x instead of 1.x.
I'm still experiencing this with TensorFlow 2.16.2, using TensorFlow for Java 1.0.0.
What does your code look like?
I'm performing inference with a number of TensorFlow-based models for the different steps of an NLP pipeline. We identified the memory leak after migrating from TF1 to TF2, most likely because JavaCPP throws an OutOfMemoryError once maxPhysicalBytes is reached, a check that was not present in TF1. That caused our client service to fail with 504s and led us to go looking for the leak.
I've already applied a number of fixes, but I believe there is still some leaking going on.
Part of the code that has been patched is as follows, for USE embeddings:
public float[] embedSequences(final Iterable<String> sequences) {
  final List<String> values = Lists.newArrayList(sequences);
  try (Tensor input = getInputs(values);
       Result output = bundle.session().runner()
           .feed("input", input)
           .fetch("output")
           .run()) {
    Tensor embedding = output.get(0);
    float[][] converted = StdArrays.array2dCopyOf((TFloat32) embedding);
    return converted[0];
  }
}
I've isolated this one to check whether Tensor embedding = output.get(0); was provoking any leak, but it looks like the try-with-resources on Result covers that part.
Again, I'm using similar patterns in a bunch of different places, and I couldn't isolate where the problem comes from yet, as it only reproduces strongly in production under a sustained high load of conversations.
I've tried to mimic it in dev environments and profile the process with async-profiler, collecting flame graphs: the profile starts in an idle (no conversations), warmed-up state and stops once the load is over and the process is idle again.
Both runs below cover a similar test of around 900 utterances flowing through the pipeline. This is the flame graph using TF1:
This is the flame graph using TF2 on the current version:
The main takeaway I see is that TF1 shows a much clearer leak symptom around allocate_output and allocate_tensor, and if the green (Java) part is zoomed in, some inference methods for the different models show up there.
With TF2, on the other hand, the issue looks much more spread out and apparently harder to pin down, but it also looks like most of the leak may have gone away. I may well be wrong here; I'm just trying to interpret the graphs. There's also a significant difference in the leaked bytes, going from around 500 MB down to roughly 12 MB.
I also see oneDNN showing up in the TF2 flame graph; might it be worth trying ITEX_CACHE_ONEDNN_OBJECT=0 to see if the leak fades out, and then finding a value for it that does not penalize performance too much?
Any help is appreciated. Thanks.
Please try to set the "org.bytedeco.javacpp.nopointergc" system property to "true".
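For reference, and if I recall correctly, JavaCPP reads that property when its Pointer class is initialized, so it needs to be set before the first TensorFlow/JavaCPP class is touched: either pass -Dorg.bytedeco.javacpp.nopointergc=true on the JVM command line, or set it at the very top of main. A minimal sketch (the Bootstrap name is just a placeholder):

public final class Bootstrap {
  public static void main(String[] args) {
    // Must run before any org.tensorflow / org.bytedeco class is loaded,
    // otherwise JavaCPP has already picked up the default value.
    System.setProperty("org.bytedeco.javacpp.nopointergc", "true");
    // ... only after this point load the SavedModelBundle and start serving ...
  }
}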
What GC algorithm are you using in the production system? ZGC causes the RSS to be triple counted on some platforms which can interact poorly with JavaCPP's memory tracking.
Setting "org.bytedeco.javacpp.nopointergc" does not seem to solve it. The GC is -XX:+UseG1GC. Physical memory for the process is indeed growing; we see it both in our Netdata reporting and in top's RES column for the process.
I believe the heaviest leak has already been tackled by cleaning up the Java code, but there's still some missing piece. In my tests, after some warm-up, every consecutive set of tests raises RES consumption by another 200-300 MB.
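One way to tell whether that 200-300 MB per round is memory JavaCPP actually tracks (tensors and other deallocator-backed objects) or native memory it doesn't see (e.g. oneDNN scratch buffers) is to log JavaCPP's own counters around each test round. A minimal sketch, assuming only that org.bytedeco.javacpp.Pointer is on the classpath (it is, transitively, with TF Java):

import org.bytedeco.javacpp.Pointer;

// If totalBytes() stays flat while physicalBytes() keeps climbing, the growth is
// probably not un-closed tensors but native allocations JavaCPP does not own.
public final class NativeMemoryLog {
  public static void log(String label) {
    long mb = 1024L * 1024L;
    System.out.printf("%s: javacpp-tracked=%d MB, process RSS=%d MB, limit=%d MB%n",
        label,
        Pointer.totalBytes() / mb,        // bytes held by JavaCPP deallocators
        Pointer.physicalBytes() / mb,     // resident set size of the whole process
        Pointer.maxPhysicalBytes() / mb); // threshold that triggers the OutOfMemoryError
  }
}

Calling it before and after each set of test conversations should show which side the growth lands on.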
How does your getInputs call work? Is it using anything in eager mode?
Not that I know of.
private Tensor getInputs(List<String> values) {
  byte[][] input = new byte[values.size()][];
  for (int i = 0; i < values.size(); i++) {
    String val = values.get(i);
    input[i] = val.getBytes(StandardCharsets.UTF_8);
  }
  int batchSize = input.length;
  NdArray<byte[]> textNdArray = NdArrays.ofObjects(byte[].class, Shape.of(batchSize));
  for (int i = 0; i < batchSize; i++) {
    textNdArray.setObject(input[i], i);
  }
  return TString.tensorOfBytes(textNdArray);
}
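As a side note, probably unrelated to the leak: since TString encodes java.lang.String values as UTF-8 by default, the same input tensor can most likely be built straight from the strings, without the intermediate byte[][]. A small sketch, shown as a drop-in alternative rather than the project's actual code:

// Sketch of an equivalent builder; TString's default charset is UTF-8, which
// should match the explicit getBytes(StandardCharsets.UTF_8) above.
private Tensor getInputs(List<String> values) {
  return TString.tensorOf(NdArrays.vectorOfObjects(values.toArray(new String[0])));
}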
UPDATE. Initially we had a number of TensorFlow objects that were not properly handled within try-with-resources blocks; those have already been addressed. In the current version, however, we still see significant growth in the process's RES memory in dev environments (this hasn't yet been exposed to real production-scale traffic).

I started to see in async-profiler that much of the memory was being allocated for oneDNN purposes, which made me think of disabling it temporarily and falling back to the older Eigen-based computation: TF_ENABLE_ONEDNN_OPTS=0. That made memory stabilize around 15 GB, while having it enabled made RES grow to 18.7 GB and beyond with fairly simple tests (50 parallel conversations with 20 utterances each to run inference on).

Is there any known issue/insight about oneDNN hurting memory usage under certain circumstances?
Unfortunately I don't understand much about how the oneDNN implementation uses memory internally, but you might want to look at tensorflow/tensorflow to see if other people have hit the same issue in Python.