Need persistent HTTP server mode for deployments (like Ollama)
I'm very new here, so apologies if this already exists...
The intent here is to let JVM warm-up kick in instead of paying the full startup cost on every request.
Description: Currently, GPULlama3.java requires spawning a new JVM process for each inference request when wrapped in a web API. This causes 20-80s latency per request due to repeated JVM/TornadoVM/model loading overhead.
Request: Add a persistent server mode where:
- Model loads once at startup and stays in GPU memory
- HTTP server accepts inference requests without process restarts
- Similar to how Ollama operates (loads model once, serves all requests from same process)
Current workaround limitations:
- Flask + subprocess: 20-80s latency (JVM/model reload per request)
- Spring Boot + LangChain4j: Version incompatibility (langchain4j-gpu-llama3 requires Java 21, base image has Java 17)
Ideal solution: Built-in HTTP server (like Ollama) or Java 17-compatible LangChain4j integration
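For reference, the load-once / serve-many pattern being requested can be sketched with the JDK's built-in `com.sun.net.httpserver` package (no extra dependencies, works on Java 17). The `Model` class below is a placeholder for the real GPULlama3.java/TornadoVM loading and inference calls, which are not shown here:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class PersistentServer {

    // Placeholder for the expensive part: in the real project this would be the
    // JVM/TornadoVM/model-loading work that currently happens once per request.
    static final class Model {
        Model() { /* load weights onto the GPU once, at startup */ }

        String generate(String prompt) {
            // Stand-in for actual inference.
            return "echo: " + prompt;
        }
    }

    // Binds an HTTP server that reuses one resident model for every request.
    static HttpServer startServer(Model model, int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/generate", exchange -> {
            String prompt = new String(exchange.getRequestBody().readAllBytes(),
                                       StandardCharsets.UTF_8);
            byte[] reply = model.generate(prompt).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, reply.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(reply);
            }
        });
        server.start(); // requests now reuse the warm JVM and the loaded model
        return server;
    }

    public static void main(String[] args) throws IOException {
        // Model is constructed exactly once; the process then serves all requests.
        startServer(new Model(), 8080);
    }
}
```

The key point is simply that model construction happens once in `main`, not per request, so the 20-80s load cost is paid a single time for the process lifetime.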
Ah, just seen:
https://github.com/beehive-lab/GPULlama3.java/pull/49
Hi @pfrydids and @petenorth,
Indeed, #49 adds support for deploying GPULlama3.java via a RESTful API, but it's currently a work in progress.
In the meantime, you can use the LangChain4j integration (available since version 1.7.1; for application examples, have a look here). The integration lets a Java application use GPULlama3 directly, without spawning a new JVM instance per request!
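The shape of that integration is roughly the following. This is a self-contained sketch only: `ChatModel` here is a local stand-in for LangChain4j's chat-model abstraction, and `GpuLlama3Model` is a hypothetical name, not the actual class shipped by langchain4j-gpu-llama3:

```java
// Sketch of the load-once pattern the LangChain4j integration enables.
// All names below are local stand-ins, not the real library API.
public class ChatService {

    // Minimal stand-in for a chat-model abstraction.
    interface ChatModel {
        String chat(String prompt);
    }

    // Hypothetical placeholder for a GPULlama3-backed model: constructed once,
    // so the GPU weights are loaded a single time for the whole application.
    static final class GpuLlama3Model implements ChatModel {
        GpuLlama3Model(String modelPath) { /* expensive load happens here, once */ }

        @Override
        public String chat(String prompt) {
            return "reply to: " + prompt; // stand-in for actual inference
        }
    }

    private final ChatModel model;

    ChatService(ChatModel model) {
        this.model = model; // injected once at startup, e.g. as a singleton bean
    }

    String answer(String prompt) {
        return model.chat(prompt); // every request reuses the already-loaded model
    }

    public static void main(String[] args) {
        // One model instance shared by all requests for the process lifetime.
        ChatService service = new ChatService(new GpuLlama3Model("model.gguf"));
        System.out.println(service.answer("hello"));
    }
}
```

In a Spring Boot or similar setup, the model would be wired as a singleton bean so the request handlers never trigger a reload.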
Also note that we’re currently working on an integration with Quarkus, which will provide an additional deployment option for GPULlama3.java in the near future.