
GPU-accelerated Llama3.java inference in pure Java using TornadoVM.

21 GPULlama3.java issues

The main optimization uses all the threads, instead of only the first global thread, to calculate the scaling factor. This avoids thread divergence and the need to...
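A minimal sketch of the idea, assuming a TornadoVM `KernelContext`-style kernel: the method and array names (`reduceMaxAbs`, `values`, `scale`, `groupSize`) are illustrative and not taken from this PR. All work-items cooperate in a tree reduction over local memory to find the maximum absolute value used for the scale, instead of a single thread scanning the block serially.

```java
import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class ScaleReductionSketch {

    // Every work-item loads one element and joins a tree reduction in local
    // memory; only the final write is guarded by a thread-id check, so the
    // group no longer serializes on one thread computing the scale.
    public static void reduceMaxAbs(KernelContext context, FloatArray values,
                                    FloatArray scale, int groupSize) {
        int gid = context.globalIdx;
        int lid = context.localIdx;
        float[] local = context.allocateFloatLocalArray(groupSize);

        local[lid] = Math.abs(values.get(gid));
        context.localBarrier();

        // Tree reduction: all active threads follow the same branch pattern.
        for (int stride = groupSize / 2; stride > 0; stride /= 2) {
            if (lid < stride) {
                local[lid] = Math.max(local[lid], local[lid + stride]);
            }
            context.localBarrier();
        }

        if (lid == 0) {
            // One scale per work-group, e.g. maxAbs / 127 for Q8_0-style blocks.
            scale.set(gid / groupSize, local[0] / 127.0f);
        }
    }
}
```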

Correct links should be:

- Qwen3 (1.7B) - FP16: https://huggingface.co/ggml-org/Qwen3-1.7B-GGUF/resolve/main/Qwen3-1.7B-f16.gguf
- Qwen3 (4B) - FP16: https://huggingface.co/ggml-org/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-f16.gguf
- Qwen3 (8B) - FP16: https://huggingface.co/ggml-org/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-f16.gguf

- Introduced `gpullama3-architecture.svg` and `gpullama3-architecture-light.svg` diagrams.
- Improved README with a simplified model collection reference.
- Moved detailed GPU requirements and CLI options to `RUN_DEBUG.md` for clarity.

```bash
./llama-tornado --model gemma-3-1b-it-f16.gguf --prompt "who are you" --max-tokens 30 --top-p 0.9
```

Implement complete Q4_0 quantization support following the same pattern as Q8_0.

Core Q4_0 Infrastructure:
- Add `Q4_0TornadoTensor` for GPU tensor representation with 4-bit quantization
- Implement `Q4_0LayerPlanner` base class for...
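For reference, a standalone sketch of how a Q4_0 block is laid out and dequantized, assuming the standard GGML Q4_0 format (per 32-weight block: a 2-byte FP16 scale followed by 16 bytes of packed 4-bit values). The class and method names are illustrative, not the `Q4_0TornadoTensor` API proposed in this issue.

```java
public final class Q4_0BlockSketch {
    public static final int BLOCK_SIZE = 32;      // weights per block
    public static final int BYTES_PER_BLOCK = 18; // 2-byte FP16 scale + 16 packed bytes

    // Dequantize one Q4_0 block: nibble values 0..15 are recentred to -8..7
    // and multiplied by the per-block FP16 scale.
    public static void dequantize(byte[] raw, int blockOffset, float[] out, int outOffset) {
        short scaleBits = (short) ((raw[blockOffset] & 0xFF) | ((raw[blockOffset + 1] & 0xFF) << 8));
        float d = Float.float16ToFloat(scaleBits); // Java 20+: FP16 -> FP32
        for (int j = 0; j < BLOCK_SIZE / 2; j++) {
            int b = raw[blockOffset + 2 + j] & 0xFF;
            out[outOffset + j]      = ((b & 0x0F) - 8) * d; // low nibble -> first half
            out[outOffset + j + 16] = ((b >>> 4) - 8) * d;  // high nibble -> second half
        }
    }
}
```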

This PR adds a JavaFX GUI for running inference with `GPULlama3` (for issue #24). It introduces a new package, `com.example.gui`, containing the classes for the chatbox GUI,...
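A minimal sketch of the kind of chat window such a GUI might provide, using plain JavaFX controls; the class name and wiring are hypothetical and not taken from the PR's `com.example.gui` package.

```java
import javafx.application.Application;
import javafx.scene.Scene;
import javafx.scene.control.Button;
import javafx.scene.control.TextArea;
import javafx.scene.control.TextField;
import javafx.scene.layout.BorderPane;
import javafx.scene.layout.HBox;
import javafx.stage.Stage;

// Hypothetical chat window: a transcript area, a prompt field, and a send
// button. Handing the prompt to the inference engine is left out.
public class ChatWindowSketch extends Application {

    @Override
    public void start(Stage stage) {
        TextArea transcript = new TextArea();
        transcript.setEditable(false);

        TextField prompt = new TextField();
        prompt.setPromptText("Type a prompt...");

        Button send = new Button("Send");
        send.setOnAction(e -> {
            transcript.appendText("You: " + prompt.getText() + "\n");
            // The real GUI would run inference here and append generated
            // tokens to the transcript as they arrive.
            prompt.clear();
        });

        HBox inputRow = new HBox(8, prompt, send);
        BorderPane root = new BorderPane(transcript, null, null, inputRow, null);

        stage.setScene(new Scene(root, 640, 480));
        stage.setTitle("GPULlama3 chat (sketch)");
        stage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}
```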

Tooling

**Describe the bug**
Tokenizer: Phi3Tokenizer
Loading model weights in TornadoVM format (loading F16)
Starting TornadoVM initialization...
TornadoVM GPU execution plan creation: 619.22 ms
Java to GPU JIT compiler warmup: 6147.02...

bug