GPULlama3.java by beehive-lab

GPU-accelerated LLM inference in pure Java

Created 1 year ago

265 stars

Top 96.3% on SourcePulse

Project Summary

This project provides GPU-accelerated Large Language Model (LLM) inference directly within the Java ecosystem using TornadoVM. It targets Java developers who want to run models such as Llama3 and Mistral on GPUs from their own applications without depending on Python.

How It Works

GPULlama3.java uses TornadoVM to automatically compile Java code for GPU execution. It builds on the Llama3.java library and runs inference for several LLM architectures (Llama3, Mistral, Qwen, Phi-3, Granite) from model files in GGUF format. The main benefit is GPU-accelerated LLM inference that stays on the Java Virtual Machine, which simplifies integration with Java frameworks.
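The kind of loop that TornadoVM is designed to offload can be sketched in plain Java. The matrix-vector product below is the core operation in transformer inference. The class and method names are hypothetical and are not taken from the GPULlama3.java sources, and the code runs on the CPU as written. Under TornadoVM, the outer loop would typically be marked with its `@Parallel` annotation and the method registered in a task graph so that an OpenCL or PTX kernel can be generated; those steps are omitted here to keep the sketch free of dependencies.

```java
// Hypothetical, dependency-free sketch: a row-major matrix-vector product,
// the operation behind each transformer layer's linear projections.
class MatVecSketch {

    // Computes out[i] = sum over j of weights[i * cols + j] * x[j].
    static void matVec(float[] weights, float[] x, float[] out, int rows, int cols) {
        // Each output row is independent of the others, so this outer loop
        // is the natural candidate for GPU parallelization.
        for (int i = 0; i < rows; i++) {
            float sum = 0.0f;
            for (int j = 0; j < cols; j++) {
                sum += weights[i * cols + j] * x[j];
            }
            out[i] = sum;
        }
    }

    public static void main(String[] args) {
        float[] weights = {1, 2, 3, 4, 5, 6}; // 2 x 3 matrix, row-major
        float[] x = {1, 0, -1};
        float[] out = new float[2];
        matVec(weights, x, out, 2, 3);
        System.out.println(out[0] + " " + out[1]); // prints "-2.0 -2.0"
    }
}
```

Because no iteration of the outer loop reads a value written by another, the same method body can be mapped to one GPU thread per output element without changing its result.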

Quick Start & Requirements

Install/run: Clone the repository, then run models with the provided CLI scripts (llama-tornado) or with JBang. A Maven dependency is also available.
Prerequisites: Java 21 (or 25 for specific features), the TornadoVM SDK (with the OpenCL or PTX backend), GCC/G++ 13+.
Setup: Install Java and the TornadoVM SDK (and possibly build tools), clone the repository, and verify the installation.
Links: TornadoVM SDKMAN! page, Hugging Face model collections.
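A pre-flight check for the prerequisites above might look like the following minimal sketch. The class and method names are hypothetical, and the environment variable name `TORNADO_SDK` is an assumption that this summary does not state; consult the TornadoVM installation instructions for the actual variable.

```java
// Hypothetical pre-flight check for the Java and TornadoVM SDK prerequisites.
class PrereqCheck {

    // True when the JDK feature version is at least the required one.
    static boolean meetsMinimum(int actualFeature, int requiredFeature) {
        return actualFeature >= requiredFeature;
    }

    // True when an environment variable value is present and non-blank.
    static boolean isSet(String value) {
        return value != null && !value.isBlank();
    }

    public static void main(String[] args) {
        int feature = Runtime.version().feature(); // e.g. 21 on JDK 21
        System.out.println("Java " + feature + " meets minimum: " + meetsMinimum(feature, 21));
        // ASSUMPTION: the variable name TORNADO_SDK is not given in this summary.
        System.out.println("TORNADO_SDK set: " + isSet(System.getenv("TORNADO_SDK")));
    }
}
```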

Highlighted Details

Supports Llama3, Mistral, Devstral 2, Qwen2.5, Qwen3, Phi-3, and IBM Granite 3.2+/4.0 models in GGUF format.
Achieves up to 117.65 tokens/s on an RTX 5090 for Llama-3.2-1B-Instruct (FP16).
Offers direct integration with Quarkus and LangChain4j (v1.7.1+).
Targets NVIDIA (OpenCL/PTX), Intel (OpenCL), and Apple Silicon (Metal/OpenCL) GPUs; see Limitations & Caveats for the maturity of each.
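To put the throughput figure above in perspective, the short sketch below converts tokens per second into per-token latency and total generation time. It assumes the reported 117.65 tokens/s is a steady decode rate and ignores prompt processing; the class name and the 256-token response length are illustrative only.

```java
import java.util.Locale;

// Illustrative arithmetic only: converts a tokens/s figure into latencies.
class ThroughputMath {

    // Average time to produce one token, in milliseconds.
    static double millisPerToken(double tokensPerSecond) {
        return 1000.0 / tokensPerSecond;
    }

    // Time to produce the given number of tokens, in seconds.
    static double secondsFor(int tokens, double tokensPerSecond) {
        return tokens / tokensPerSecond;
    }

    public static void main(String[] args) {
        double rate = 117.65; // tokens/s, the figure quoted above
        System.out.println(String.format(Locale.ROOT, "%.2f ms per token", millisPerToken(rate)));
        // prints "8.50 ms per token"
        System.out.println(String.format(Locale.ROOT, "%.2f s for 256 tokens", secondsFor(256, rate)));
        // prints "2.18 s for 256 tokens"
    }
}
```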

Maintenance & Community

The project is built upon Llama3.java by Alfonso² Peterssen. Development is partially funded by several EU & UKRI grants, including Horizon Europe and UKRI AERO. A roadmap is available for future development.

Licensing & Compatibility

The project is released under the MIT license, which permits commercial use and integration into closed-source applications.

Limitations & Caveats

The project is an early-stage effort in bringing AI workloads to Java. Support for Intel, Apple Silicon, and AMD GPUs is marked as Work In Progress (WIP). Users may hit GPU out-of-memory errors, which require adjusting the GPU memory allocation or using a quantized model. Performance depends heavily on the specific hardware and model used.
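A rough estimate of weight memory shows why quantization helps with out-of-memory errors. The sketch below multiplies a parameter count by the storage cost per weight. The nominal 1-billion parameter count is an assumption for illustration, and Q8_0 is used as an example GGUF quantization format (32 8-bit weights plus one 16-bit scale per block, so 34 bytes per 32 weights); this summary does not say which quantized formats the project supports. The estimate covers weights only and excludes the KV cache, activations, and runtime buffers.

```java
import java.util.Locale;

// Illustrative estimate of GPU memory needed for model weights alone.
class WeightMemoryEstimate {

    // Total bytes for the given parameter count and per-weight storage cost.
    static double weightBytes(long parameters, double bytesPerWeight) {
        return parameters * bytesPerWeight;
    }

    // Converts bytes to gibibytes (2^30 bytes).
    static double toGiB(double bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long params = 1_000_000_000L;     // ASSUMPTION: nominal "1B" model size
        double fp16 = 2.0;                // 16 bits per weight
        double q8_0 = 34.0 / 32.0;        // GGUF Q8_0: 34 bytes per block of 32 weights
        System.out.println(String.format(Locale.ROOT, "FP16: %.2f GiB", toGiB(weightBytes(params, fp16))));
        // prints "FP16: 1.86 GiB"
        System.out.println(String.format(Locale.ROOT, "Q8_0: %.2f GiB", toGiB(weightBytes(params, q8_0))));
        // prints "Q8_0: 0.99 GiB"
    }
}
```

If the estimate plus a margin for the KV cache exceeds the memory available to the GPU runtime, either the allocation must be raised or a smaller or more heavily quantized model chosen.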
