GPULlama3.java  by beehive-lab

GPU-accelerated LLM inference in pure Java

Created 1 year ago
258 stars

Top 98.0% on SourcePulse

Project Summary

This project provides GPU-accelerated Large Language Model (LLM) inference directly within the Java ecosystem using TornadoVM. It targets Java developers who want to integrate LLMs such as Llama3 and Mistral into their applications and run them efficiently on GPUs without relying on Python.

How It Works

GPULlama3.java leverages TornadoVM to automatically compile and accelerate Java code for GPU execution. It builds upon the Llama3.java library, enabling inference for various LLM architectures (Llama3, Mistral, Qwen, Phi-3, Granite) in GGUF format. The core advantage lies in bringing native GPU acceleration for LLMs to the Java Virtual Machine, facilitating seamless integration with Java frameworks.
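
To make the idea concrete, here is a minimal plain-Java sketch of a matrix-vector product, the kind of loop-heavy kernel that dominates LLM inference and that a compiler such as TornadoVM can offload to a GPU. It is not code from the project: the class and method names are hypothetical, and the comment about TornadoVM's `@Parallel` loop annotation describes where such an annotation would typically go rather than how GPULlama3.java actually structures its kernels. The version below has no dependencies and runs on the CPU.

```java
// Illustrative sketch only; names are hypothetical and not taken from GPULlama3.java.
class MatVecSketch {

    /** Computes out[i] = sum over j of weights[i * cols + j] * x[j] (row-major matrix). */
    static float[] matVec(float[] weights, float[] x, int rows, int cols) {
        float[] out = new float[rows];
        // Assumption: under TornadoVM, this outer loop is the natural candidate
        // for the @Parallel annotation, so that rows are computed concurrently.
        for (int i = 0; i < rows; i++) {
            float sum = 0.0f;
            for (int j = 0; j < cols; j++) {
                sum += weights[i * cols + j] * x[j];
            }
            out[i] = sum;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] w = {1f, 2f, 3f, 4f, 5f, 6f}; // 2 x 3 matrix, row-major
        float[] x = {1f, 0f, -1f};
        // prints [-2.0, -2.0]
        System.out.println(java.util.Arrays.toString(matVec(w, x, 2, 3)));
    }
}
```

Each output row is independent of the others, which is what makes this loop a good fit for data-parallel GPU execution.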

Quick Start & Requirements

  • Primary install/run command: Clone the repository and use provided CLI scripts (llama-tornado) or JBang for execution. Maven dependency is also available.
  • Non-default prerequisites: Java 21 (or 25 for specific features), TornadoVM SDK (with OpenCL or PTX backends), GCC/G++ 13+.
  • Estimated setup time or resource footprint: Not quantified. Setup involves cloning the repository, installing Java and the TornadoVM SDK (plus build tools where needed), and verifying the installation.
  • Links: TornadoVM SDKMAN! page, Hugging Face model collections.

Highlighted Details

  • Supports Llama3, Mistral, Devstral 2, Qwen2.5, Qwen3, Phi-3, IBM Granite 3.2+/4.0 models in GGUF format.
  • Achieves up to 117.65 tokens/s on an RTX 5090 for Llama-3.2-1B-Instruct (FP16).
  • Offers direct integration with Quarkus and LangChain4j (v1.7.1+).
  • Targets NVIDIA (OpenCL/PTX), Intel (OpenCL), and Apple Silicon (Metal/OpenCL) GPUs; Intel and Apple Silicon support is marked as Work In Progress.
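
For readers who want to relate the tokens/s figure above to per-token latency, or measure throughput on their own hardware, the arithmetic can be sketched as follows. The class and method names are hypothetical; only the 117.65 tokens/s figure comes from this page. At that rate, one token takes roughly 8.5 ms.

```java
// Illustrative sketch only; names are hypothetical.
class ThroughputSketch {

    /** Tokens generated per second, given a token count and elapsed wall-clock nanoseconds. */
    static double tokensPerSecond(long tokens, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return tokens / (elapsedNanos / 1e9);
    }

    /** Average latency per token in milliseconds for a given throughput. */
    static double millisPerToken(double tokensPerSecond) {
        return 1000.0 / tokensPerSecond;
    }

    public static void main(String[] args) {
        // Reported peak for Llama-3.2-1B-Instruct (FP16) on an RTX 5090.
        double reported = 117.65;
        System.out.println("ms per token: " + millisPerToken(reported));

        // Measuring your own run: wrap generation with System.nanoTime().
        long start = System.nanoTime();
        long generated = 0;
        for (int i = 0; i < 1000; i++) {
            generated++; // stand-in for producing one token
        }
        long elapsed = Math.max(1, System.nanoTime() - start);
        System.out.println("tokens/s (dummy loop): " + tokensPerSecond(generated, elapsed));
    }
}
```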

Maintenance & Community

The project is built upon Llama3.java by Alfonso² Peterssen. Development is partially funded by several EU & UKRI grants, including Horizon Europe and UKRI AERO. A roadmap is available for future development.

Licensing & Compatibility

The project is released under the MIT license, which is permissive for commercial use and integration into closed-source applications.

Limitations & Caveats

The project is at an early stage, as is AI integration in Java more broadly. Support for Intel, Apple Silicon, and AMD GPUs is marked as Work In Progress (WIP). Users may encounter GPU Out-of-Memory errors, which require increasing the GPU memory allocation or using a more aggressively quantized model. Performance depends heavily on the specific hardware and model used.
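
As a rough guide to why quantization helps with Out-of-Memory errors, the sketch below estimates the memory needed for model weights alone at different precisions. The class name and the round 1-billion parameter count are hypothetical; the bits-per-weight figures for the GGUF Q8_0 and Q4_0 formats (8.5 and 4.5, including per-block scales) are the standard block sizes for those formats. The KV cache and activations need additional memory on top of this estimate.

```java
// Illustrative sketch only; names and the parameter count are hypothetical.
class WeightMemorySketch {

    private static final double BYTES_PER_GIB = 1024.0 * 1024.0 * 1024.0;

    /** Approximate memory for the weights alone, in GiB. */
    static double weightGiB(double parameterCount, double bitsPerWeight) {
        return parameterCount * bitsPerWeight / 8.0 / BYTES_PER_GIB;
    }

    public static void main(String[] args) {
        double params = 1.0e9; // hypothetical round figure for a "1B" model

        // F16 stores 16 bits per weight. GGUF Q8_0 packs 32 weights into 34 bytes
        // (8.5 bits each); Q4_0 packs 32 weights into 18 bytes (4.5 bits each).
        String[] formats = {"F16", "Q8_0", "Q4_0"};
        double[] bitsPerWeight = {16.0, 8.5, 4.5};

        for (int i = 0; i < formats.length; i++) {
            System.out.println(formats[i] + ": " + weightGiB(params, bitsPerWeight[i]) + " GiB");
        }
    }
}
```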

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 6
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days
