Jlama by tjake

LLM inference engine for Java applications

Created 2 years ago
1,235 stars

Top 31.8% on SourcePulse

Project Summary

Jlama provides a modern LLM inference engine for Java developers, enabling direct integration of large language models into Java applications. It supports a wide range of popular LLM architectures, including Mixture-of-Experts models, along with inference features such as Paged Attention, targeting developers who need to run LLMs within the Java ecosystem.

How It Works

Jlama leverages Java 21's Vector API for optimized inference performance. It supports various model formats, including Hugging Face's SafeTensors, and offers quantization (Q8, Q4) and precision options (F32, F16, BF16). The engine implements advanced techniques like Paged Attention and Mixture of Experts, aiming for efficient and scalable LLM execution within the JVM.
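
As a concrete sketch, the snippet below downloads a quantized model and runs generation through Jlama's Java API. It follows the shape of the project's documented sample, but the model id, package layout, and the Generator.Response type are assumptions here and may differ across versions:

    import com.github.tjake.jlama.model.AbstractModel;
    import com.github.tjake.jlama.model.ModelSupport;
    import com.github.tjake.jlama.model.functions.Generator;
    import com.github.tjake.jlama.safetensors.DType;
    import com.github.tjake.jlama.safetensors.SafeTensorSupport;
    import com.github.tjake.jlama.safetensors.prompt.PromptContext;

    import java.io.File;
    import java.util.UUID;

    public class JlamaSample {
        public static void main(String[] args) throws Exception {
            // Example model id (an assumption); any supported SafeTensors model should work.
            String model = "tjake/Llama-3.2-1B-Instruct-JQ4";

            // Download the model to ./models, or reuse an earlier download.
            File localModelPath = SafeTensorSupport.maybeDownloadModel("./models", model);

            // Compute in F32 while holding Q8-quantized weights (I8) in memory.
            AbstractModel m = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);

            // Use the model's chat template when it ships one, else a raw prompt.
            String prompt = "What is the best season to plant avocados?";
            PromptContext ctx = m.promptSupport().isPresent()
                    ? m.promptSupport().get().builder().addUserMessage(prompt).build()
                    : PromptContext.of(prompt);

            // Generate up to 256 tokens at temperature 0, streaming tokens to stdout.
            Generator.Response r =
                    m.generate(UUID.randomUUID(), ctx, 0.0f, 256, (token, time) -> System.out.print(token));
            System.out.println(r.responseText);
        }
    }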

Quick Start & Requirements

  • CLI: Install via jbang app install --force jlama@tjake, then serve a model over HTTP with jlama restapi <model_name>.
  • Java Project: Requires Java 21 or later. Enable the incubator Vector API and preview features with export JDK_JAVA_OPTIONS="--add-modules jdk.incubator.vector --enable-preview". Add the jlama-core and jlama-native Maven dependencies (see the snippet after this list).
  • JBang (required for the CLI install): https://www.jbang.dev/download/
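
A Maven setup along these lines is a reasonable starting point. The com.github.tjake coordinates follow the project's artifact names, but the version property and the per-OS classifier for jlama-native are placeholders to fill in (the classifier is commonly derived via the os-maven-plugin):

    <!-- Versions and the native classifier are illustrative; pin to the latest release. -->
    <dependency>
      <groupId>com.github.tjake</groupId>
      <artifactId>jlama-core</artifactId>
      <version>${jlama.version}</version>
    </dependency>
    <dependency>
      <groupId>com.github.tjake</groupId>
      <artifactId>jlama-native</artifactId>
      <!-- jlama-native ships per-OS binaries; a classifier such as linux-x86_64
           selects the right one for the build machine. -->
      <classifier>${os.detected.name}-${os.detected.arch}</classifier>
      <version>${jlama.version}</version>
    </dependency>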

Highlighted Details

  • Supports Gemma, Llama, Mistral, Mixtral, Qwen2, Granite, and GPT-2 models.
  • Implements Paged Attention, Mixture of Experts, and Tool Calling.
  • Offers an OpenAI-compatible REST API and distributed inference capabilities (see the REST sketch after this list).
  • Supports Hugging Face SafeTensors, various data types, and quantization (Q8, Q4).
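
Since the REST API is OpenAI-compatible, any OpenAI-style client can talk to it. Below is a minimal sketch using the JDK's built-in HttpClient, assuming a server started with jlama restapi listening on localhost:8080 and the standard /v1/chat/completions path; the port, base path, and model placeholder are assumptions, not documented values:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class JlamaRestSample {
        public static void main(String[] args) throws Exception {
            // OpenAI-style chat completion request. A single-model server may
            // ignore the "model" field; it is included for schema compatibility.
            String body = """
                    {
                      "model": "<model_name>",
                      "messages": [
                        {"role": "user", "content": "Say hello from Jlama"}
                      ]
                    }
                    """;

            // localhost:8080 and the /v1 prefix are assumptions; adjust to
            // wherever `jlama restapi` is actually serving.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/v1/chat/completions"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // The JSON response follows OpenAI's chat-completions schema.
            System.out.println(response.body());
        }
    }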

Maintenance & Community

The project is maintained by T Jake Luciani. The roadmap includes support for more models, pure-Java tokenizers, LoRA, GraalVM, and enhanced distributed inference.

Licensing & Compatibility

Licensed under the Apache License 2.0, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

Requires Java 21 with the incubator Vector API module and preview features enabled, which may not be suitable for all production environments. The roadmap indicates that features such as GraalVM support are still under development.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 1
  • Issues (30d): 3
  • Star history: 26 stars in the last 30 days

Explore Similar Projects

Starred by Ross Wightman (author of timm; CV at Hugging Face), Awni Hannun (author of MLX; Research Scientist at Apple), and 1 more.

mlx-llm by riccardomusmeci
0% · 459 stars
LLM tools/apps for Apple Silicon using MLX
Created 2 years ago · Updated 11 months ago

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Sebastian Raschka (author of "Build a Large Language Model (From Scratch)"), and 11 more.

optillm by algorithmicsuperintelligence
0.5% · 3k stars
Optimizing inference proxy for LLMs
Created 1 year ago · Updated 2 weeks ago

Starred by Jason Knight (Director AI Compilers at NVIDIA; cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler
0.4% · 6k stars
LLM inference engine for blazing fast performance
Created 1 year ago · Updated 2 days ago

Starred by Lianmin Zheng (coauthor of SGLang and vLLM), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB
0.1% · 8k stars
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 1 year ago · Updated 3 months ago