LLM inference engine for Java applications
Jlama provides a modern LLM inference engine for Java developers, enabling direct integration of large language models into Java applications. It supports a wide range of popular LLM architectures and features like Paged Attention and Mixture of Experts, targeting developers who need to leverage LLMs within the Java ecosystem.
How It Works
Jlama leverages Java 21's Vector API for optimized inference performance. It supports various model formats, including Hugging Face's SafeTensors, and offers quantization (Q8, Q4) and precision options (F32, F16, BF16). The engine implements advanced techniques like Paged Attention and Mixture of Experts, aiming for efficient and scalable LLM execution within the JVM.
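To make the performance point concrete, below is a minimal sketch of the kind of SIMD kernel the Vector API enables. This is illustrative use of the standard jdk.incubator.vector API, not Jlama's actual code; the DotKernel class and dot method are hypothetical names.

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    // Illustrative only: a SIMD dot product, the building block of the
    // matrix-vector multiplies that dominate LLM inference.
    // Compile and run with: --add-modules jdk.incubator.vector
    public final class DotKernel {
        private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

        static float dot(float[] a, float[] b) {
            FloatVector acc = FloatVector.zero(SPECIES);
            int i = 0;
            int bound = SPECIES.loopBound(a.length);
            for (; i < bound; i += SPECIES.length()) {
                FloatVector va = FloatVector.fromArray(SPECIES, a, i);
                FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
                acc = va.fma(vb, acc); // lane-wise fused multiply-add
            }
            float sum = acc.reduceLanes(VectorOperators.ADD);
            for (; i < a.length; i++) {
                sum += a[i] * b[i]; // scalar tail for leftover elements
            }
            return sum;
        }

        public static void main(String[] args) {
            float[] x = {1f, 2f, 3f, 4f, 5f};
            float[] y = {5f, 4f, 3f, 2f, 1f};
            System.out.println(dot(x, y)); // 35.0
        }
    }

SPECIES_PREFERRED selects the widest vector shape the host CPU supports (for example 256-bit AVX2 or 512-bit AVX-512), so the same Java code exploits whatever SIMD hardware is available.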
Quick Start & Requirements
- Install the CLI: jbang app install --force jlama@tjake
- Run models with jlama restapi <model_name>
- Enable the required JVM flags: export JDK_JAVA_OPTIONS="--add-modules jdk.incubator.vector --enable-preview"
- For embedded use, add the Maven dependencies jlama-core and jlama-native (a usage sketch follows below)
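Once the dependencies are on the classpath, embedding a model looks roughly like the following. This sketch is adapted from the project's documented sample; the package paths, the example model name (tjake/Llama-3.2-1B-Instruct-JQ4), and exact method signatures should be checked against the Jlama release you use.

    import com.github.tjake.jlama.model.AbstractModel;
    import com.github.tjake.jlama.model.ModelSupport;
    import com.github.tjake.jlama.model.functions.Generator;
    import com.github.tjake.jlama.safetensors.DType;
    import com.github.tjake.jlama.safetensors.SafeTensorSupport;
    import com.github.tjake.jlama.safetensors.prompt.PromptContext;

    import java.io.File;
    import java.util.UUID;

    // Adapted from the project's sample code; verify names against your version.
    public class JlamaQuickStart {
        public static void main(String[] args) throws Exception {
            // Download the model from Hugging Face, or reuse a local copy
            File localPath = SafeTensorSupport.maybeDownloadModel(
                    "./models", "tjake/Llama-3.2-1B-Instruct-JQ4");

            // Load with F32 activations and quantized (I8) weight memory
            AbstractModel model = ModelSupport.loadModel(localPath, DType.F32, DType.I8);

            // Use the model's chat template when one is available
            String prompt = "What is the best season to plant avocados?";
            PromptContext ctx = model.promptSupport().isPresent()
                    ? model.promptSupport().get().builder()
                        .addUserMessage(prompt)
                        .build()
                    : PromptContext.of(prompt);

            // Greedy decoding (temperature 0.0), up to 256 new tokens
            Generator.Response r = model.generate(
                    UUID.randomUUID(), ctx, 0.0f, 256, (token, time) -> {});
            System.out.println(r.responseText);
        }
    }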
Highlighted Details
Maintenance & Community
The project is maintained by T Jake Luciani. The roadmap includes support for additional models, pure-Java tokenizers, LoRA, GraalVM, and enhanced distributed inference.
Licensing & Compatibility
Licensed under the Apache License 2.0, permitting commercial use and integration with closed-source applications.
Limitations & Caveats
Jlama requires Java 21 with preview features enabled, which may not be acceptable in all production environments. Roadmap items such as GraalVM support are still under development.