wllama by ngxson

WebAssembly binding for on-browser LLM inference

Created 1 year ago
888 stars

Top 40.8% on SourcePulse

View on GitHub
Project Summary

This project provides WebAssembly (WASM) bindings for llama.cpp, enabling large language model (LLM) inference directly in the browser without a backend server or GPU. It targets web developers and researchers who want to add LLM capabilities to purely client-side applications.

How It Works

wllama compiles the C++ llama.cpp library to WebAssembly and uses WASM SIMD instructions to run models efficiently in the browser. Inference executes inside a web worker, so the UI thread is never blocked. The library exposes both a high-level API (completions, embeddings) and low-level control over tokenization, the KV cache, and sampling. Models can be split into smaller chunks for faster parallel downloads and to work around the 2GB ArrayBuffer size limit.

Quick Start & Requirements
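
The snippet below is a minimal sketch of the flow described above, not an official example: it assumes the package is published as @wllama/wllama and uses the Wllama, loadModelFromUrl, and createCompletion entry points from the upstream README; the WASM asset paths and model URL are placeholders, and option names may vary between releases.

```ts
import { Wllama } from "@wllama/wllama";

// Paths to the single- and multi-threaded WASM builds; adjust to wherever
// your bundler or static server exposes the wllama assets.
const CONFIG_PATHS = {
  "single-thread/wllama.wasm": "/wllama/single-thread/wllama.wasm",
  "multi-thread/wllama.wasm": "/wllama/multi-thread/wllama.wasm",
};

async function main(): Promise<void> {
  const wllama = new Wllama(CONFIG_PATHS);

  // Download a (small) GGUF model; inference runs inside a web worker,
  // so the page stays responsive while tokens are generated.
  await wllama.loadModelFromUrl(
    "https://example.com/models/tinyllama-q4_k_m.gguf" // placeholder URL
  );

  // High-level completion API.
  const output = await wllama.createCompletion("Once upon a time,", {
    nPredict: 64,
    sampling: { temp: 0.7, top_k: 40, top_p: 0.9 },
  });
  console.log(output);
}

main();
```

Whether the single- or multi-threaded build is used is decided automatically from browser support (see the header note under Limitations & Caveats).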

Highlighted Details

  • Runs LLMs directly in the browser using WebAssembly SIMD.
  • No backend or GPU required for inference.
  • High-level (completions, embeddings) and low-level (tokenization, KV cache) APIs; an embedding example is sketched after this list.
  • Supports parallel model downloads and automatic multi-threading based on browser support.
  • Models larger than 2GB can be split into smaller files.
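
As a rough illustration of the high-level embedding API listed above: the createEmbedding name is taken from the upstream README and may differ between releases, and the cosine helper below is not part of wllama.

```ts
import { Wllama } from "@wllama/wllama";

// Plain cosine similarity between two embedding vectors (local helper, not part of wllama).
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Compare two sentences with an embedding-capable GGUF model already loaded
// into the given Wllama instance (see the quick-start sketch above).
async function compareSentences(wllama: Wllama, s1: string, s2: string): Promise<number> {
  const e1 = await wllama.createEmbedding(s1);
  const e2 = await wllama.createEmbedding(s2);
  return cosineSimilarity(e1, e2);
}
```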

Maintenance & Community

  • Project maintained by ngxson.
  • Active development indicated by recent README updates and feature TODOs.
  • Community links (Discord/Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

  • The project appears to be MIT licensed, based on the package.json reference. It bundles llama.cpp, which is itself released under the MIT license.
  • Compatible with standard web development workflows and frameworks.

Limitations & Caveats

  • No WebGPU support currently, though it's a future possibility.
  • Models exceeding 2GB must be split using external tools (llama-gguf-split).
  • Multi-threading requires specific COOP/COEP headers to be set on the server (a header sketch follows this list).
  • IQ quantized models are not recommended due to potential performance and quality issues.
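
Regarding the multi-threading caveat: the headers in question are Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp, which make the page cross-origin isolated so SharedArrayBuffer (needed for WASM threads) is available. Below is a minimal sketch using an Express static server, purely as an illustration; any server or dev-server configuration that sets the same headers works.

```ts
import express from "express";

const app = express();

// These two headers enable cross-origin isolation, which SharedArrayBuffer
// (and therefore wllama's multi-threaded build) requires.
app.use((_req, res, next) => {
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  next();
});

// Serve the built web app (directory name is a placeholder).
app.use(express.static("dist"));

app.listen(8080);
```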

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 2

Star History

61 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

mistral.rs by EricLBuehler

0.3%
6k
LLM inference engine for blazing fast performance
Created 1 year ago
Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Vincent Weisser (Cofounder of Prime Intellect), and 7 more.

dalai by cocktailpeanut

0%
13k
Local LLM inference via CLI tool and Node.js API
Created 2 years ago
Updated 1 year ago