wllama by ngxson

WebAssembly binding for on-browser LLM inference

created 1 year ago
784 stars

Top 45.5% on sourcepulse

Project Summary

This project provides WebAssembly (WASM) bindings for llama.cpp, enabling large language model (LLM) inference to run directly in web browsers with no backend server and no GPU. It targets web developers and researchers who want to integrate LLM capabilities into purely client-side applications.

How It Works

wllama compiles the C++ llama.cpp library to WebAssembly with SIMD support, so llama.cpp models run efficiently in the browser. Inference executes inside a web worker, which keeps the UI thread from blocking. The library exposes both a high-level API (completions, embeddings) and low-level control over tokenization, the KV cache, and sampling. Models can be split into smaller chunks for faster parallel downloads and to stay under the 2GB ArrayBuffer size limit.
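For orientation, a minimal usage sketch in TypeScript. It assumes the package is published on npm as @wllama/wllama and uses the Wllama class, loadModelFromUrl, and createCompletion names from the project's README; the WASM path keys, the example model URL, and the option names shown here are assumptions that may differ between versions.

    // npm install @wllama/wllama   (assumed package name)
    import { Wllama } from '@wllama/wllama';

    // Tell wllama where its compiled WASM files live. The exact keys and paths
    // below are assumptions; copy them from the README of the version you install.
    const wllama = new Wllama({
      'single-thread/wllama.wasm': '/wllama/single-thread/wllama.wasm',
      'multi-thread/wllama.wasm': '/wllama/multi-thread/wllama.wasm',
    });

    // Download a GGUF model and run a short completion, entirely in the browser.
    await wllama.loadModelFromUrl('https://example.com/tinyllama.q4_k_m.gguf');
    const output = await wllama.createCompletion('Once upon a time,', {
      nPredict: 64, // number of tokens to generate (assumed option name)
    });
    console.log(output);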

Quick Start & Requirements

Highlighted Details

  • Runs LLMs directly in the browser using WebAssembly SIMD.
  • No backend or GPU required for inference.
  • High-level (completions, embeddings) and low-level (tokenization, KV cache) APIs; an embedding sketch follows this list.
  • Supports parallel model downloads and automatic multi-threading based on browser support.
  • Models larger than 2GB can be split into smaller files.
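A hedged sketch of the embedding side of the high-level API, reusing the wllama instance from the earlier example. The createEmbedding name follows the project's README, but treat the exact signature, the model URL, and any model-loading options as assumptions.

    // Load a model suitable for embeddings (hypothetical URL), then embed text.
    await wllama.loadModelFromUrl('https://example.com/embedding-model.q4_k_m.gguf');
    const vector: number[] = await wllama.createEmbedding('WebAssembly in the browser');
    console.log(vector.length); // dimensionality depends on the loaded model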

Maintenance & Community

  • Project maintained by ngxson.
  • Active development indicated by recent README updates and feature TODOs.
  • Community links (Discord/Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

  • The project appears to be MIT licensed, based on the package.json reference. It bundles llama.cpp, which is itself MIT licensed.
  • Compatible with standard web development workflows and frameworks.

Limitations & Caveats

  • No WebGPU support currently, though it's a future possibility.
  • Models exceeding 2GB must be split using external tools (llama-gguf-split).
  • Multi-threading requires specific COOP/COEP headers to be set on the server; a server sketch follows this list.
  • IQ quantized models are not recommended due to potential performance and quality issues.
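To make the COOP/COEP requirement concrete, here is a sketch of a static file server using Express. The framework choice and the dist directory are assumptions; any server that sends these two headers on every response works.

    import express from 'express';

    const app = express();
    app.use((_req, res, next) => {
      // These headers enable cross-origin isolation, which is required for
      // SharedArrayBuffer and therefore for multi-threaded WASM execution.
      res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
      res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
      next();
    });
    app.use(express.static('dist')); // assumed build output directory
    app.listen(8080, () => console.log('Serving on http://localhost:8080'));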

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 1
  • Star History: 107 stars in the last 90 days
