wllama by ngxson

WebAssembly binding for on-browser LLM inference

Created 1 year ago
888 stars

Top 40.8% on SourcePulse

View on GitHub
Project Summary

This project provides WebAssembly (WASM) bindings for llama.cpp, enabling large language model (LLM) inference directly in the browser without a backend server or GPU. It targets web developers and researchers who want to add LLM capabilities to purely client-side applications.

How It Works

wllama compiles the C++ llama.cpp library to WebAssembly and uses WASM SIMD instructions to run models efficiently in the browser. Inference executes inside a web worker, so the UI thread is never blocked. The library exposes both a high-level API (completions, embeddings) and low-level control over tokenization, the KV cache, and sampling. Models can be split into smaller chunks for faster parallel downloads and to work around the 2GB ArrayBuffer size limit.

Quick Start & Requirements
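
The snippet below is a minimal sketch of the flow described above, not an official example: it assumes the package is published as @wllama/wllama and uses the Wllama, loadModelFromUrl, and createCompletion entry points from the upstream README; the WASM asset paths and model URL are placeholders, and option names may vary between releases.

```ts
import { Wllama } from "@wllama/wllama";

// Paths to the single- and multi-threaded WASM builds; adjust to wherever
// your bundler or static server exposes the wllama assets.
const CONFIG_PATHS = {
  "single-thread/wllama.wasm": "/wllama/single-thread/wllama.wasm",
  "multi-thread/wllama.wasm": "/wllama/multi-thread/wllama.wasm",
};

async function main(): Promise<void> {
  const wllama = new Wllama(CONFIG_PATHS);

  // Download a (small) GGUF model; inference runs inside a web worker,
  // so the page stays responsive while tokens are generated.
  await wllama.loadModelFromUrl(
    "https://example.com/models/tinyllama-q4_k_m.gguf" // placeholder URL
  );

  // High-level completion API.
  const output = await wllama.createCompletion("Once upon a time,", {
    nPredict: 64,
    sampling: { temp: 0.7, top_k: 40, top_p: 0.9 },
  });
  console.log(output);
}

main();
```

Whether the single- or multi-threaded build is used is decided automatically from browser support (see the header note under Limitations & Caveats).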

Highlighted Details

  • Runs LLMs directly in the browser using WebAssembly SIMD.
  • No backend or GPU required for inference.
  • High-level (completions, embeddings) and low-level (tokenization, KV cache) APIs; an embedding example is sketched after this list.
  • Supports parallel model downloads and automatic multi-threading based on browser support.
  • Models larger than 2GB can be split into smaller files.
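
As a rough illustration of the high-level embedding API listed above: the createEmbedding name is taken from the upstream README and may differ between releases, and the cosine helper below is not part of wllama.

```ts
import { Wllama } from "@wllama/wllama";

// Plain cosine similarity between two embedding vectors (local helper, not part of wllama).
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Compare two sentences with an embedding-capable GGUF model already loaded
// into the given Wllama instance (see the quick-start sketch above).
async function compareSentences(wllama: Wllama, s1: string, s2: string): Promise<number> {
  const e1 = await wllama.createEmbedding(s1);
  const e2 = await wllama.createEmbedding(s2);
  return cosineSimilarity(e1, e2);
}
```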

Maintenance & Community

  • Project maintained by ngxson.
  • Active development indicated by recent README updates and feature TODOs.
  • Community links (Discord/Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

  • The project appears to be MIT licensed, based on the package.json reference. It bundles llama.cpp, which is itself released under the MIT license.
  • Compatible with standard web development workflows and frameworks.

Limitations & Caveats

  • No WebGPU support currently, though it's a future possibility.
  • Models exceeding 2GB must be split using external tools (llama-gguf-split).
  • Multi-threading requires specific COOP/COEP headers to be set on the server (a header sketch follows this list).
  • IQ quantized models are not recommended due to potential performance and quality issues.
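
Regarding the multi-threading caveat: the headers in question are Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp, which make the page cross-origin isolated so SharedArrayBuffer (needed for WASM threads) is available. Below is a minimal sketch using an Express static server, purely as an illustration; any server or dev-server configuration that sets the same headers works.

```ts
import express from "express";

const app = express();

// These two headers enable cross-origin isolation, which SharedArrayBuffer
// (and therefore wllama's multi-threaded build) requires.
app.use((_req, res, next) => {
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  next();
});

// Serve the built web app (directory name is a placeholder).
app.use(express.static("dist"));

app.listen(8080);
```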

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 2

Star History

61 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

mistral.rs by EricLBuehler

0.3%
6k
LLM inference engine for blazing fast performance
Created 1 year ago
Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Vincent Weisser (Cofounder of Prime Intellect), and 7 more.

dalai by cocktailpeanut

0%
13k
Local LLM inference via CLI tool and Node.js API
Created 2 years ago
Updated 1 year ago