lm-format-enforcer  by noamgat

Format enforcer for language model outputs (JSON, regex, etc.)

created 1 year ago
1,860 stars

Top 23.8% on sourcepulse

Project Summary

This library enforces structured output formats (JSON Schema, Regex) on language models, targeting developers and researchers who need reliable, predictable LLM responses. It filters the tokens the model may emit at each generation step, so the output always adheres to the specified format while the model keeps as much freedom as that format allows.

How It Works

The core mechanism combines a character-level parser with a prefix tree built from the tokenizer's vocabulary. The character-level parser tracks which characters may come next under a given format (e.g., JSON Schema, Regex). The prefix tree indexes every token the LLM can generate by the characters that make it up. By walking the two structures together, the library finds the tokens whose characters are all accepted by the parser and filters out every other token at each generation step. Because only invalid tokens are removed, the LLM keeps control of whitespace and field ordering, which may improve output quality.
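
The intersection described above can be illustrated with a toy sketch. It is not the library's code: `LiteralParser`, `TrieNode`, `build_trie`, `allowed_token_ids`, and the small vocabulary are hypothetical names and data made up for this example, and the "format" is simply one of three literal strings.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LiteralParser:
    """Toy character-level parser (hypothetical): accepts one of a few literals."""
    literals: tuple
    prefix: str = ""

    def allowed_characters(self):
        # Characters that can legally follow what has been consumed so far.
        n = len(self.prefix)
        return {s[n] for s in self.literals
                if s.startswith(self.prefix) and len(s) > n}

    def add_character(self, ch):
        # Return a new parser state; the current state is left untouched.
        return LiteralParser(self.literals, self.prefix + ch)


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    token_ids: list = field(default_factory=list)


def build_trie(vocab):
    """Build a character-keyed prefix tree from a {token_text: token_id} mapping."""
    root = TrieNode()
    for text, token_id in vocab.items():
        node = root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        node.token_ids.append(token_id)
    return root


def allowed_token_ids(parser, node):
    """Walk parser and trie together; collect tokens whose every character is valid."""
    allowed = list(node.token_ids)
    for ch in parser.allowed_characters():
        child = node.children.get(ch)
        if child is not None:
            allowed.extend(allowed_token_ids(parser.add_character(ch), child))
    return allowed


# Made-up vocabulary for illustration only.
VOCAB = {"t": 0, "tr": 1, "true": 2, "f": 3, "fal": 4, "se": 5,
         "ue": 6, "x": 7, "nu": 8, "ll": 9, "truth": 10}
LITERALS = ("true", "false", "null")
TRIE = build_trie(VOCAB)

at_start = sorted(allowed_token_ids(LiteralParser(LITERALS), TRIE))
after_tr = sorted(allowed_token_ids(LiteralParser(LITERALS, "tr"), TRIE))
print(at_start)  # [0, 1, 2, 3, 4, 8]
print(after_tr)  # [6]
```

At the start, `"x"` and `"truth"` are rejected because some character in them falls outside the format, while `"ll"` is rejected only because it cannot come first. Once `"tr"` has been generated, the single remaining option is `"ue"`.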

Quick Start & Requirements

Highlighted Details

  • Supports integration with Transformers, LangChain, LlamaIndex, llama.cpp, vLLM, Haystack, and TensorRT-LLM.
  • Handles JSON Schema (including optional fields, nested structures, and recursive classes), JSON Mode, and Regular Expressions.
  • Supports batched generation and beam searches.
  • Allows LLMs to control JSON whitespace and field ordering, reducing potential hallucinations.
  • Offers detailed diagnostics to analyze the impact of format enforcement on token selection.

Maintenance & Community

  • Active development with contributions from multiple individuals.
  • Integration into vLLM server noted.
  • Community support channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Cannot be used with API-based solutions like OpenAI ChatGPT due to the requirement for direct logit access.
  • Regex syntax support is incomplete because regex parsing relies on the interegular library.
  • The Regex parser can only generate characters present in the tokenizer's vocabulary.
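
The first caveat follows from how token filtering is typically applied: the scores of disallowed tokens are overwritten before the next token is chosen, which requires direct access to the logits. The sketch below illustrates that step on plain Python lists. `mask_logits` and `greedy_pick` are hypothetical helper names and the scores are made up; a real integration would operate on the inference framework's tensors.

```python
import math


def mask_logits(logits, allowed_ids):
    """Return a copy of logits where every disallowed token scores -inf."""
    allowed = set(allowed_ids)
    return [score if i in allowed else -math.inf
            for i, score in enumerate(logits)]


def greedy_pick(logits, allowed_ids):
    """Pick the highest-scoring token among the allowed ones."""
    if not allowed_ids:
        raise ValueError("no token is allowed at this step")
    masked = mask_logits(logits, allowed_ids)
    return max(range(len(masked)), key=masked.__getitem__)


# Made-up scores: token 1 is the model's favourite but is not allowed.
example_logits = [2.0, 5.0, 1.0, 3.0]
example_pick = greedy_pick(example_logits, [0, 3])
print(example_pick)  # 3
```

A hosted API that returns only generated text offers no point at which this masking could be applied.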

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 72 stars in the last 90 days
