lm-format-enforcer  by noamgat

Format enforcer for language model outputs (JSON, regex, etc.)

Created 2 years ago
1,923 stars

Top 22.8% on SourcePulse

Project Summary

This library enforces structured output formats (JSON Schema, Regex) for language models, targeting developers and researchers who need reliable, predictable LLM responses. At each generation step it filters the set of tokens the model may emit, so the output always adheres to the specified format while the model's remaining freedom is constrained as little as possible.

How It Works

The core mechanism combines a character-level parser with a prefix tree built from the tokenizer's vocabulary. The character-level parser defines which characters may come next for a given format (e.g., JSON Schema, Regex). The prefix tree indexes every token the LLM can generate by the characters it contains. By walking the two structures together, the library finds the tokens whose every character the parser accepts, and filters out all other tokens at each generation step. Because only invalid tokens are removed, the LLM keeps control of whitespace and field ordering, potentially improving output quality.
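
The intersection described above can be sketched in a few lines of Python. This is a toy illustration, not the library's implementation: `TrieNode`, `build_trie`, `DigitsThenBang`, `allowed_tokens`, and the seven-token vocabulary are all hypothetical names and data invented for this example. The toy format is "one or more digits, then `!`".

```python
from typing import Dict, List


class TrieNode:
    """One node of a character trie over the tokenizer vocabulary."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.token_ids: List[int] = []  # tokens whose text ends exactly here


def build_trie(vocab: Dict[int, str]) -> TrieNode:
    """Index every token by its characters (assumes no empty-string tokens)."""
    root = TrieNode()
    for token_id, text in vocab.items():
        node = root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        node.token_ids.append(token_id)
    return root


class DigitsThenBang:
    """Toy character-level parser: one or more digits followed by '!'."""

    def __init__(self, seen_digit: bool = False, done: bool = False) -> None:
        self.seen_digit = seen_digit
        self.done = done

    def allowed_characters(self) -> str:
        if self.done:
            return ""  # nothing may follow the closing '!'
        return "0123456789" + ("!" if self.seen_digit else "")

    def add_character(self, ch: str) -> "DigitsThenBang":
        # Only called with characters from allowed_characters().
        return DigitsThenBang(seen_digit=True, done=(ch == "!"))


def allowed_tokens(parser: DigitsThenBang, node: TrieNode) -> List[int]:
    """Walk parser and trie together; keep tokens fully accepted by the parser."""
    result = list(node.token_ids)
    for ch in parser.allowed_characters():
        child = node.children.get(ch)
        if child is not None:
            result.extend(allowed_tokens(parser.add_character(ch), child))
    return result


# Hypothetical vocabulary: token id -> token text.
VOCAB = {0: "1", 1: "12", 2: "!", 3: "2!", 4: "a", 5: "!!", 6: "1a"}
TRIE = build_trie(VOCAB)

start = DigitsThenBang()
print(sorted(allowed_tokens(start, TRIE)))                     # [0, 1, 3]
print(sorted(allowed_tokens(start.add_character("1"), TRIE)))  # [0, 1, 2, 3]
```

At the start only tokens that begin with a digit survive; `"!"` becomes available once a digit has been consumed, while `"!!"`, `"a"`, and `"1a"` are never allowed. The recursion visits only trie branches the parser accepts, so the cost depends on the number of valid prefixes rather than on the full vocabulary size.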

Quick Start & Requirements

Highlighted Details

  • Supports integration with Transformers, LangChain, LlamaIndex, llama.cpp, vLLM, Haystack, and TensorRT-LLM.
  • Handles JSON Schema (including optional fields, nested structures, and recursive classes), JSON Mode, and Regular Expressions.
  • Supports batched generation and beam searches.
  • Allows LLMs to control JSON whitespace and field ordering, reducing potential hallucinations.
  • Offers detailed diagnostics to analyze the impact of format enforcement on token selection.

Maintenance & Community

  • Active development with contributions from multiple individuals.
  • Integration into vLLM server noted.
  • Community support channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Cannot be used with API-based solutions like OpenAI ChatGPT due to the requirement for direct logit access.
  • Regex syntax support is not 100% complete due to reliance on interegular.
  • The Regex parser can only generate characters present in the tokenizer's vocabulary.
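
To see why direct logit access is required, consider how a token filter is typically applied during decoding. The sketch below is a generic illustration under stated assumptions, not the library's code: `mask_logits` and `greedy_pick` are hypothetical helpers, and plain Python lists stand in for the model's logit tensor.

```python
import math
from typing import List, Sequence


def mask_logits(logits: Sequence[float], allowed_token_ids: Sequence[int]) -> List[float]:
    """Set the score of every disallowed token to -inf so it can never be chosen."""
    allowed = set(allowed_token_ids)
    return [score if i in allowed else -math.inf for i, score in enumerate(logits)]


def greedy_pick(logits: Sequence[float], allowed_token_ids: Sequence[int]) -> int:
    """Pick the highest-scoring token among the allowed ones."""
    if not allowed_token_ids:
        raise ValueError("the format leaves no valid token to generate")
    masked = mask_logits(logits, allowed_token_ids)
    return max(range(len(masked)), key=masked.__getitem__)


# Token 1 has the best raw score, but the format only permits tokens 0 and 3.
logits = [2.0, 5.0, 1.0, 3.0]
print(greedy_pick(logits, allowed_token_ids=[0, 3]))  # 3
```

A hosted API that returns only the sampled text offers no point at which this masking step could run, which is why the filter needs to sit inside the generation loop.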

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 50 stars in the last 30 days
