lm-format-enforcer by noamgat

Format enforcer for language model outputs (JSON, regex, etc.)

Created 2 years ago

1,976 stars

Top 22.1% on SourcePulse

5 Experts Love This Project

youkaichao: Core Maintainer of vLLM
chiphuyen: Author of "AI Engineering", "Designing Machine Learning Systems"
andreasjansson (Andreas Jansson): Cofounder of Replicate
tholor: Cofounder of deepset
and 1 more

Project Summary

This library enforces structured output formats (JSON Schema, Regex) on language model output, targeting developers and researchers who need reliable, predictable LLM responses. It filters the tokens the model may generate at each step, which ensures adherence to the specified format while constraining the model's choices as little as the format allows.

How It Works

The core mechanism combines a character-level parser with a prefix tree built from the tokenizer's vocabulary. The character-level parser determines which characters may come next under a given format (e.g., JSON Schema, Regex). The prefix tree indexes every token the LLM can generate by the characters it contains. By walking the two structures together, the library finds the tokens whose characters are all accepted by the parser and filters out every other token at each generation step. Because only invalid tokens are removed, the LLM remains free to choose whitespace and field ordering, which may improve output quality.
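The intersection of a character-level parser with a token prefix tree can be illustrated with a minimal sketch. This is not the library's implementation: TrieNode, build_trie, DigitsParser, and allowed_tokens are hypothetical names, the seven-entry vocabulary is invented, and the format (one or more digits followed by ';') is a toy stand-in for a JSON Schema or regex parser.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)   # char -> TrieNode
    token_ids: list = field(default_factory=list)  # ids of tokens ending here


def build_trie(vocab):
    """Build a character prefix tree over a (toy) tokenizer vocabulary."""
    root = TrieNode()
    for token_id, text in enumerate(vocab):
        node = root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        node.token_ids.append(token_id)
    return root


class DigitsParser:
    """Toy character-level parser: one or more digits, then ';'."""

    def __init__(self, state="start"):
        self.state = state  # "start", "digits", or "done"

    def step(self, ch):
        """Return the parser after consuming ch, or None if ch is invalid."""
        if self.state in ("start", "digits") and ch.isdigit():
            return DigitsParser("digits")
        if self.state == "digits" and ch == ";":
            return DigitsParser("done")
        return None


def allowed_tokens(node, parser):
    """Walk trie and parser together; collect tokens valid in full."""
    allowed = list(node.token_ids)
    for ch, child in node.children.items():
        next_parser = parser.step(ch)
        if next_parser is not None:  # an invalid char prunes the whole subtree
            allowed.extend(allowed_tokens(child, next_parser))
    return allowed


# Hypothetical vocabulary; the list index serves as the token id.
vocab = ["1", "12", "2;", ";", "a", "1a", "7;8"]
trie = build_trie(vocab)

print(sorted(allowed_tokens(trie, DigitsParser("start"))))   # [0, 1, 2]
print(sorted(allowed_tokens(trie, DigitsParser("digits"))))  # [0, 1, 2, 3]
```

At the start of the output the token ";" is rejected because no digit has been seen yet, while after a digit it becomes valid. The token "7;8" is never allowed, since the toy format ends at ';'. Pruning a subtree on the first invalid character is what keeps the walk cheap for a large vocabulary.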

Quick Start & Requirements

Install via pip: pip install lm-format-enforcer
Requires Python 3.x.
For GPU usage with Hugging Face Transformers, also install transformers, torch, huggingface_hub, optimum, and auto-gptq (a build targeting CUDA 11.8).
Official Colab Notebook: https://colab.research.google.com/github/noamgat/lm-format-enforcer/blob/main/docs/notebooks/transformers_example.ipynb

Highlighted Details

Supports integration with Transformers, LangChain, LlamaIndex, llama.cpp, vLLM, Haystack, and TensorRT-LLM.
Handles JSON Schema (including optional fields, nested structures, and recursive classes), JSON Mode, and Regular Expressions.
Supports batched generation and beam searches.
Allows LLMs to control JSON whitespace and field ordering, reducing potential hallucinations.
Offers detailed diagnostics to analyze the impact of format enforcement on token selection.

Maintenance & Community

Active development with contributions from multiple individuals.
The README notes that the library is integrated into the vLLM server.
Community support channels are not explicitly mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Cannot be used with API-based solutions like OpenAI ChatGPT due to the requirement for direct logit access.
Regex syntax support is not 100% complete due to reliance on interegular.
The Regex parser can only generate characters present in the tokenizer's vocabulary.
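The requirement for direct logit access can be made concrete with a small sketch of token masking, one common way to apply a per-step allowed-token list. The names mask_logits and softmax, the four scores, and the allowed ids are all hypothetical; the library's actual integrations hook into each framework's own generation API and are not shown here.

```python
import math


def mask_logits(logits, allowed_ids):
    """Set the score of every disallowed token to -inf."""
    allowed = set(allowed_ids)
    if not allowed:
        raise ValueError("the format must allow at least one token")
    return [score if i in allowed else -math.inf
            for i, score in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax; -inf entries receive probability 0."""
    peak = max(logits)
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for a 4-token vocabulary at one generation step.
logits = [2.0, 0.5, 3.0, 1.0]
# Hypothetical set of token ids the format permits at this step.
allowed_ids = [0, 3]

probs = softmax(mask_logits(logits, allowed_ids))
best = max(range(len(probs)), key=probs.__getitem__)

print(best)                # 0
print(probs[1], probs[2])  # 0.0 0.0
```

Without the mask the model would pick token 2, its highest-scoring choice. With the mask it picks token 0, the best token the format permits. A hosted API that returns only sampled text gives no opportunity to apply this step.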

Health Check

Last Commit: 4 months ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0
Star History: 17 stars in the last 30 days

Explore Similar Projects

python-toon by xaviviro

LLM data serialization for token efficiency

Created 2 months ago

Updated 2 months ago

Starred by Sam Partee (Cofounder of Arcade).

super-json-mode by varunshenoy

Framework for accelerated structured output generation from LLMs

Created 2 years ago

Updated 1 year ago

Awesome-LLM-Constrained-Decoding by Saibo-creator

LLM constrained decoding research and resources

Created 1 year ago

Updated 2 months ago

Starred by Travis Fischer (Founder of Agentic) and Jeff Hammerbacher (Cofounder of Cloudera).

rellm by r2d4

Regex tool for language model completion

Created 2 years ago

Updated 2 years ago

Starred by Teknium (Cofounder of Nous Research).

antislop-sampler by sam-paech

Sampler for reducing undesirable LLM outputs via backtracking

Created 1 year ago

Updated 5 months ago

prompt-optimizer by vaibkumr

CLI tool to minimize LLM token complexity, reducing API costs

Created 2 years ago

Updated 1 year ago

syncode by structuredllm

Grammar-guided LLM generation framework ensuring syntactically valid output

Created 2 years ago

Updated 1 month ago

Starred by Philipp Schmid (DevRel at Google DeepMind).

local-llm-function-calling by rizerphe

Tool for local LLM function calling with JSON schema enforcement

Created 2 years ago

Updated 1 year ago

Starred by Bryan Helmig (Cofounder of Zapier).

llguidance by guidance-ai

Fast constrained decoding for LLMs

Created 1 year ago

Updated 1 month ago

Starred by Tim Suchanek (Founder of expand.ai), Mckay Wrigley (Founder of Takeoff AI), and 1 more.

gpt-tokenizer by niieani

JS library for OpenAI GPT model token encoding/decoding

Created 2 years ago

Updated 1 week ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Hiroshi Shibata (Core Contributor to Ruby), and 9 more.

toon by toon-format

Compact data format for LLMs

Created 2 months ago

Updated 3 days ago

Starred by Nat Friedman (Former CEO of GitHub), Eric Zhang (Founding Engineer at Modal), and 31 more.

tiktoken by openai

Fast BPE tokenizer for OpenAI models

Created 3 years ago

Updated 3 months ago
