PICARD: Constrained decoding library for language models
PICARD (Parsing Incrementally for Constrained Auto-Regressive Decoding) is a library for enforcing structural constraints during text generation, particularly useful for tasks like text-to-SQL. It enables unmodified pre-trained language models to generate valid, semantically correct outputs without requiring specialized decoder architectures or retraining.
How It Works
PICARD integrates with standard beam search decoding by incrementally parsing the generated token sequences. It uses the incremental parsing library attoparsec to check the validity of potential next tokens against predefined grammar rules (e.g., SQL syntax). Invalid tokens are pruned from the beam, so only syntactically correct sequences are explored. This approach leaves the underlying language model unmodified and requires no additional training.
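As a rough illustration (not PICARD's actual implementation), the pruning step can be sketched in Python with the grammar reduced to a hypothetical whitelist of complete statements; every name below is illustrative:

```python
# Toy sketch of constrained beam pruning. In real PICARD the validity
# check is an incremental SQL parser; here it is a prefix test against
# a tiny whitelist of valid outputs.
VALID_OUTPUTS = {
    "SELECT name FROM users",
    "SELECT id FROM orders",
}

def is_viable_prefix(text: str) -> bool:
    """True if `text` could still be completed into a valid statement."""
    return any(full.startswith(text) for full in VALID_OUTPUTS)

def prune_beam(candidates):
    """Keep only (text, score) beam candidates that remain parseable."""
    return [(text, score) for text, score in candidates if is_viable_prefix(text)]

beam = [
    ("SELECT name", -0.1),    # viable prefix, kept
    ("SELECT nmae", -0.2),    # typo: no valid completion, pruned
    ("SELECT id FROM", -0.3), # viable prefix, kept
]
print(prune_beam(beam))  # → [('SELECT name', -0.1), ('SELECT id FROM', -0.3)]
```

The key point is that pruning happens on partial outputs at every decoding step, so the model never wastes beam slots on sequences that cannot be completed into valid SQL.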
Quick Start & Requirements
git clone git@github.com:ElementAI/picard.git
cd picard
git submodule update --init --recursive
The parser depends on attoparsec (via Haskell). The default training configuration targets GPUs with 40GB of memory. Pre-trained checkpoints are available on the Hugging Face Hub (e.g., tscholak/cxmefzzi for T5-3B). Docker images are provided for development, training, and evaluation.
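Attoparsec's incremental interface reports one of three results for a partial input: Done (a complete parse), Partial (more input could still succeed), or Fail (no continuation can succeed). A minimal Python sketch of that three-way check, over a hypothetical toy grammar `SELECT <id> FROM <id>` (all names here are illustrative, not part of PICARD's API):

```python
# Mimic attoparsec's Done / Partial / Fail result for a toy grammar.
ID = "<id>"
GRAMMAR = ["SELECT", ID, "FROM", ID]  # expected token kinds, in order

def kind(token: str):
    """Classify a token as a keyword, an identifier, or invalid (None)."""
    if token in ("SELECT", "FROM"):
        return token
    return ID if token.isidentifier() else None

def classify(tokens):
    """Incrementally check a token sequence against GRAMMAR."""
    if len(tokens) > len(GRAMMAR):
        return "Fail"
    for token, expected in zip(tokens, GRAMMAR):
        if kind(token) != expected:
            return "Fail"
    return "Done" if len(tokens) == len(GRAMMAR) else "Partial"

print(classify(["SELECT"]))                          # → Partial
print(classify(["SELECT", "name", "FROM", "users"])) # → Done
print(classify(["SELECT", "FROM"]))                  # → Fail
```

A "Fail" verdict on a partial sequence is what allows a beam candidate to be discarded early, before the model has finished generating it.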
Maintenance & Community
This project originated from Element AI and is now a ServiceNow Research project. The primary contributor is Torsten Scholak.
Licensing & Compatibility
The repository does not explicitly state a license in the README. However, the project is hosted on GitHub under the ServiceNow organization. Users should verify licensing for commercial or closed-source use.
Limitations & Caveats
The README notes that the default training configuration targets a 40GB GPU, which may rule out training on consumer hardware. The parsing rules cover a subset of SQLite syntax for the text-to-SQL task; extending PICARD to other SQL dialects or grammars would require implementing a custom parser.