Data exploration tool for LLM dataset curation and quality control
Lilac is an open-source tool for exploring, curating, and quality-controlling datasets used to train, fine-tune, and monitor Large Language Models (LLMs). It targets data scientists, ML engineers, and researchers who need to improve the quality of their LLM data and reduce its cost. Lilac offers interactive visualization, LLM-powered search, filtering, clustering, and annotation, and it runs locally with both a UI and a Python API.
How It Works
Lilac uses LLMs to analyze and manipulate data. Users can compute "signals" on dataset columns, such as language detection, PII identification, near-duplicate detection, and text statistics. Embeddings can be computed for semantic search, and a "concept search" feature gives more controlled retrieval by ranking rows against user-defined positive and negative examples. For computationally intensive tasks such as clustering or embedding, Lilac can optionally offload the work to its hosted platform, Lilac Garden, to speed it up.
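To make the idea of retrieval from positive and negative examples concrete, here is a minimal, standard-library sketch. It is not Lilac's API or its implementation: the bag-of-words `embed` function stands in for a real embedding model, and every name (`embed`, `concept_score`, `concept_search`) is hypothetical. Each row is scored by how much closer it sits to the positive examples than to the negative ones.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(texts: list[str]) -> Counter:
    """Sum the toy embeddings of a list of example texts."""
    total = Counter()
    for t in texts:
        total.update(embed(t))
    return total


def concept_score(text: str, positives: list[str], negatives: list[str]) -> float:
    """Similarity to the positive examples minus similarity to the negatives."""
    e = embed(text)
    return cosine(e, centroid(positives)) - cosine(e, centroid(negatives))


def concept_search(rows, positives, negatives, top_k=3):
    """Return the top_k (score, row) pairs, highest score first."""
    scored = [(concept_score(r, positives, negatives), r) for r in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical "refund request" concept defined purely by examples.
positives = ["please refund my order", "I want my money back"]
negatives = ["thanks for the quick delivery", "great product, love it"]
rows = [
    "can I get a refund for this order",
    "love the product",
    "the weather is nice today",
]

for score, row in concept_search(rows, positives, negatives):
    print(f"{score:+.3f}  {row}")
```

Adding a negative example is what gives the extra control: a row that merely shares vocabulary with unwanted text is pushed down the ranking instead of being treated as neutral.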
Quick Start & Requirements
pip install lilac[all]
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats