lilac by databricks

Data exploration tool for LLM dataset curation and quality control

Created 2 years ago

1,066 stars

Top 35.3% on SourcePulse

11 Experts Love This Project

Tobi Lutke: Cofounder of Shopify
John Resig: Author of jQuery; Chief Software Architect at Khan Academy
Casper Hansen: Author of AutoAWQ
Dan Guido: Cofounder of Trail of Bits
and 7 more

Project Summary

Lilac is an open-source tool designed for the exploration, curation, and quality control of datasets used in training, fine-tuning, and monitoring Large Language Models (LLMs). It targets data scientists, ML engineers, and researchers who need to improve the quality and reduce the cost of LLM data. Lilac offers interactive visualization, LLM-powered search, filtering, clustering, and annotation capabilities, running locally with a UI and Python API.

How It Works

Lilac enriches a dataset by computing "signals" over its columns, such as language detection, PII identification, near-duplicate detection, and text statistics. Embeddings can be computed for semantic search, and a "concept search" feature gives more controlled retrieval by scoring text against user-defined positive and negative examples. Computationally intensive tasks such as clustering or embedding can optionally be offloaded to Lilac's hosted platform, Lilac Garden, for faster processing.
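
To make the idea of concept search concrete, here is a minimal sketch of retrieval driven by positive and negative examples. It is not Lilac's implementation and does not use the Lilac API: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and every function name here (`embed`, `cosine`, `centroid`, `concept_score`, `concept_search`) is hypothetical, chosen for illustration only.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(texts):
    """Sum the toy embeddings of several example texts."""
    total = Counter()
    for text in texts:
        total.update(embed(text))
    return total


def concept_score(text, positives, negatives):
    """Higher when text resembles the positives more than the negatives."""
    vector = embed(text)
    return cosine(vector, centroid(positives)) - cosine(vector, centroid(negatives))


def concept_search(rows, positives, negatives, top_k=3):
    """Return the top_k (score, row) pairs, best match first."""
    scored = [(concept_score(row, positives, negatives), row) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical "refund request" concept defined by a few examples.
positives = ["please refund my order", "I want my money back"]
negatives = ["how do I reset my password", "my login does not work"]
rows = [
    "can I get a refund for this order",
    "I forgot my password again",
    "the weather is nice today",
]

for score, row in concept_search(rows, positives, negatives):
    print(f"{score:+.3f}  {row}")
```

The negative examples are what distinguish this from plain semantic search: a row that shares vocabulary with both sets is pushed down rather than returned as a match.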

Quick Start & Requirements

Install: pip install "lilac[all]" (the quotes keep shells such as zsh from interpreting the brackets)
Requirements: Python 3.8+. No GPU is required to run locally, though one is recommended for faster embedding and clustering.
Demo: Lilac web demo
Documentation: Installation Guide, Loading Data, Explore

Highlighted Details

LLM-powered interactive exploration, filtering, and clustering.
Signals for PII detection, language detection, near-duplicates, and text statistics.
Semantic and concept-based search capabilities.
Offload compute-intensive tasks to Lilac Garden for accelerated processing.
Supports loading data from HuggingFace, Parquet, CSV, JSON, LangSmith, and more.
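
As an illustration of what a per-column signal produces, the sketch below computes text statistics and a near-duplicate cluster id for each row. It is a simplified stand-in, not Lilac's code or API: it uses exact Jaccard similarity over character trigrams with a greedy grouping pass, and all names (`text_statistics`, `shingles`, `jaccard`, `near_duplicate_clusters`, `compute_signals`) and the 0.7 threshold are assumptions made for this example.

```python
import re


def text_statistics(text):
    """Simple per-row statistics for one text value."""
    words = re.findall(r"\w+", text)
    avg = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "num_characters": len(text),
        "num_words": len(words),
        "avg_word_length": avg,
    }


def shingles(text, n=3):
    """Set of character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= n:
        return {normalized}
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicate_clusters(texts, threshold=0.7):
    """Greedy grouping: a row joins the first earlier cluster it resembles."""
    representatives = []  # (cluster_id, shingle_set) of each cluster's first row
    cluster_ids = []
    for text in texts:
        current = shingles(text)
        for cluster_id, rep in representatives:
            if jaccard(current, rep) >= threshold:
                cluster_ids.append(cluster_id)
                break
        else:
            new_id = len(representatives)
            cluster_ids.append(new_id)
            representatives.append((new_id, current))
    return cluster_ids


def compute_signals(rows, column):
    """Return copies of rows enriched with the two signal outputs."""
    ids = near_duplicate_clusters([row[column] for row in rows])
    return [
        {**row,
         "text_statistics": text_statistics(row[column]),
         "near_dup_cluster_id": cluster_id}
        for row, cluster_id in zip(rows, ids)
    ]


rows = [
    {"text": "Hello world, this is a test."},
    {"text": "hello world,  this is a test!"},
    {"text": "Completely different sentence here."},
]
for enriched in compute_signals(rows, "text"):
    print(enriched["near_dup_cluster_id"], enriched["text_statistics"])
```

Exact pairwise Jaccard is quadratic in the number of clusters, so a real tool would more likely use an approximate scheme such as MinHash on large datasets; the exact version is used here only to keep the sketch short.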

Maintenance & Community

Developed with contributions from Databricks and Cohere; note that the Health Check below lists the last commit as 1 year ago and the project as inactive.
Community support via Discord.
GitHub issues for bugs and feature requests.

Licensing & Compatibility

Apache 2.0 License.
Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

Local clustering and embedding can be slow without a powerful GPU.
Lilac Garden is a hosted platform with potential costs and data privacy considerations for sensitive datasets.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive

Star History: 0 stars in the last 30 days