lilac by databricks

Data exploration tool for LLM dataset curation and quality control

created 2 years ago
1,049 stars

Top 36.5% on sourcepulse

Project Summary

Lilac is an open-source tool designed for the exploration, curation, and quality control of datasets used in training, fine-tuning, and monitoring Large Language Models (LLMs). It targets data scientists, ML engineers, and researchers who need to improve the quality and reduce the cost of LLM data. Lilac offers interactive visualization, LLM-powered search, filtering, clustering, and annotation capabilities, running locally with a UI and Python API.

How It Works

Lilac leverages LLMs for advanced data analysis and manipulation. It allows users to compute various "signals" on dataset columns, such as language detection, PII identification, near-duplicate detection, and text statistics. Embeddings can be computed for semantic search, and a novel "concept search" feature allows for more controlled retrieval based on user-defined positive and negative examples. For computationally intensive tasks like clustering or embedding, Lilac offers an optional offload to its hosted platform, Lilac Garden, for significant speedups.
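The idea of retrieving rows from user-defined positive and negative examples can be illustrated with a small, self-contained sketch. This is not Lilac's implementation or API: the `embed`, `cosine`, and `concept_score` helpers are hypothetical names, and a toy bag-of-words vector stands in for a real embedding model. The sketch only shows the general shape of the technique, which is to score each row by how close it sits to the positives and how far from the negatives.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (token -> count); a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def concept_score(text, positives, negatives):
    """Mean similarity to the positive examples minus mean similarity to the negatives."""
    vec = embed(text)
    pos = sum(cosine(vec, embed(p)) for p in positives) / len(positives)
    neg = sum(cosine(vec, embed(n)) for n in negatives) / len(negatives)
    return pos - neg


# Hypothetical "gratitude" concept defined by a few labeled examples.
positives = ["thank you so much for the help", "thanks, this was great"]
negatives = ["this is terrible and broken", "the worst product ever"]

rows = [
    "thank you for the great help",
    "the product is broken and terrible",
    "the meeting is at noon",
]

# Rank rows from most to least concept-like.
ranked = sorted(rows, key=lambda r: concept_score(r, positives, negatives), reverse=True)
for row in ranked:
    print(f"{concept_score(row, positives, negatives):+.3f}  {row}")
```

Adding or removing examples changes the ranking directly, which is what makes this style of retrieval more controllable than a single free-text semantic query.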

Highlighted Details

  • LLM-powered interactive exploration, filtering, and clustering.
  • Signals for PII detection, language detection, near-duplicates, and text statistics.
  • Semantic and concept-based search capabilities.
  • Offload compute-intensive tasks to Lilac Garden for accelerated processing.
  • Supports loading data from HuggingFace, Parquet, CSV, JSON, LangSmith, and more.
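To make the notion of a per-column signal concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not Lilac's API: `text_statistics`, `shingles`, `jaccard`, and `near_duplicate_pairs` are hypothetical helpers, and near-duplicates are found with a simple Jaccard comparison over word shingles rather than whatever method Lilac uses internally.

```python
import re


def text_statistics(text):
    """A simple per-row signal: character and word counts."""
    words = re.findall(r"\w+", text)
    return {"num_chars": len(text), "num_words": len(words)}


def shingles(text, k=3):
    """Set of k-word shingles, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicate_pairs(rows, threshold=0.5):
    """Index pairs whose shingle sets overlap by at least `threshold` (quadratic scan)."""
    sets = [shingles(r) for r in rows]
    return [
        (i, j)
        for i in range(len(rows))
        for j in range(i + 1, len(rows))
        if jaccard(sets[i], sets[j]) >= threshold
    ]


rows = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog!",
    "An entirely different sentence about datasets",
]

stats = [text_statistics(r) for r in rows]   # one signal value per row
pairs = near_duplicate_pairs(rows)           # rows 0 and 1 differ only by punctuation
print(stats[0], pairs)
```

The pairwise scan is fine for a handful of rows; at dataset scale a sketching scheme such as MinHash would be the usual substitute.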

Maintenance & Community

  • Active development with contributions from Databricks and Cohere.
  • Community support via Discord.
  • GitHub issues for bugs and feature requests.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Local clustering and embedding can be slow without a powerful GPU.
  • Lilac Garden is a hosted platform with potential costs and data privacy considerations for sensitive datasets.

Health Check

  • Last commit: 1 year ago.
  • Responsiveness: 1 day.
  • Pull requests (30d): 0.
  • Issues (30d): 0.
  • Star history: 21 stars in the last 90 days.
