lilac by databricks

Data exploration tool for LLM dataset curation and quality control

Created 2 years ago
1,055 stars

Top 35.8% on SourcePulse

Project Summary

Lilac is an open-source tool designed for the exploration, curation, and quality control of datasets used in training, fine-tuning, and monitoring Large Language Models (LLMs). It targets data scientists, ML engineers, and researchers who need to improve the quality and reduce the cost of LLM data. Lilac offers interactive visualization, LLM-powered search, filtering, clustering, and annotation capabilities, running locally with a UI and Python API.

How It Works

Lilac leverages LLMs for advanced data analysis and manipulation. It allows users to compute various "signals" on dataset columns, such as language detection, PII identification, near-duplicate detection, and text statistics. Embeddings can be computed for semantic search, and a novel "concept search" feature allows for more controlled retrieval based on user-defined positive and negative examples. For computationally intensive tasks like clustering or embedding, Lilac offers an optional offload to its hosted platform, Lilac Garden, for significant speedups.
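
The idea behind concept search (ranking rows by how close they are to user-supplied positive examples and how far from negative ones) can be illustrated with a small sketch. This is not Lilac's implementation or API: the word-count "embedding", the scoring rule (mean similarity to positives minus mean similarity to negatives), and every function name below are assumptions chosen to keep the example self-contained. A real setup would use a learned embedding model instead of word counts.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts (assumption)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def concept_score(text: str, positives: list[str], negatives: list[str]) -> float:
    """Hypothetical scoring rule: mean similarity to positives minus negatives."""
    vec = embed(text)
    pos = sum(cosine(vec, embed(p)) for p in positives) / len(positives)
    neg = sum(cosine(vec, embed(n)) for n in negatives) / len(negatives)
    return pos - neg


def concept_search(rows: list[str], positives: list[str],
                   negatives: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Return the top_k rows as (score, row) pairs, highest score first."""
    scored = [(concept_score(r, positives, negatives), r) for r in rows]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Example: a "refund request" concept defined by a few examples.
positives = ["please refund my order", "I want my money back"]
negatives = ["where is my order", "track my package"]
rows = [
    "I would like a refund for this order",
    "How do I track the package",
    "The weather is nice today",
]
results = concept_search(rows, positives, negatives)
```

The negative examples are what make the retrieval "more controlled": rows that merely share vocabulary with the positives (such as "order") are pushed down when they also resemble the negatives.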

Quick Start & Requirements

Highlighted Details

  • LLM-powered interactive exploration, filtering, and clustering.
  • Signals for PII detection, language detection, near-duplicates, and text statistics.
  • Semantic and concept-based search capabilities.
  • Offload compute-intensive tasks to Lilac Garden for accelerated processing.
  • Supports loading data from HuggingFace, Parquet, CSV, JSON, LangSmith, and more.
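
To make the notion of a "signal" concrete, here is a minimal sketch of two of the signal types listed above: text statistics and near-duplicate detection. It does not use or reproduce Lilac's API. The function names, the character-shingle Jaccard similarity, and the 0.8 threshold are all illustrative assumptions.

```python
import re


def text_statistics(text: str) -> dict:
    """Hypothetical per-row signal: basic size statistics for one text value."""
    words = re.findall(r"\w+", text)
    return {
        "num_characters": len(text),
        "num_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


def shingles(text: str, k: int = 3) -> set[str]:
    """Set of overlapping k-character substrings after whitespace/case normalization."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicate_pairs(rows: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose shingle sets meet the threshold."""
    sets = [shingles(r) for r in rows]
    return [
        (i, j)
        for i in range(len(rows))
        for j in range(i + 1, len(rows))
        if jaccard(sets[i], sets[j]) >= threshold
    ]


rows = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog!",
    "Completely unrelated sentence about databases.",
]
stats = [text_statistics(r) for r in rows]
duplicates = near_duplicate_pairs(rows)
```

The all-pairs comparison here is quadratic in the number of rows, so it only suits small samples; approaches such as MinHash with locality-sensitive hashing are the usual choice at dataset scale.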

Maintenance & Community

  • Contributions from Databricks and Cohere; note that the Health Check section reports the last commit as 1 year ago.
  • Community support via Discord.
  • GitHub issues for bugs and feature requests.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Local clustering and embedding can be slow without a powerful GPU.
  • Lilac Garden is a hosted platform with potential costs and data privacy considerations for sensitive datasets.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days

Explore Similar Projects

Starred by Dominik Moritz (Research Scientist at Apple; Professor at CMU) and Casey Caruso (Managing Partner of Topology Ventures).

latent-scope by enjalot

0%
726
Scientific tool for latent space investigation
Created 2 years ago
Updated 4 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jiaming Song (Chief Scientist at Luma AI), and 1 more.

Curator by NVIDIA-NeMo

1.3%
1k
Data curation toolkit for LLMs
Created 1 year ago
Updated 1 day ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Anton Troynikov (Cofounder of Chroma), and 44 more.

llama_index by run-llama

0.3%
44k
Data framework for building LLM-powered agents
Created 2 years ago
Updated 18 hours ago