Curator by NVIDIA-NeMo

Data curation toolkit for LLMs

Created 1 year ago
1,147 stars

Top 33.6% on SourcePulse

Project Summary

NeMo Curator is a Python toolkit for scalable, GPU-accelerated data preprocessing and curation for generative AI models. It targets researchers and engineers who build foundation language models or text-to-image models, or who perform domain-adaptive pretraining or fine-tuning. By running on GPUs through Dask and RAPIDS, it speeds up data preparation and helps produce the high-quality datasets that improve model convergence.

How It Works

The framework is modular and pipeline-based: users chain text and image processing modules into a pipeline. Dask provides distributed computing and RAPIDS provides GPU acceleration, so pipelines can handle large datasets efficiently. Key features include GPU-accelerated deduplication (exact, fuzzy, and semantic), language identification, text cleaning, and classification filters (quality, safety, and domain) powered by fastText or Hugging Face models. Modules can be reordered, and pipelines can be scaled across multiple nodes for higher throughput.
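
To make the idea of chaining reorderable modules concrete, here is a minimal pure-Python sketch. It does not use NeMo Curator's API: the stage names (`clean_text`, `min_words`, `exact_dedup`), the `run_pipeline` helper, and the dict-based document schema are all hypothetical, and everything runs on a single CPU core rather than on Dask and RAPIDS.

```python
import hashlib
import unicodedata


def clean_text(docs):
    """Normalize Unicode and collapse runs of whitespace."""
    for doc in docs:
        text = unicodedata.normalize("NFC", doc["text"])
        yield {**doc, "text": " ".join(text.split())}


def min_words(threshold):
    """Build a filter stage that drops documents with fewer than `threshold` words."""
    def stage(docs):
        for doc in docs:
            if len(doc["text"].split()) >= threshold:
                yield doc
    return stage


def exact_dedup(docs):
    """Keep only the first document seen for each distinct text hash."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def run_pipeline(docs, stages):
    """Feed the documents through each stage in the given order."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


# Hypothetical toy corpus: documents 1 and 2 differ only in whitespace.
docs = [
    {"id": 1, "text": "GPU  accelerated   curation of text data"},
    {"id": 2, "text": "GPU accelerated curation of text data"},
    {"id": 3, "text": "too short"},
]

curated = run_pipeline(docs, [clean_text, min_words(4), exact_dedup])
print([d["id"] for d in curated])  # [1]
```

Stage order matters in this sketch: running `exact_dedup` before `clean_text` keeps both documents 1 and 2, because their raw text still differs in whitespace when the hashes are computed.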

Quick Start & Requirements

Highlighted Details

  • GPU-accelerated fuzzy deduplication of 1.96T tokens completed in 0.5 hours on 32 H100 GPUs.
  • Offers multilingual text curation capabilities.
  • Includes modules for PII redaction and downstream-task decontamination.
  • Provides pre-trained classifiers for quality, safety, and domain classification.
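
As an illustration of what fuzzy deduplication means, the sketch below uses MinHash signatures over word shingles, a common technique for estimating Jaccard similarity between documents. This page does not state which algorithm Curator uses, so treat the approach, the function names, and the 0.5 threshold as assumptions. The pairwise comparison here is quadratic and CPU-only; it shows the concept, not a scalable implementation.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in `text`."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity, which MinHash approximates."""
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """Seeded 64-bit hash built from SHA-256 (one 'hash function' per seed)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, record the minimum hash over the set."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def fuzzy_dedup(docs, threshold=0.5):
    """Keep a document only if it is not too similar to any document already kept."""
    kept = []
    for doc in docs:
        sig = minhash_signature(shingles(doc["text"]))
        if all(estimated_similarity(sig, kept_sig) < threshold for _, kept_sig in kept):
            kept.append((doc, sig))
    return [doc for doc, _ in kept]


# Hypothetical toy corpus: documents 1 and 2 differ only in the final word.
fuzzy_docs = [
    {"id": 1, "text": "the quick brown fox jumps over the lazy dog near the quiet river bank every single morning before sunrise"},
    {"id": 2, "text": "the quick brown fox jumps over the lazy dog near the quiet river bank every single morning before sunset"},
    {"id": 3, "text": "distributed data frames let a cluster of workers process partitions in parallel"},
]

print([d["id"] for d in fuzzy_dedup(fuzzy_docs)])
```

Documents 1 and 2 share 16 of their 18 distinct shingles, so their exact Jaccard similarity is about 0.89 and the second is dropped as a near-duplicate, while an exact-hash comparison would have kept both.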

Maintenance & Community

  • Actively developed by NVIDIA.
  • Community contributions are welcomed via CONTRIBUTING.md.
  • Blog posts and tutorials are available for guidance.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • GPU acceleration is optional but highly recommended for performance, and it has specific CUDA and GPU compute capability requirements. A CPU-only installation is available but is significantly slower for large-scale tasks.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull requests (30d): 157
  • Issues (30d): 23
  • Star history: 79 stars in the last 30 days
