Curator by NVIDIA-NeMo

Data curation toolkit for LLMs

created 1 year ago
1,047 stars

Top 36.6% on sourcepulse

Project Summary

NeMo Curator is a Python toolkit for scalable, GPU-accelerated data preprocessing and curation for generative AI models. It targets researchers and engineers who build foundation language models and text-to-image models, or who perform domain-adaptive pretraining and fine-tuning. By leveraging GPUs via Dask and RAPIDS, it significantly speeds up data preparation, enabling the creation of high-quality datasets for improved model convergence.

How It Works

The framework employs a modular, pipeline-based approach, allowing users to chain various text and image processing modules. It utilizes Dask for distributed computing and RAPIDS for GPU acceleration, enabling efficient handling of large datasets. Key features include GPU-accelerated deduplication (exact, fuzzy, semantic), language identification, text cleaning, and various classification filters (quality, safety, domain) powered by fastText or Hugging Face models. Modules can be reordered and scaled across multiple nodes for increased throughput.
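
A minimal pipeline sketch of this chaining style, assuming the nemo_curator API names shown here (Sequential, ScoreFilter, Modify, WordCountFilter, UnicodeReformatter, DocumentDataset, get_client) as they appear in the project's examples; names can shift between releases, so verify against the version you install:

    # Minimal curation pipeline sketch; API names assumed from the
    # project's examples and may differ across nemo-curator releases.
    from nemo_curator import Sequential, ScoreFilter, Modify, get_client
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.filters import WordCountFilter
    from nemo_curator.modifiers import UnicodeReformatter

    client = get_client(cluster_type="cpu")  # Dask client; "gpu" enables RAPIDS
    dataset = DocumentDataset.read_json("books.jsonl")

    pipeline = Sequential([
        ScoreFilter(WordCountFilter(min_words=80)),  # drop very short documents
        Modify(UnicodeReformatter()),                # fix mojibake / unicode noise
    ])

    curated = pipeline(dataset)
    curated.to_json("curated_books.jsonl")           # write curated JSONL output

Because each stage consumes and returns a DocumentDataset, modules can be reordered or swapped without touching the rest of the pipeline.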

Quick Start & Requirements
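
A minimal install sketch, assuming the PyPI package name nemo-curator and the CUDA extra advertised in the project README; both are assumptions and may change between releases, so check the README for current instructions:

    # CPU-only install (assumed PyPI package name: nemo-curator)
    pip install nemo-curator

    # GPU-accelerated install; the cuda12x extra and the NVIDIA index URL
    # are assumptions based on the README and may differ for your release
    pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[cuda12x]"

GPU installs additionally require an NVIDIA GPU of sufficient compute capability and a supported CUDA toolkit (see Limitations & Caveats below).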

Highlighted Details

  • GPU-accelerated fuzzy deduplication of 1.96T tokens completes in 0.5 hours on 32 H100 GPUs (see the sketch after this list).
  • Offers multilingual text curation capabilities.
  • Includes modules for PII redaction and downstream-task decontamination.
  • Provides pre-trained classifiers for quality, safety, and domain classification.
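
A hedged sketch of the fuzzy deduplication module referenced above, assuming the FuzzyDuplicates and FuzzyDuplicatesConfig names and config fields from the project's deduplication docs; a GPU-backed Dask client is expected for realistic corpus sizes:

    # Fuzzy deduplication sketch; class and config field names assumed
    # from the project docs and may vary by release.
    from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
    from nemo_curator.datasets import DocumentDataset

    dataset = DocumentDataset.read_json("corpus/*.jsonl")

    config = FuzzyDuplicatesConfig(
        cache_dir="./fuzzy_dedup_cache",  # intermediate MinHash/LSH results
        id_field="id",                    # unique document identifier column
        text_field="text",                # column containing document text
    )
    duplicates = FuzzyDuplicates(config=config)(dataset)  # duplicate groups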

Maintenance & Community

  • Actively developed by NVIDIA.
  • Community contributions are welcomed via CONTRIBUTING.md.
  • Blog posts and tutorials are available for guidance.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • GPU acceleration is optional but highly recommended for performance; it requires a supported CUDA version and a GPU of sufficient compute capability. A CPU-only installation is available but significantly slower for large-scale tasks (see the sketch below).
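
To make the CPU/GPU trade-off concrete, a one-line sketch of backend selection, assuming the get_client helper from the project's examples:

    from nemo_curator import get_client

    # "gpu" starts a Dask-CUDA cluster (needs NVIDIA GPUs and a supported
    # CUDA toolkit); "cpu" uses a standard Dask cluster, far slower at scale.
    client = get_client(cluster_type="gpu")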
Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 115
  • Issues (30d): 39

Star History

  • 154 stars in the last 90 days

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

Explore Similar Projects

InternEvo by InternLM

1.0% | 402 stars
Lightweight training framework for model pre-training
created 1 year ago, updated 1 week ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 12 more.

DeepSpeed by deepspeedai

0.2% | 40k stars
Deep learning optimization library for distributed training and inference
created 5 years ago, updated 1 day ago