Data curation toolkit for LLMs
NeMo Curator is a Python toolkit for scalable, GPU-accelerated data preprocessing and curation for generative AI models. It targets researchers and engineers who build foundation language models or text-to-image models, or who perform domain-adaptive pretraining or fine-tuning. By running on GPUs through Dask and RAPIDS, it speeds up data preparation and supports building high-quality datasets, which in turn helps model convergence.
How It Works
The framework takes a modular, pipeline-based approach: users chain text and image processing modules into a pipeline. Dask provides distributed computing and RAPIDS provides GPU acceleration, so large datasets can be handled efficiently. Key features include GPU-accelerated deduplication (exact, fuzzy, and semantic), language identification, text cleaning, and classification filters (quality, safety, domain) powered by fastText or Hugging Face models. Modules can be reordered and scaled across multiple nodes for increased throughput.
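The idea of chaining reorderable modules can be illustrated with a minimal, CPU-only sketch in plain Python. It does not use NeMo Curator's API; the names `clean_text`, `min_words`, `exact_dedup`, and `run_pipeline`, and the dict schema with a `text` field, are assumptions made up for this illustration. Exact deduplication is approximated here by hashing each document's text with SHA-256.

```python
import hashlib
import re


def clean_text(docs):
    """Collapse runs of whitespace and strip leading/trailing spaces."""
    return [{**d, "text": re.sub(r"\s+", " ", d["text"]).strip()} for d in docs]


def min_words(n):
    """Build a filter stage that keeps documents with at least n words."""
    def stage(docs):
        return [d for d in docs if len(d["text"].split()) >= n]
    return stage


def exact_dedup(docs):
    """Keep only the first document seen for each distinct text hash."""
    seen = set()
    kept = []
    for d in docs:
        digest = hashlib.sha256(d["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(d)
    return kept


def run_pipeline(docs, stages):
    """Apply each stage in order; every stage maps a list of docs to a list of docs."""
    for stage in stages:
        docs = stage(docs)
    return docs


# Hypothetical sample data: documents 1 and 2 differ only in whitespace.
docs = [
    {"id": 1, "text": "GPU  accelerated   curation of text"},
    {"id": 2, "text": "GPU accelerated curation of text"},
    {"id": 3, "text": "too short"},
]

# Cleaning before deduplication lets the two near-identical documents collapse.
result = run_pipeline(docs, [clean_text, min_words(3), exact_dedup])

# Reordering the same stages changes the outcome: deduplicating first
# sees the raw texts as distinct, so both survive.
reordered = run_pipeline(docs, [exact_dedup, clean_text, min_words(3)])

print([d["id"] for d in result])
print([d["id"] for d in reordered])
```

Because every stage has the same list-in, list-out shape, stages can be swapped or reordered freely, and the order affects the result. In a distributed setting the same pattern would apply per data partition rather than to an in-memory list.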
Quick Start & Requirements
pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[all]"
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats