Curator by NVIDIA-NeMo

Data curation toolkit for LLMs

Created 1 year ago

1,338 stars

Top 29.8% on SourcePulse

4 Experts Love This Project

hammer

Jeff Hammerbacher

Cofounder of Cloudera

shizhediao

Author of LMFlow; Research Scientist at NVIDIA

jiamings

Chief Scientist at Luma AI

nikiparmar

Cofounder and former CTO of Adept AI

Project Summary

NeMo Curator is a Python toolkit for scalable, GPU-accelerated data preprocessing and curation for generative AI models, targeting researchers and engineers building foundation language models, text-to-image models, and performing domain-adaptive pretraining or fine-tuning. It significantly speeds up data preparation by leveraging GPUs via Dask and RAPIDS, enabling the creation of high-quality datasets for improved model convergence.

How It Works

The framework employs a modular, pipeline-based approach, allowing users to chain various text and image processing modules. It utilizes Dask for distributed computing and RAPIDS for GPU acceleration, enabling efficient handling of large datasets. Key features include GPU-accelerated deduplication (exact, fuzzy, semantic), language identification, text cleaning, and various classification filters (quality, safety, domain) powered by fastText or Hugging Face models. Modules can be reordered and scaled across multiple nodes for increased throughput.

Quick Start & Requirements

Install: pip install --extra-index-url https://pypi.nvidia.com nemo-curator[all]
Prerequisites: Python 3.10+, packaging >= 22.0, Ubuntu 22.04/20.04, NVIDIA GPU (Volta/compute capability 7.0+), CUDA 12+.
Resources: Installation via PyPI, source, or NeMo Framework Container. Nightly builds for RAPIDS are available.
Docs: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/data_curation/intro.html
Examples: https://github.com/NVIDIA/NeMo-Curator/tree/main/examples

Highlighted Details

GPU-accelerated fuzzy deduplication on 1.96T tokens completed in 0.5 hours with 32x H100 GPUs.
Offers multilingual text curation capabilities.
Includes modules for PII redaction and downstream-task decontamination.
Provides pre-trained classifiers for quality, safety, and domain classification.

Maintenance & Community

Actively developed by NVIDIA.
Community contributions are welcomed via CONTRIBUTING.md.
Blog posts and tutorials are available for guidance.

Licensing & Compatibility

Apache 2.0 License.
Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

While optional, GPU acceleration is highly recommended for performance, with specific CUDA and GPU compute capability requirements. CPU-only installation is available but significantly slower for large-scale tasks.

Health Check

Last Commit

21 hours ago

Responsiveness

1 day

Pull Requests (30d)

57

Issues (30d)

16

Star History

81 stars in the last 30 days

Explore Similar Projects

Starred by

Stas Bekman

Stas Bekman(Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake),

Douwe Kiela

Douwe Kiela(Cofounder of Contextual AI), and

1 more.

lens by ContextualAI

Vision-language research paper using LLMs

Created 2 years ago

Updated 5 months ago

Open-Qwen2VL by Victorwz

Multimodal LLM pre-training and fine-tuning

Created 10 months ago

Updated 4 months ago

Starred by

Wing Lian

Wing Lian(Founder of Axolotl AI).

awesome-synthetic-datasets by davanstrien

Curated list of synthetic text/vision datasets and generation tools

Created 1 year ago

Updated 3 days ago

Starred by

Jesse Clark

Jesse Clark(Cofounder of Marqo) and

Teknium

Teknium(Cofounder of Nous Research).

galactic by taylorai

Data cleaning/curation tool for unstructured text datasets

Created 2 years ago

Updated 1 year ago

lmms-finetune by zjysteven

Minimal codebase for finetuning large multimodal models

Created 1 year ago

Updated 1 month ago

Starred by

Ed Huang

Ed Huang(Cofounder of PingCAP),

Jeffrey Morgan

Jeffrey Morgan(Cofounder of Ollama), and

1 more.

onprem by amaiya

Python toolkit for on-premises LLMs applied to private data

Created 2 years ago

Updated 2 days ago

synthetic-data-generator by argilla-io

Synthetic data generator for language models

Created 1 year ago

Updated 3 months ago

Starred by

Tobi Lutke

Tobi Lutke(Cofounder of Shopify),

John Resig

John Resig(Author of jQuery; Chief Software Architect at Khan Academy), and

9 more.

lilac by databricks

Data exploration tool for LLM dataset curation and quality control

Created 2 years ago

Updated 1 year ago

Transformers-for-NLP-and-Computer-Vision-3rd-Edition by Denis2054

Code repo for exploring Generative AI and LLMs

Created 2 years ago

Updated 5 months ago

awesome-ml by underlines

Curated list of LLM/ML/DS resources

Created 5 years ago

Updated 8 months ago

Starred by

Jeff Hammerbacher

Jeff Hammerbacher(Cofounder of Cloudera).

data-prep-kit by data-prep-kit

Data preparation toolkit for GenAI applications

Created 1 year ago

Updated 2 days ago

awesome-LLM-resources by WangRongsheng

LLM resource list (video generation, agents, coding, data, training, inference)

Created 1 year ago

Updated 1 day ago

Feedback? Help us improve.