This repository provides lab assignments for the Introduction to Data-Centric AI course at MIT. It offers practical exercises for students and researchers to explore techniques for improving machine learning model performance by focusing on data quality, labeling, and curation, rather than solely on model architecture.
How It Works
The labs cover a range of data-centric AI methodologies, including comparing data-centric vs. model-centric approaches, identifying label errors using Confident Learning, dataset curation with multiple annotators, data-centric evaluation, handling class imbalance and outliers, active learning for dataset growth, interpretability for data analysis, prompt engineering for LLMs, and data privacy via membership inference attacks. Each lab provides a focused practical application of these concepts.
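As a rough illustration of one listed technique, the sketch below flags likely label errors in the spirit of Confident Learning. It is a simplified NumPy-only version, not the course's lab code or any library's API. The function name `find_label_issues` and the toy data are hypothetical. It assumes integer labels `0..K-1`, out-of-sample predicted probabilities, and at least one example per class.

```python
import numpy as np

def find_label_issues(labels, pred_probs):
    """Return indices of examples whose given label looks wrong (simplified sketch)."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]

    # Per-class threshold: mean predicted probability of class k
    # over the examples that were labeled k.
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(n_classes)]
    )

    # A class is a confident candidate when its probability meets its threshold.
    confident = pred_probs >= thresholds

    # Among confident candidates pick the most probable class; -1 if there is none.
    masked = np.where(confident, pred_probs, -np.inf)
    guess = np.where(confident.any(axis=1), masked.argmax(axis=1), -1)

    # Flag examples confidently assigned to a class other than their given label.
    return np.flatnonzero((guess != -1) & (guess != labels))

# Toy data (hypothetical): the third example is labeled 0 but looks like class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
issues = find_label_issues(labels, pred_probs)
```

Using per-class thresholds rather than a single global cutoff keeps the check from over-flagging classes that the model predicts with systematically lower confidence.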
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The repository is maintained by the instructors of the Introduction to Data-Centric AI course. Contributions are welcomed via issues and pull requests.
Licensing & Compatibility
Licensed under the GNU Affero General Public License v3.0 or later. This is a strong copyleft license: modifications and derivative works must also be made available under the AGPL, including when a modified version is offered to users over a network. The license does not prohibit commercial use, but its source-disclosure requirements generally make it incompatible with distribution as part of closed-source software.
Limitations & Caveats
The labs are educational materials and are not intended as production-ready code. Dependencies and execution environments may need careful setup. The AGPL's obligations on redistribution and modification could be a barrier for some commercial applications.