dcai-lab by dcai-course

DCAI course labs

Created 2 years ago
476 stars

Top 64.2% on SourcePulse

Project Summary

This repository provides lab assignments for the Introduction to Data-Centric AI course at MIT. It offers practical exercises for students and researchers to explore techniques for improving machine learning model performance by focusing on data quality, labeling, and curation, rather than solely on model architecture.

How It Works

The labs cover a range of data-centric AI methodologies, including comparing data-centric vs. model-centric approaches, identifying label errors using Confident Learning, dataset curation with multiple annotators, data-centric evaluation, handling class imbalance and outliers, active learning for dataset growth, interpretability for data analysis, prompt engineering for LLMs, and data privacy via membership inference attacks. Each lab provides a focused practical application of these concepts.
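
To make the label-error idea concrete, here is a minimal sketch in the spirit of Confident Learning. It is a simplified illustration, not the labs' code or any library's API: the function names and the toy data are hypothetical, and the predicted probabilities are assumed to come from a model evaluated out-of-sample. Each class gets a threshold equal to the average confidence the model assigns to that class among examples carrying that label; an example is flagged when the model is confidently pointing at a class other than its given label.

```python
from statistics import mean


def class_thresholds(labels, pred_probs, n_classes):
    """Per-class threshold: mean predicted probability of class k
    over the examples whose given label is k."""
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        # A class with no labeled examples gets a threshold no example can beat easily.
        thresholds.append(mean(probs_k) if probs_k else 1.0)
    return thresholds


def find_label_issues(labels, pred_probs):
    """Return indices of examples whose most confident above-threshold
    class disagrees with the given label (hypothetical helper)."""
    n_classes = len(pred_probs[0])
    thresholds = class_thresholds(labels, pred_probs, n_classes)
    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        likely = max(confident, key=lambda k: p[k])
        if likely != y:
            issues.append(i)
    return issues


# Toy data (made up for illustration): example 2 is labeled 0,
# but the model is confident it belongs to class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
issues = find_label_issues(labels, pred_probs)
print(issues)
```

Using per-class thresholds rather than a single global cutoff lets the check adapt to classes the model is systematically more or less confident about.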

Quick Start & Requirements

  • Labs are designed to be run in a Python environment. Specific instructions for each lab may vary, but generally involve cloning the repository and installing dependencies.
  • Some labs, like Lab 8, are available on Google Colab for easier execution.
  • Prerequisites typically include Python and standard data science libraries (e.g., NumPy, Pandas, Scikit-learn). Specific labs might require additional libraries or pre-trained models.
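
As a quick sanity check before opening a lab, the following sketch reports which of the libraries named above cannot be imported. The package list is an assumption based on the examples given here, not a requirements file from the repository; note that Scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names for the libraries mentioned above (assumed list, adjust per lab).
REQUIRED = ["numpy", "pandas", "sklearn"]


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```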

Highlighted Details

  • Covers the course's data-centric AI curriculum, from label-error detection and dataset curation to active learning and data privacy.
  • Includes practical implementation of concepts like Confident Learning and membership inference attacks.
  • Offers labs focused on modern AI challenges such as prompt engineering for LLMs and data privacy.
  • Provides a 2023 version of the labs for historical comparison or alternative learning paths.
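
The membership inference attacks mentioned above can be illustrated with a loss-threshold attack, one of the simplest variants. This is a sketch under stated assumptions, not the lab's implementation: the confidence values are invented, the function names are hypothetical, and the attacker is assumed to see the probability the model assigns to each example's true label. The intuition is that models tend to have lower loss on examples they were trained on.

```python
import math
from statistics import mean


def true_class_loss(prob_true_class):
    """Cross-entropy loss for one example, given the model's
    probability for the true label (clipped to avoid log(0))."""
    return -math.log(max(prob_true_class, 1e-12))


def predict_membership(losses, threshold):
    """Guess 'training member' when the loss is below the threshold."""
    return [loss < threshold for loss in losses]


# Invented confidences: training members tend to be fit more tightly.
member_conf = [0.99, 0.95, 0.90, 0.97]
nonmember_conf = [0.60, 0.85, 0.40, 0.75]

losses = [true_class_loss(p) for p in member_conf + nonmember_conf]
is_member = [True] * len(member_conf) + [False] * len(nonmember_conf)

# Assumption: the attacker calibrates the threshold as the mean loss
# over a pool of examples with mixed membership.
threshold = mean(losses)
guesses = predict_membership(losses, threshold)
attack_accuracy = sum(g == t for g, t in zip(guesses, is_member)) / len(is_member)
print(f"attack accuracy: {attack_accuracy:.3f}")
```

An accuracy well above 0.5 on a balanced pool suggests the model leaks information about which examples it was trained on.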

Maintenance & Community

The repository is maintained by the instructors of the Introduction to Data-Centric AI course. Contributions are welcomed via issues and pull requests.

Licensing & Compatibility

Licensed under the GNU Affero General Public License v3.0 or later. This is a strong copyleft license: if you distribute modified versions or derivative works, or let users interact with them over a network, you must make the corresponding source available under the AGPL. Commercial use is permitted, but these requirements can make it difficult to combine the code with closed-source software.

Limitations & Caveats

The labs are educational materials and may not be production-ready. Specific dependencies and execution environments might require careful setup. The AGPL license imposes significant obligations on redistribution and modification, which could be a barrier for some commercial applications.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days
