Dataset_Quantization by magic-research

Research paper for dataset quantization, targeting lossless model training with compressed datasets

created 1 year ago
259 stars

Top 98.4% on sourcepulse

Project Summary

This repository provides the official implementation for "Dataset Quantization" (DQ), a method for creating condensed datasets for training machine learning models with state-of-the-art compression ratios. It targets researchers and practitioners in computer vision and natural language processing seeking to reduce dataset size without performance degradation.

How It Works

DQ employs a three-stage process:

  1. Dataset bin generation — submodular functions select non-overlapping data bins.
  2. Bin sampling — a compact dataset subset is drawn from the bins.
  3. Pixel quantization/reconstruction — a Masked Autoencoder (MAE) guided by GradCAM stores only the informative image patches.

This approach allows for significant data reduction while preserving model training performance.
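As an illustration, the first two stages (bin generation and bin sampling) can be sketched with a toy facility-location-style greedy selection. The function name, array shapes, and scoring here are hypothetical simplifications, not the repository's actual API:

```python
import numpy as np

def generate_bins(features, n_bins, bin_size):
    """Toy sketch of DQ-style bin generation: greedily fill
    non-overlapping bins, each maximizing a submodular coverage
    gain (facility location) over sample similarities.
    `features` is an (N, d) array of sample embeddings."""
    n = len(features)
    sims = features @ features.T          # pairwise similarity matrix
    remaining = set(range(n))
    bins = []
    for _ in range(n_bins):
        selected, covered = [], np.zeros(n)
        for _ in range(bin_size):
            # marginal coverage gain of adding each remaining sample
            best, best_gain = None, -np.inf
            for i in remaining:
                gain = np.maximum(covered, sims[i]).sum() - covered.sum()
                if gain > best_gain:
                    best, best_gain = i, gain
            selected.append(best)
            covered = np.maximum(covered, sims[best])
            remaining.discard(best)
        bins.append(selected)
    return bins

# toy data: 20 samples with 8-dim embeddings
rng = np.random.default_rng(0)
bins = generate_bins(rng.standard_normal((20, 8)), n_bins=4, bin_size=5)
subset = [rng.choice(b) for b in bins]  # stage 2: sample one item per bin
```

Stage 3 (MAE/GradCAM patch reconstruction) is omitted here, since it depends on the pretrained MAE checkpoint.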

Quick Start & Requirements

  • Install: Clone the repo, create a conda environment (conda create -n dq python=3.9), activate it (conda activate dq), and install requirements (pip install -r requirements.txt).
  • Prerequisites: Python 3.9, PyTorch, CUDA, the MAE pretrained checkpoint (mae_visualize_vit_large_ganloss.pth), and dataset paths (e.g., ~/data_cifar, ~/data_imagenet). An OpenAI API key is required for language-task embedding generation.
  • Setup: Download MAE model. Data preparation involves placing datasets in specified directories.
  • Links: Official Implementation
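Taken together, a setup session might look like the following. The clone URL is assumed from the project name and the checkpoint must be fetched separately, so verify both against the README:

```shell
# Setup sketch based on the README's described steps; the clone URL
# and paths are assumptions -- check the repository for exact values.
git clone https://github.com/magic-research/Dataset_Quantization.git
cd Dataset_Quantization
conda create -n dq python=3.9 -y
conda activate dq
pip install -r requirements.txt
# Datasets are expected under paths such as ~/data_cifar and
# ~/data_imagenet, and the MAE checkpoint
# mae_visualize_vit_large_ganloss.pth must be downloaded separately.
```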

Highlighted Details

  • Achieves 60% data reduction on ImageNet for classification, segmentation, and detection with no performance drop.
  • Compresses Alpaca's instruction tuning data by 80% for language tasks, yielding negligible performance impact on benchmarks like BBH and MMLU.
  • Utilizes submodular optimization for dataset bin selection and MAE for image reconstruction.
  • Supports both vision and language tasks.

Maintenance & Community

The accompanying paper was published at ICCV 2023. Key contributors include Daquan Zhou, Kai Wang, Jianyang Gu, Xiangyu Peng, Dongze Lian, Yifan Zhang, Yang You, and Jiashi Feng.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The "TODO List" notes that the ImageNet selected indices are still pending release. For language tasks, the README indicates that embeddings are generated via the OpenAI API, which may incur costs and requires managing an API key.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 90 days
