Dataset_Quantization by magic-research

Research paper for dataset quantization, targeting lossless model training with compressed datasets

created 1 year ago
259 stars

Top 98.4% on sourcepulse

Project Summary

This repository provides the official implementation for "Dataset Quantization" (DQ), a method for creating condensed datasets for training machine learning models with state-of-the-art compression ratios. It targets researchers and practitioners in computer vision and natural language processing seeking to reduce dataset size without performance degradation.

How It Works

DQ employs a three-stage process:

  1. Dataset bin generation — submodular functions select non-overlapping data bins.
  2. Bin sampling — a compact dataset subset is drawn from the bins.
  3. Pixel quantization/reconstruction — a Masked Autoencoder (MAE) guided by GradCAM stores only the informative image patches.

This approach allows for significant data reduction while preserving model training performance.
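As an illustration, the first two stages (bin generation and bin sampling) can be sketched with a toy facility-location-style greedy selection. The function name, array shapes, and scoring here are hypothetical simplifications, not the repository's actual API:

```python
import numpy as np

def generate_bins(features, n_bins, bin_size):
    """Toy sketch of DQ-style bin generation: greedily fill
    non-overlapping bins, each maximizing a submodular coverage
    gain (facility location) over sample similarities.
    `features` is an (N, d) array of sample embeddings."""
    n = len(features)
    sims = features @ features.T          # pairwise similarity matrix
    remaining = set(range(n))
    bins = []
    for _ in range(n_bins):
        selected, covered = [], np.zeros(n)
        for _ in range(bin_size):
            # marginal coverage gain of adding each remaining sample
            best, best_gain = None, -np.inf
            for i in remaining:
                gain = np.maximum(covered, sims[i]).sum() - covered.sum()
                if gain > best_gain:
                    best, best_gain = i, gain
            selected.append(best)
            covered = np.maximum(covered, sims[best])
            remaining.discard(best)
        bins.append(selected)
    return bins

# toy data: 20 samples with 8-dim embeddings
rng = np.random.default_rng(0)
bins = generate_bins(rng.standard_normal((20, 8)), n_bins=4, bin_size=5)
subset = [rng.choice(b) for b in bins]  # stage 2: sample one item per bin
```

Stage 3 (MAE/GradCAM patch reconstruction) is omitted here, since it depends on the pretrained MAE checkpoint.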

Quick Start & Requirements

  • Install: Clone the repo, create a conda environment (conda create -n dq python=3.9), activate it (conda activate dq), and install requirements (pip install -r requirements.txt).
  • Prerequisites: Python 3.9, PyTorch, CUDA, the MAE pretrained checkpoint (mae_visualize_vit_large_ganloss.pth), and dataset paths (e.g., ~/data_cifar, ~/data_imagenet). An OpenAI API key is required for language-task embedding generation.
  • Setup: Download MAE model. Data preparation involves placing datasets in specified directories.
  • Links: Official Implementation
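Taken together, a setup session might look like the following. The clone URL is assumed from the project name and the checkpoint must be fetched separately, so verify both against the README:

```shell
# Setup sketch based on the README's described steps; the clone URL
# and paths are assumptions -- check the repository for exact values.
git clone https://github.com/magic-research/Dataset_Quantization.git
cd Dataset_Quantization
conda create -n dq python=3.9 -y
conda activate dq
pip install -r requirements.txt
# Datasets are expected under paths such as ~/data_cifar and
# ~/data_imagenet, and the MAE checkpoint
# mae_visualize_vit_large_ganloss.pth must be downloaded separately.
```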

Highlighted Details

  • Achieves 60% data reduction on ImageNet for classification, segmentation, and detection with no performance drop.
  • Compresses Alpaca's instruction tuning data by 80% for language tasks, yielding negligible performance impact on benchmarks like BBH and MMLU.
  • Utilizes submodular optimization for dataset bin selection and MAE for image reconstruction.
  • Supports both vision and language tasks.

Maintenance & Community

The accompanying paper was published at ICCV 2023. Key contributors include Daquan Zhou, Kai Wang, Jianyang Gu, Xiangyu Peng, Dongze Lian, Yifan Zhang, Yang You, and Jiashi Feng.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The "TODO List" notes that the ImageNet selected indices are still pending release. For language tasks, the README indicates that embeddings are generated via the OpenAI API, which may incur costs and requires managing an API key.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 90 days
