Research code for dataset quantization, targeting lossless model training with compressed datasets
This repository provides the official implementation for "Dataset Quantization" (DQ), a method for creating condensed datasets for training machine learning models with state-of-the-art compression ratios. It targets researchers and practitioners in computer vision and natural language processing seeking to reduce dataset size without performance degradation.
How It Works
DQ employs a three-stage process: dataset bin generation using submodular functions to select non-overlapping data bins, bin sampling to create a compact dataset subset, and pixel quantization/reconstruction using a Masked Autoencoder (MAE) and GradCAM to store only informative image patches. This approach allows for significant data reduction while preserving model training performance.
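The first stage can be illustrated with a greedy, facility-location-style selector that carves the dataset into disjoint bins. This is only a sketch of the general idea under assumed names: the function `generate_bins`, the cosine-similarity measure, and the greedy gain are illustrative stand-ins, not the repository's actual submodular formulation or API.

```python
# Illustrative sketch of DQ's first stage: greedily build non-overlapping
# bins, where each bin maximizes a facility-location-style coverage gain
# over the remaining pool. All names here are assumptions for illustration.
import numpy as np

def generate_bins(features: np.ndarray, n_bins: int, bin_size: int) -> list[list[int]]:
    """Greedily select `n_bins` disjoint bins of `bin_size` samples each."""
    # Cosine similarity between all samples.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    remaining = set(range(len(features)))
    bins: list[list[int]] = []
    for _ in range(n_bins):
        selected: list[int] = []
        # Best coverage achieved so far for every sample.
        cover = np.zeros(len(features))
        for _ in range(bin_size):
            pool = list(remaining - set(selected))
            # Marginal facility-location gain of adding each candidate.
            gains = [np.maximum(sim[j], cover)[pool].sum() - cover[pool].sum()
                     for j in pool]
            best = pool[int(np.argmax(gains))]
            selected.append(best)
            cover = np.maximum(cover, sim[best])
        bins.append(selected)
        remaining -= set(selected)  # bins never overlap
    return bins
```

Because each bin is removed from the pool before the next is built, later bins cover progressively less central samples, which is what makes subsequent uniform bin sampling yield a diverse subset.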
Quick Start & Requirements
Create a conda environment (conda create -n dq python=3.9), activate it (conda activate dq), and install the dependencies (pip install -r requirements.txt). Download the pretrained MAE checkpoint (mae_visualize_vit_large_ganloss.pth) and set the dataset paths (e.g., ~/data_cifar, ~/data_imagenet). An OpenAI API key is needed for embedding generation in language tasks.
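The pixel quantization stage keeps only the most informative patches of each image and relies on the pretrained MAE to reconstruct the dropped ones at training time. The sketch below shows just the patch-dropping side in NumPy; patch variance is used as a stand-in saliency score for GradCAM, and the function name and parameters are illustrative, not the repository's API.

```python
# Illustrative sketch: keep the top-scoring patches of an image and zero
# the rest. DQ would score patches with GradCAM and reconstruct the
# dropped ones with a pretrained MAE; variance is a stand-in score here.
import numpy as np

def keep_informative_patches(img: np.ndarray, patch: int, keep_ratio: float):
    """Zero out all but the highest-saliency patches.

    `img` is (H, W, C) with H and W divisible by `patch`. Returns the
    masked image and the boolean keep-mask over the patch grid.
    """
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    # Rearrange into a (gh, gw, patch, patch, C) grid of patches.
    grid = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    scores = grid.reshape(gh * gw, -1).var(axis=1)  # stand-in saliency
    n_keep = max(1, int(round(keep_ratio * gh * gw)))
    keep = np.argsort(scores)[::-1][:n_keep]
    mask = np.zeros(gh * gw, dtype=bool)
    mask[keep] = True
    grid = grid * mask.reshape(gh, gw, 1, 1, 1)  # drop uninformative patches
    out = grid.transpose(0, 2, 1, 3, 4).reshape(h, w, c)
    return out, mask.reshape(gh, gw)
```

Only the kept patches (plus the mask) need to be stored, which is where the additional compression on top of sample selection comes from.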
Maintenance & Community
The project accompanies a paper accepted at ICCV 2023. The authors are Daquan Zhou, Kai Wang, Jianyang Gu, Xiangyu Peng, Dongze Lian, Yifan Zhang, Yang You, and Jiashi Feng.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The "TODO List" notes that the "ImageNet selected indices" are still pending. For language tasks, embeddings are generated via the OpenAI API, which may incur costs and requires API key management.