Dataset Distillation by Matching Training Trajectories
Overview
"Dataset Distillation by Matching Training Trajectories" is a project aimed at making data use in machine learning more efficient. Its core goal is to learn a small set of synthetic images that, when used to train a model, yields test performance comparable to that of a model trained on the much larger, complete real dataset. This is achieved by having the synthetic images reproduce the training dynamics of networks trained on the full real data.
Methodology
The method trains "student" networks on the synthetic dataset and matches their training trajectories to those of "expert" networks trained on real data. Concretely, it penalizes the distance between student and expert network parameters after a fixed number of training steps, and it back-propagates through the student's own update steps to refine the synthetic images.
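The matching objective can be sketched as follows. This is a minimal NumPy illustration, not the project's actual code: the function name and the flattened parameter vectors are hypothetical, and the normalization (dividing by how far the expert itself moved over the matched segment) reflects the paper's description of the loss.

```python
import numpy as np

def trajectory_matching_loss(student_end, expert_start, expert_end):
    """Squared distance between where the student landed after its
    training steps (starting from expert_start) and the expert's later
    checkpoint, normalized by how far the expert moved in that span."""
    num = np.sum((student_end - expert_end) ** 2)
    den = np.sum((expert_start - expert_end) ** 2)
    return num / den

# Toy example with flattened parameter vectors.
expert_start = np.array([0.0, 0.0])
expert_end   = np.array([1.0, 1.0])   # expert after M more epochs
student_end  = np.array([0.9, 1.1])   # student after N synthetic steps
loss = trajectory_matching_loss(student_end, expert_start, expert_end)
```

The normalization keeps the loss meaningful late in training, when the expert's own per-epoch movement shrinks.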
Wearable ImageNet
An interesting application of this dataset distillation method is in creating "tileable textures." Instead of individual images, the synthetic data is formatted as continuous textures covering different classes. These textures are seamless around the edges, making them suitable for diverse applications, such as in designing clothing patterns. This concept is referred to as "Wearable ImageNet."
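The seamless-edge property comes from treating the texture canvas as a torus, so crops that run off one edge wrap around to the other. A minimal sketch of such wrap-around cropping, with hypothetical names and a toy integer canvas standing in for a learned texture:

```python
import numpy as np

def toroidal_crop(texture, top, left, size):
    """Crop a square patch with wrap-around (toroidal) indexing, so
    patches that cross the edge of the canvas remain valid. Training on
    crops like these is what makes a learned texture tile seamlessly."""
    rows = np.arange(top, top + size) % texture.shape[0]
    cols = np.arange(left, left + size) % texture.shape[1]
    return texture[np.ix_(rows, cols)]

canvas = np.arange(16).reshape(4, 4)     # stand-in for a learned texture
patch = toroidal_crop(canvas, 3, 3, 2)   # crop crossing both edges
```

Because every wrapped crop must look like a valid class image, the optimized canvas has no visible seam when tiled, which is what makes it usable as a repeating clothing pattern.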
Getting Started
To get started with the mtt-distillation project, users can clone the repository and set up an environment for their hardware using the provided .yaml files. Instructions cover downloading and configuring environments tailored to different NVIDIA GPU architectures. Users with Quadro GPUs may encounter duplication issues, which can be resolved by running training on a single GPU.
Generating Expert Trajectories
Expert trajectories must be generated before dataset distillation can begin. They are produced by training multiple ConvNet models on the CIFAR-100 dataset (with ZCA whitening) for a specified number of epochs. The experts need to be trained only once and can then be reused across distillation experiments.
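The bookkeeping behind an expert trajectory is simply a list of parameter snapshots saved as training proceeds. The sketch below shows the idea on a tiny linear model trained with plain gradient descent; the real project trains ConvNets on CIFAR-100, and the function name here is hypothetical.

```python
import numpy as np

def record_expert_trajectory(X, y, epochs=5, lr=0.1, seed=0):
    """Train a linear 'expert' on (X, y) by gradient descent and save a
    parameter snapshot after every epoch, including the initialization."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    trajectory = [w.copy()]              # snapshot of the initialization
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
        trajectory.append(w.copy())      # snapshot after each epoch
    return trajectory

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
traj = record_expert_trajectory(X, y)
```

Because the snapshots are saved to disk once, any later distillation run can sample a starting checkpoint and a target checkpoint from the same stored trajectory.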
Distillation Process
Using the generated trajectories, users can distill datasets such as CIFAR-100 down to as little as one image per class. Hyperparameters let users tailor the distillation process to their specific needs.
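The outer distillation loop optimizes the synthetic data so that a student step taken on it lands near a later expert checkpoint. A toy sketch with a scalar linear model: here a finite-difference gradient stands in for the back-propagation through student updates that the real method uses, and all names (`student_step`, `match_loss`, the checkpoints `w_start`/`w_target`) are hypothetical.

```python
import numpy as np

def student_step(w, x, y, lr=0.5):
    """One gradient step of a scalar linear student on synthetic data (x, y)."""
    grad = np.mean(2 * x * (w * x - y))
    return w - lr * grad

def match_loss(x, y, w_start, w_target, lr=0.5):
    """Squared distance between the student after one step and the expert checkpoint."""
    return (student_step(w_start, x, y, lr) - w_target) ** 2

# Optimize the synthetic labels y so that a single student step taken
# from w_start reproduces the expert's move to w_target.
w_start, w_target = 0.0, 1.0
x = np.array([1.0, 2.0])          # synthetic inputs (held fixed here)
y = np.array([0.0, 0.0])          # synthetic labels (being learned)
eps, lr_syn = 1e-5, 0.05
for _ in range(200):
    g = np.zeros_like(y)
    for i in range(len(y)):       # finite-difference gradient w.r.t. y
        yp = y.copy(); yp[i] += eps
        g[i] = (match_loss(x, yp, w_start, w_target)
                - match_loss(x, y, w_start, w_target)) / eps
    y -= lr_syn * g
final = match_loss(x, y, w_start, w_target)
```

In the actual project the synthetic images themselves are the optimized variables, the student takes many steps, and automatic differentiation (rather than finite differences) carries the gradient back through those steps.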
ImageNet and Texture Distillation
The distillation technique is versatile enough to handle subsets of the ImageNet dataset, condensing them into small synthetic sets using the procedures outlined above. Users can register new ImageNet subsets and run the same distillation pipeline on them. The project also supports texture distillation: passing the --texture flag during distillation produces toroidal, seamlessly tileable textures.
Acknowledgments
The project acknowledges contributions and feedback from several researchers and is partially supported by various grants, including the NSF Graduate Research Fellowship and corporate grants from J.P. Morgan Chase, IBM, and SAP.
Related Work
The project builds on previous research endeavors in dataset distillation and condensation. Some related works include "Dataset Distillation," "Dataset Condensation with Gradient Matching," and "Dataset Distillation with Infinitely Wide Convolutional Networks," to name a few.
For those interested in further technical details, related publications are available for reference and citation.
In summary, the mtt-distillation project offers a powerful tool for efficiently training models with highly distilled datasets, significantly reducing data requirements while maintaining robust model performance.