FastViT: A Fast Hybrid Vision Transformer
FastViT is a project focused on a fast hybrid vision transformer built around a technique known as structural reparameterization. It was presented by Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan at ICCV 2023.
Overview of FastViT
FastViT is a hybrid vision transformer designed to handle image classification efficiently. The models in this project are trained on the ImageNet-1K dataset and benchmarked on an iPhone 12 Pro using the ModelBench app, giving a strong indication of real-world performance and latency on mobile devices.
Key Features
- Structural Reparameterization: The network is trained with a multi-branch architecture that is folded into a simpler, mathematically equivalent architecture at inference time, reducing latency without altering the network's output (see the sketch after this list).
- Hybrid Vision Transformers: Combines convolutional operations with transformer blocks to improve the accuracy-latency trade-off.
- High Accuracy with Low Latency: Designed to offer high top-1 accuracy while maintaining low latency, making it viable for real-world applications.
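To make this concrete, below is a minimal sketch of the idea, assuming the standard conv-BN folding identity: it merges a Conv2d followed by BatchNorm2d into a single equivalent Conv2d. This is an illustration only, not FastViT's implementation, which lives in models/modules/mobileone.py (used in the Usage section below).

import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Fold the BatchNorm statistics into the conv weights so the
    # two-op training-time branch becomes one inference-time conv.
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    with torch.no_grad():
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

conv = nn.Conv2d(8, 16, 3, padding=1)
bn = nn.BatchNorm2d(16).eval()  # eval mode: BN uses its running statistics
x = torch.randn(1, 8, 32, 32)
fused = fuse_conv_bn(conv, bn)
# The reparameterized conv reproduces conv -> bn exactly
assert torch.allclose(bn(conv(x)), fused(x), atol=1e-5)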
Installation and Setup
To start using FastViT, users can easily set up their environment with the following commands:
conda create -n fastvit python=3.9
conda activate fastvit
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
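A quick sanity check confirms that the pinned versions were installed and that the CUDA toolkit is visible:

# Expect 1.11.0, 0.12.0, and True on a machine with a CUDA-capable GPU
python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"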
Usage
FastViT provides a user-friendly interface for model training, fine-tuning, and inference. For instance, users can create a model and reparameterize it for inference with:
import torch
import models  # registers the fastvit_* variants with timm
from timm.models import create_model
from models.modules.mobileone import reparameterize_model

# Create the model (for training or fine-tuning)
model = create_model("fastvit_t8")

# Load an unfused (training-time) checkpoint
checkpoint = torch.load('/path/to/unfused_checkpoint.pth.tar')
model.load_state_dict(checkpoint['state_dict'])

# Fold the multi-branch blocks into their inference-time form
model.eval()
model_inf = reparameterize_model(model)
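Continuing from the snippet above, the reparameterized model runs like any other timm model. The 256x256 input resolution is an assumption based on the paper's evaluation setting; adjust it to match your checkpoint:

# Dummy forward pass through the reparameterized model
with torch.no_grad():
    x = torch.randn(1, 3, 256, 256)  # assumed input resolution
    logits = model_inf(x)
    top1 = logits.argmax(dim=1)  # predicted ImageNet-1K class index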
Model Variants
FastViT offers a variety of models catering to different levels of accuracy and latency needs:
- FastViT-T8: Offers a top-1 accuracy of 76.2% with a latency of 0.8ms.
- FastViT-T12: Delivers 79.3% accuracy with a 1.2ms latency.
- FastViT-S12, SA12, SA24, SA36, MA36: Progressively larger variants that trade additional latency for higher accuracy, up to the largest model, FastViT-MA36 (all selectable by name, as shown below).
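Each variant is created by name through timm; the identifiers below mirror the list above and are an assumption about how the repository registers them. A quick parameter count makes the size progression visible:

import models  # registers the fastvit_* variants with timm
from timm.models import create_model

# Assumed registered names, matching the variant list above
for name in ["fastvit_t8", "fastvit_t12", "fastvit_s12", "fastvit_sa12",
             "fastvit_sa24", "fastvit_sa36", "fastvit_ma36"]:
    model = create_model(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")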
Model Zoo
This project includes an extensive model zoo of checkpoints trained on ImageNet-1K, available for direct use or for adaptation to downstream tasks such as detection and segmentation. Pre-trained checkpoints and Core ML models are readily available for developers.
Training and Evaluation
FastViT provides detailed commands for training models on the ImageNet-1K dataset, both with standard training and with knowledge distillation. It also supports evaluation against the released checkpoints, so users can verify reported model performance.
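As a rough sketch of what these invocations look like (script names and flags here are assumptions based on timm-style training code; the repository README has the authoritative commands):

# Training sketch: 8-GPU distributed run (assumed timm-style train.py)
python -m torch.distributed.launch --nproc_per_node=8 train.py \
  /path/to/ImageNet/dataset --model fastvit_t8 -b 128 --lr 1e-3

# Evaluation sketch against a released checkpoint (assumed validate.py)
python validate.py /path/to/ImageNet/dataset --model fastvit_t8 \
  --checkpoint /path/to/pretrained_checkpoints/fastvit_t8.pth.tar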
Exporting Models
Users can export models to the Core ML format for deployment on Apple platforms through a simple export command:
python export_model.py --variant fastvit_t8 --output-dir /path/to/save/exported_model \
--checkpoint /path/to/pretrained_checkpoints/fastvit_t8_reparam.pth.tar
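After export, the model can be loaded back with coremltools for a quick inspection; the .mlpackage filename below is hypothetical and depends on what export_model.py actually writes:

import coremltools as ct

# Hypothetical output path; check the export directory for the actual name
mlmodel = ct.models.MLModel("/path/to/save/exported_model/fastvit_t8.mlpackage")
print(mlmodel.get_spec().description)  # input/output names, shapes, and types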
Acknowledgements and Support
FastViT's development builds on several open-source projects, most notably the timm library, along with publicly available datasets; the repository's acknowledgements credit these contributions and point readers to further resources.
In summary, FastViT is a significant advance in the field of vision transformers, providing fast, accurate models suitable for a wide range of applications, particularly those requiring quick inference on mobile devices.