CVNets: A Toolkit for Training Computer Vision Networks
CVNets is a versatile library for training a range of computer vision models, covering both mobile-friendly and conventional architectures. It supports tasks such as image classification, object detection, semantic segmentation, and training foundation models like CLIP.
What's New?
In July 2023, CVNets released version 0.4 with several new features:
- Integration of ByteFormer, a transformer model that operates directly on file bytes.
- Introduction of RangeAugment, an efficient method for online data augmentation.
- Support for training and evaluating foundation models like CLIP.
- Inclusion of popular models such as Mask R-CNN, EfficientNet, Swin Transformer, and Vision Transformer (ViT).
- Enhanced support for model distillation, which transfers knowledge from a larger teacher model to a smaller student model.
Installation
To get started with CVNets, Python 3.10 or later and PyTorch 1.12.0 or newer are recommended. Installation involves setting up a Conda virtual environment. Here's a quick-start guide:
# Clone the repository
git clone [email protected]:apple/ml-cvnets.git
cd ml-cvnets
# Set up a Conda virtual environment
conda create -n cvnets python=3.10.8
conda activate cvnets
# Install required packages and CVNets
pip install -r requirements.txt -c constraints.txt
pip install --editable .
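To verify the installation, a quick sanity check can be run (this assumes the package installs under the top-level name cvnets):
# Optional: confirm that the editable install is importable
python -c "import cvnets; print('cvnets imported successfully')"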
Getting Started
To begin working with CVNets, see the general instructions in the repository's documentation. Additional resources include:
- Examples for training and evaluating models (an illustrative command sketch follows this list).
- Guidelines for converting PyTorch models to CoreML (a conversion sketch is also shown below).
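As an illustration of the training workflow, the sketch below assumes the cvnets-train entry point with its --common.config-file and --common.results-loc options installed by the package; the config path is hypothetical, so substitute a file from the repository's config directory:
# Train a model from a YAML config (config path is illustrative)
cvnets-train --common.config-file config/classification/imagenet/mobilevit.yaml --common.results-loc results/mobilevit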
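For CoreML conversion, the general recipe (shown here with coremltools directly, not CVNets' own conversion scripts) is to trace the PyTorch model and pass it to coremltools; a minimal sketch, using torchvision's MobileNetV2 as a stand-in model:
# Sketch: trace a PyTorch model and convert it to CoreML with coremltools.
# torchvision's mobilenet_v2 stands in for a CVNets model here.
import torch
import coremltools as ct
from torchvision.models import mobilenet_v2

model = mobilenet_v2().eval()
example_input = torch.rand(1, 3, 224, 224)      # NCHW input expected by the model
traced = torch.jit.trace(model, example_input)  # TorchScript module via tracing

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input", shape=example_input.shape)],
    convert_to="mlprogram",
)
mlmodel.save("mobilenet_v2.mlpackage")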
Supported Models and Tasks
CVNets supports a wide range of models and tasks:
ImageNet Classification Models
CVNets is equipped to train various CNNs like MobileNet (v1, v2, v3), EfficientNet, ResNet, and RegNet, as well as Transformers like Vision Transformer, MobileViTv1/v2, and Swin Transformer.
Multimodal Classification
Models such as ByteFormer are supported; by operating directly on file bytes, ByteFormer can handle multiple input modalities (e.g., images and audio) with a single architecture.
Object Detection
Notable models include SSD and Mask R-CNN for localizing and classifying objects within images (Mask R-CNN additionally predicts instance masks).
Semantic Segmentation
Models such as DeepLabv3 and PSPNet are available for semantic segmentation, i.e., assigning a class label to every pixel in an image.
Foundation Models
CVNets incorporates support for widely used foundation models such as CLIP.
Automatic Data Augmentation
CVNets provides tools like RangeAugment, AutoAugment, and RandAugment for enhancing training data.
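As a conceptual illustration (using torchvision's implementations rather than CVNets' internal pipeline, which is configured through YAML files), automatic augmentation policies slot into an ordinary transform chain:
# Conceptual sketch of automatic augmentation with torchvision transforms
# (illustrative only; CVNets wires augmentations such as RangeAugment into
# its data pipeline via config files).
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),  # sample 2 random ops per image
    transforms.ToTensor(),
])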
Distillation
Both soft distillation (training the student to match the teacher's softened output distribution) and hard distillation (using the teacher's predicted labels as training targets) are supported.
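A minimal sketch of the two variants in plain PyTorch (illustrative only, not CVNets' internal loss classes): soft distillation matches the teacher's temperature-softened output distribution with a KL term, while hard distillation treats the teacher's argmax predictions as ordinary classification targets.
# Sketch of soft vs. hard distillation losses in plain PyTorch
# (illustrative only; CVNets configures distillation via its own losses).
import torch
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between temperature-softened distributions, scaled by T^2
    # so gradient magnitudes stay comparable across temperatures.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def hard_distillation_loss(student_logits, teacher_logits):
    # Cross-entropy against the teacher's most likely class.
    teacher_labels = teacher_logits.argmax(dim=-1)
    return F.cross_entropy(student_logits, teacher_labels)

# Usage: random logits stand in for real model outputs (batch of 8, 1000 classes).
student_logits = torch.randn(8, 1000)
teacher_logits = torch.randn(8, 1000)
loss = soft_distillation_loss(student_logits, teacher_logits)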
Maintainers
The CVNets project is primarily maintained by Sachin Mehta, Maxwell Horton, Mohammad Sekhavat, and Yanzi Jin, with early contributions from Farzad Abdolhosseini.
Research Effort at Apple Using CVNets
Research at Apple leveraging CVNets has been showcased in various publications, including:
- MobileViT and MobileViTv2, light-weight and mobile-friendly vision transformers.
- The CVNets library itself, presented at ACM Multimedia 2022.
- RangeAugment, an efficient method for online data augmentation.
- ByteFormer, a transformer architecture that operates directly on file bytes.
Contributing to CVNets
Community participation is encouraged, with detailed contribution guidelines available in the contributing document. Adherence to the project's Code of Conduct is expected.
License and Citation
CVNets is open source; license details are available in the repository's LICENSE file. If CVNets contributes to your work or research, please consider citing the following papers:
@inproceedings{mehta2022mobilevit,
    title     = {MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer},
    author    = {Mehta, Sachin and Rastegari, Mohammad},
    booktitle = {International Conference on Learning Representations},
    year      = {2022}
}
@inproceedings{mehta2022cvnets,
    title     = {CVNets: High Performance Library for Computer Vision},
    author    = {Mehta, Sachin and Abdolhosseini, Farzad and Rastegari, Mohammad},
    booktitle = {Proceedings of the 30th ACM International Conference on Multimedia},
    series    = {MM '22},
    year      = {2022}
}
CVNets stands out as a comprehensive toolkit for advancing research and development in computer vision, providing tools necessary for creating and refining both established and cutting-edge models.