Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
What is Hiera?
Hiera is a hierarchical vision transformer designed to process and analyze images and videos quickly and efficiently. It outperforms previous state-of-the-art models on a range of recognition tasks while remaining simpler and faster, offering strong accuracy without unnecessary architectural complexity.
How Does It Work?
Traditional vision transformers like ViT (Vision Transformer) use the same spatial resolution and number of features throughout the network. This is inefficient: the early layers of the network need less feature capacity, while the later layers do not need full spatial resolution. Hierarchical models such as ResNet have long exploited this by using fewer features in early stages and lower spatial resolution in later stages.
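This trade-off can be sketched as a simple schedule in which each stage halves the spatial resolution while doubling the channel count. The starting numbers below are illustrative, not taken from Hiera's released configurations:

```python
def stage_shapes(h, w, c, num_stages):
    """Spatial size halves each stage while channel count doubles."""
    shapes = []
    for _ in range(num_stages):
        shapes.append((h, w, c))
        h, w, c = h // 2, w // 2, c * 2
    return shapes

# For a 224x224 input tokenized into a 56x56 grid with 96 channels:
print(stage_shapes(56, 56, 96, 4))
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```

Early stages do cheap work at high resolution with few channels; late stages do heavy work at low resolution with many channels.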
Hiera follows this hierarchical approach but drops the extra modules usually added in more complex models like Swin and MViT. Those models chase accuracy by growing increasingly intricate, and while they look good on theoretical FLOP (floating-point operation) counts, the added complexity makes them slower in practice.
Instead of complicating the architecture with added spatial structures, Hiera emphasizes training the model to understand these spatial elements inherently. The adoption of Masked Autoencoding (MAE) simplifies or even eliminates the extra modules many models rely on while simultaneously improving accuracy. The end result is a straightforward yet high-performing transformer, solidifying Hiera's position at the top for various image and video recognition tasks.
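The core idea of masked autoencoding is to hide most patch tokens and train the model to reconstruct them, forcing it to learn spatial structure. Below is a minimal sketch of the random-masking step only, with a toy patch grid and a 75% mask ratio chosen for illustration; it is not Hiera's actual implementation:

```python
import numpy as np

def random_mask(patches, mask_ratio, rng):
    """Keep a random subset of patch tokens; the encoder sees only these."""
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    return patches[keep_idx], keep_idx

rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 768))  # toy 14x14 grid of patch embeddings
visible, keep_idx = random_mask(patches, mask_ratio=0.75, rng=rng)
print(visible.shape)  # (49, 768): only 25% of tokens reach the encoder
```

Because the encoder processes only the visible tokens, pretraining is cheap, and the reconstruction objective teaches the network the spatial reasoning that other architectures hard-code with specialized modules.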
News and Updates
- March 2, 2024: The code license was changed to the more permissive Apache 2.0 License.
- June 12, 2023: Added more ImageNet-1K (in1k) models and some video examples.
- June 1, 2023: The project had its initial release.
For those interested in detailed progress and enhancements, Hiera's changelog is available.
Installation
To install Hiera, ensure you have a recent version of torch, then proceed to install via pip:
pip install hiera-transformer
For those interested in development, or who need to use newer timm versions, consider installing from source:
git clone https://github.com/facebookresearch/hiera.git
cd hiera
python setup.py build develop
Model Zoo
The Hiera project provides various models that you can easily access through Torch Hub and Hugging Face Hub, even without the hiera-transformer
package installed. For instance, you can initialize a base model pre-trained and fine-tuned on ImageNet-1K using:
import torch
model = torch.hub.load("facebookresearch/hiera", model="hiera_base_224", pretrained=True, checkpoint="mae_in1k_ft_in1k")
Hiera's models are divided into image and video models, each tailored for their respective tasks, with straightforward installation instructions and benchmarks provided.
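Once a model is loaded, running an image through it is a standard forward pass. The sketch below prepares a normalized 224x224 batch; the normalization constants are the usual ImageNet mean/std, which is an assumption here — check the repo's examples for the exact preprocessing Hiera expects:

```python
import torch

# Usual ImageNet normalization constants (an assumption; verify against
# the repo's inference examples).
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

image = torch.rand(3, 224, 224)           # placeholder for a decoded RGB image
batch = ((image - mean) / std).unsqueeze(0)
print(batch.shape)                        # torch.Size([1, 3, 224, 224])

# With a model loaded via torch.hub as shown earlier:
# with torch.no_grad():
#     logits = model(batch)               # (1, 1000) class logits for in1k
```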
Using Hiera
Hiera is ready for inference on both images and videos, and the repository provides examples for easy integration. Use Hiera for efficient, accurate processing of visual data, benefiting from both its streamlined architecture and its MAE-based training approach.
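For video, models generally take a clip tensor shaped (batch, channels, frames, height, width); 16 frames at 224x224 is a common setting, though this and the model/checkpoint names in the comment are assumptions to verify against the repo's video examples:

```python
import torch

# A dummy clip in the (batch, channels, frames, height, width) layout
# commonly used by PyTorch video models.
clip = torch.randn(1, 3, 16, 224, 224)
print(clip.shape)  # torch.Size([1, 3, 16, 224, 224])

# Loading a video model would follow the same torch.hub pattern as images,
# e.g. (names are assumptions; see the repo's model zoo):
# video_model = torch.hub.load("facebookresearch/hiera",
#                              model="hiera_base_16x224",
#                              pretrained=True, checkpoint="mae_k400_ft_k400")
```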
Future Prospects
The team is actively working on adding training scripts to the repository, promising continuous improvement and accessibility for users and developers alike.
For those using Hiera in research or commercial projects, please make sure to cite the work appropriately, as outlined in the Citation section of the project documentation.
Licensing
Hiera's code and model weights carry distinct licensing terms. The code is under the Apache License 2.0, while model weights are shared under the Creative Commons Attribution-NonCommercial 4.0 International License. More details are available in the project's LICENSE documentation.
Through its efficient, simplified design, Hiera opens up broad possibilities in image and video processing, making it a valuable tool for both academia and industry.