Vision-RWKV: An Overview
Vision-RWKV (VRWKV) is a visual perception backbone built on an RWKV-like architecture, which is known for its efficiency and scalability. The project is open source, developed and documented by OpenGVLab.
Recent Developments
- In April 2024, Vision-RWKV added support for RWKV6-based models in its classification tasks.
- In March 2024, the Vision-RWKV code and pre-trained models were released for developers and researchers.
Key Features
High-Resolution Efficiency: Vision-RWKV processes high-resolution images efficiently. Its token mixing has linear complexity yet gives every token a global receptive field, so the model handles large inputs without the quadratic cost of full self-attention (a conceptual sketch follows this feature list).
Scalability: Pre-training on large-scale datasets keeps the architecture stable as it scales up, allowing larger variants to be trained effectively.
Superior Performance: In image classification, Vision-RWKV surpasses window-based ViTs and matches global-attention ViTs, while requiring fewer FLOPs and running faster on dense-prediction tasks.
Efficient Alternative: Vision-RWKV is a viable, more efficient alternative to Vision Transformers, making it a strong backbone for a broad range of vision applications.
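To make the global-receptive-field claim concrete, here is a minimal PyTorch sketch of bidirectional, distance-decayed token mixing in the spirit of the paper's Bi-WKV operator. The function name, the exact decay form, and the softmax-style normalization are illustrative assumptions; the released model computes an equivalent quantity with linear-complexity recurrences and a fused kernel, not this naive O(T²) loop.

```python
import torch

def bi_wkv_naive(k, v, w, u):
    """Naive O(T^2) reference for bidirectional, distance-decayed mixing.

    k, v: (T, C) keys/values for T flattened patch tokens.
    w, u: (C,) per-channel decay and current-token bonus (names follow
    the RWKV convention; the exact form here is a sketch, not the
    repository's implementation).
    """
    T, _ = k.shape
    out = torch.empty_like(v)
    pos = torch.arange(T, dtype=k.dtype)
    for t in range(T):
        dist = (pos - t).abs()                           # distance to each token
        logits = (-(dist - 1) / T).unsqueeze(1) * w + k  # decay-weighted keys, (T, C)
        logits[t] = u + k[t]                             # bonus for the token itself
        weights = torch.softmax(logits, dim=0)           # normalize over all tokens
        out[t] = (weights * v).sum(dim=0)                # every token contributes
    return out

# Every output token aggregates information from all T tokens, i.e. a global
# receptive field, which is what lets the model scale to high resolutions.
tokens = torch.randn(196, 64)                            # e.g. 14x14 patch tokens
mixed = bi_wkv_naive(tokens, torch.randn(196, 64),
                     w=torch.ones(64), u=torch.zeros(64))
print(mixed.shape)  # torch.Size([196, 64])
```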
Model Information
The Vision-RWKV model zoo includes several pre-trained models that highlight its versatility. For example, the VRWKV-L model is pre-trained on ImageNet-22K and fine-tuned on ImageNet-1K.
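A released checkpoint can be inspected with plain PyTorch as a quick sanity check. The filename below is hypothetical; substitute a download from the model zoo table in the repository.

```python
import torch

# Hypothetical local filename; actual checkpoint names and URLs are listed
# in the Vision-RWKV model zoo (OpenGVLab/Vision-RWKV on GitHub).
ckpt = torch.load("vrwkv_l_in22k_to_in1k.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt)  # mm-style checkpoints nest weights here
n_params = sum(p.numel() for p in state.values())
print(f"{len(state)} tensors, {n_params / 1e6:.1f}M parameters")
```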
Image Classification on ImageNet-1K
Various models within the Vision-RWKV family have shown impressive results in image classification:
- VRWKV-T: 6.2 million parameters and 1.2G FLOPs, reaching 75.1% top-1 accuracy.
- VRWKV-S: 23.8 million parameters, reaching 80.1% top-1 accuracy.
- VRWKV-B: 93.7 million parameters, reaching 82.0% top-1 accuracy.
- VRWKV-L: the largest variant at 334.9 million parameters, reaching 86.0% top-1 accuracy.
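Because the classification code builds on MMPretrain (see Acknowledgements), running one of these models on an image plausibly looks like the sketch below. The config and checkpoint paths are assumptions; use the actual files from the repository's classification directory and model zoo.

```python
from mmpretrain import ImageClassificationInferencer

# Illustrative paths; the real config/checkpoint pairs come from the repo.
inferencer = ImageClassificationInferencer(
    model="classification/configs/vrwkv/vrwkv_tiny_in1k.py",  # hypothetical
    pretrained="vrwkv_t_in1k.pth",                            # hypothetical
    device="cuda:0",
)
result = inferencer("demo.jpg")[0]
print(result["pred_class"], f'{result["pred_score"]:.3f}')
```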
Object Detection with Mask R-CNN
Vision-RWKV also serves as a backbone for object detection with Mask R-CNN. Results range from VRWKV-T (41.7 box AP, 38.0 mask AP) to VRWKV-L (50.6 box AP, 44.9 mask AP), showing robust detection capability across scales.
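Since the detection code builds on MMDetection, inference presumably follows the standard MMDetection 3.x API, as in this sketch; the config and checkpoint names are placeholders.

```python
from mmdet.apis import init_detector, inference_detector

# Placeholder paths; see the repository's detection directory for real configs.
model = init_detector("detection/configs/mask_rcnn_vrwkv_t_fpn_1x_coco.py",
                      "mask_rcnn_vrwkv_t.pth", device="cuda:0")
result = inference_detector(model, "demo.jpg")
# MMDetection 3.x returns a DetDataSample with boxes, scores, labels, masks.
print(result.pred_instances.bboxes.shape, result.pred_instances.scores[:3])
```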
Semantic Segmentation with UperNet
In semantic segmentation with UperNet, Vision-RWKV scales well across model sizes: VRWKV-T reaches a mean Intersection over Union (mIoU) of 43.3, and VRWKV-L reaches 53.5 mIoU.
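The segmentation code builds on MMSegmentation, so inference presumably follows its 1.x API, as sketched below with placeholder paths.

```python
from mmseg.apis import init_model, inference_model

# Placeholder paths; the actual UperNet + VRWKV configs live in the repo's
# segmentation directory (MMSegmentation 1.x API assumed).
model = init_model("segmentation/configs/upernet_vrwkv_t_512x512_ade20k.py",
                   "upernet_vrwkv_t.pth", device="cuda:0")
result = inference_model(model, "demo.jpg")
print(result.pred_sem_seg.data.shape)  # (1, H, W) map of class indices
```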
Acknowledgements & License
Vision-RWKV builds on several other open-source projects, including RWKV, MMPretrain, MMDetection, MMSegmentation, ViT-Adapter, and InternImage. It is released under the Apache 2.0 license, keeping it accessible for further research.
Whether for high-resolution image processing, scalable training, or efficient inference across tasks, Vision-RWKV offers a competitive, open-source option for visual perception.