Vision-RWKV: An Overview
Vision-RWKV (VRWKV) is a visual perception backbone built on an RWKV-like architecture, which is known for its efficiency and scalability. The project is open source, developed and documented by OpenGVLab.
Recent Developments
- In April 2024, Vision-RWKV added support for RWKV6-based models in its classification tasks.
- In March 2024, the Vision-RWKV code and pre-trained models were released for developers and researchers.
Key Features
High-Resolution Efficiency: Vision-RWKV processes high-resolution images efficiently. Its token mixing has linear complexity yet gives every token a global receptive field, so the model handles large inputs without the quadratic cost of full self-attention (a conceptual sketch follows this feature list).
Scalability: Pre-training on large-scale datasets keeps the architecture stable as it scales up, allowing larger variants to be trained effectively.
Superior Performance: In image classification, Vision-RWKV surpasses window-based ViTs and matches global-attention ViTs, while requiring fewer FLOPs and running faster on dense-prediction tasks.
Efficient Alternative: Vision-RWKV is a viable, more efficient alternative to Vision Transformers, making it a strong backbone for a broad range of vision applications.
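To make the global-receptive-field claim concrete, here is a minimal PyTorch sketch of bidirectional, distance-decayed token mixing in the spirit of the paper's Bi-WKV operator. The function name, the exact decay form, and the softmax-style normalization are illustrative assumptions; the released model computes an equivalent quantity with linear-complexity recurrences and a fused kernel, not this naive O(T²) loop.

```python
import torch

def bi_wkv_naive(k, v, w, u):
    """Naive O(T^2) reference for bidirectional, distance-decayed mixing.

    k, v: (T, C) keys/values for T flattened patch tokens.
    w, u: (C,) per-channel decay and current-token bonus (names follow
    the RWKV convention; the exact form here is a sketch, not the
    repository's implementation).
    """
    T, _ = k.shape
    out = torch.empty_like(v)
    pos = torch.arange(T, dtype=k.dtype)
    for t in range(T):
        dist = (pos - t).abs()                           # distance to each token
        logits = (-(dist - 1) / T).unsqueeze(1) * w + k  # decay-weighted keys, (T, C)
        logits[t] = u + k[t]                             # bonus for the token itself
        weights = torch.softmax(logits, dim=0)           # normalize over all tokens
        out[t] = (weights * v).sum(dim=0)                # every token contributes
    return out

# Every output token aggregates information from all T tokens, i.e. a global
# receptive field, which is what lets the model scale to high resolutions.
tokens = torch.randn(196, 64)                            # e.g. 14x14 patch tokens
mixed = bi_wkv_naive(tokens, torch.randn(196, 64),
                     w=torch.ones(64), u=torch.zeros(64))
print(mixed.shape)  # torch.Size([196, 64])
```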
Model Information
The Vision-RWKV model zoo includes several pre-trained models that highlight its versatility. For example, the VRWKV-L model is pre-trained on ImageNet-22K and fine-tuned on ImageNet-1K.
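A released checkpoint can be inspected with plain PyTorch as a quick sanity check. The filename below is hypothetical; substitute a download from the model zoo table in the repository.

```python
import torch

# Hypothetical local filename; actual checkpoint names and URLs are listed
# in the Vision-RWKV model zoo (OpenGVLab/Vision-RWKV on GitHub).
ckpt = torch.load("vrwkv_l_in22k_to_in1k.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt)  # mm-style checkpoints nest weights here
n_params = sum(p.numel() for p in state.values())
print(f"{len(state)} tensors, {n_params / 1e6:.1f}M parameters")
```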
Image Classification on ImageNet-1K
Various models within the Vision-RWKV family have shown impressive results in image classification:
- VRWKV-T: 6.2 million parameters and 1.2G FLOPs, reaching 75.1% top-1 accuracy.
- VRWKV-S: 23.8 million parameters, reaching 80.1% top-1 accuracy.
- VRWKV-B: 93.7 million parameters, reaching 82.0% top-1 accuracy.
- VRWKV-L: the largest variant at 334.9 million parameters, reaching 86.0% top-1 accuracy.
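Because the classification code builds on MMPretrain (see Acknowledgements), running one of these models on an image plausibly looks like the sketch below. The config and checkpoint paths are assumptions; use the actual files from the repository's classification directory and model zoo.

```python
from mmpretrain import ImageClassificationInferencer

# Illustrative paths; the real config/checkpoint pairs come from the repo.
inferencer = ImageClassificationInferencer(
    model="classification/configs/vrwkv/vrwkv_tiny_in1k.py",  # hypothetical
    pretrained="vrwkv_t_in1k.pth",                            # hypothetical
    device="cuda:0",
)
result = inferencer("demo.jpg")[0]
print(result["pred_class"], f'{result["pred_score"]:.3f}')
```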
Object Detection with Mask R-CNN
Vision-RWKV also serves as a backbone for object detection with Mask R-CNN. Results range from VRWKV-T (41.7 box AP, 38.0 mask AP) to VRWKV-L (50.6 box AP, 44.9 mask AP), showing robust detection capability across scales.
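Since the detection code builds on MMDetection, inference presumably follows the standard MMDetection 3.x API, as in this sketch; the config and checkpoint names are placeholders.

```python
from mmdet.apis import init_detector, inference_detector

# Placeholder paths; see the repository's detection directory for real configs.
model = init_detector("detection/configs/mask_rcnn_vrwkv_t_fpn_1x_coco.py",
                      "mask_rcnn_vrwkv_t.pth", device="cuda:0")
result = inference_detector(model, "demo.jpg")
# MMDetection 3.x returns a DetDataSample with boxes, scores, labels, masks.
print(result.pred_instances.bboxes.shape, result.pred_instances.scores[:3])
```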
Semantic Segmentation with UperNet
In semantic segmentation with UperNet, Vision-RWKV scales well across model sizes: VRWKV-T reaches a mean Intersection over Union (mIoU) of 43.3, and VRWKV-L reaches 53.5 mIoU.
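The segmentation code builds on MMSegmentation, so inference presumably follows its 1.x API, as sketched below with placeholder paths.

```python
from mmseg.apis import init_model, inference_model

# Placeholder paths; the actual UperNet + VRWKV configs live in the repo's
# segmentation directory (MMSegmentation 1.x API assumed).
model = init_model("segmentation/configs/upernet_vrwkv_t_512x512_ade20k.py",
                   "upernet_vrwkv_t.pth", device="cuda:0")
result = inference_model(model, "demo.jpg")
print(result.pred_sem_seg.data.shape)  # (1, H, W) map of class indices
```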
Acknowledgements & License
Vision-RWKV builds on several other open-source projects, including RWKV, MMPretrain, MMDetection, MMSegmentation, ViT-Adapter, and InternImage. It is released under the Apache 2.0 license, keeping it accessible for further research.
Whether for high-resolution image processing, scalable training, or efficient inference across tasks, Vision-RWKV offers a competitive, open-source option for visual perception.